Most recent Visual Place Recognition (VPR) methods use models based on or involving self-attention to extract basic image features, and then apply specific feature fusion algorithms to obtain highly robust descriptors from them. However, these methods seldom pay attention to the strongly discriminative visual attributes of the image features themselves. Inspired by the observation that the basic features extracted by a Transformer model exhibit a high degree of locality abstraction in high-dimensional space, we design a pixel-level Space self-awareness mechanism. Our approach explores the visual attribute correlations between individual pixels and their spatial neighborhoods while preserving their inherent discriminative properties. By enhancing the static, discriminative scene semantics embedded in the feature representations, our work addresses a persistent gap in VPR research: inadequate attention to feature-level visual primitives. Based on Space self-awareness, we construct a simple and efficient token-based feature fusion module, called the Token Module, which aggregates the extracted basic image features into a highly robust global descriptor that carries visually invariant information. Specifically, the Token Module first models the interdependence between each pixel and its surrounding pixels along the spatial dimension. Second, the features containing the enhanced visual information are fused with the original features to retain the intrinsic geometric structure of the input image. Third, a dedicated encoder-decoder unit lets the fused feature map exchange global information along the channel dimension. Finally, a GeM head aggregates the resulting feature maps into a highly robust global descriptor vector.
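The final aggregation step can be illustrated with the standard generalized-mean (GeM) pooling formula, which maps each channel of a feature map to a single value via (mean(x^p))^(1/p). The sketch below is a minimal, dependency-free illustration and not the authors' implementation: the function name `gem_pool`, the nested-list feature layout (channels, then rows, then columns), the default exponent `p=3.0`, and the clamping constant `eps` are all assumptions made for the example.

```python
def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling over the spatial positions of each channel.

    feature_map: list of C channels, each a list of H rows of W floats
                 (hypothetical layout chosen for this sketch).
    p:           pooling exponent; p = 1 gives average pooling, and large p
                 approaches max pooling.
    eps:         lower clamp so that non-positive activations do not break
                 the fractional power.
    Returns a list of C floats, one pooled value per channel.
    """
    descriptor = []
    for channel in feature_map:
        # Flatten the H x W grid and clamp activations to at least eps.
        values = [max(v, eps) for row in channel for v in row]
        # Mean of the p-th powers, followed by the p-th root.
        mean_pow = sum(v ** p for v in values) / len(values)
        descriptor.append(mean_pow ** (1.0 / p))
    return descriptor


if __name__ == "__main__":
    # Toy 2-channel, 2x2 feature map: one flat channel, one with a single peak.
    toy = [
        [[1.0, 1.0], [1.0, 1.0]],
        [[0.0, 0.0], [0.0, 2.0]],
    ]
    for exponent in (1.0, 3.0, 50.0):
        print(exponent, gem_pool(toy, p=exponent))
```

With `p = 1` the peaked channel is pooled to roughly its average, while larger exponents pull the pooled value toward the channel maximum, so salient activations increasingly dominate the descriptor.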
Compared with previous work, our method shows clear performance advantages, which supports the value of mining the strongly discriminative visual information contained in the features themselves. Extensive ablation experiments also show that the Token Module transfers well and can be combined with ViT-like models or most CNN architectures.
Hou et al. (Wed,) studied this question.