Visual Place Recognition (VPR) is essential for robotics and autonomous navigation, yet most methods rely on heavy task-specific training. Existing approaches fall into two main paradigms: single-stage models that learn compact global descriptors, and two-stage pipelines that combine coarse global retrieval with local-feature or geometric verification. While effective, both require large annotated datasets and carefully tuned optimization, which limits scalability and cross-domain reuse. We introduce TF-VPR, a new benchmark that tackles a more challenging setting: VPR performed entirely without additional training, where descriptors are generated, refined, and matched only at test time. Enabled by recent Vision Foundation Models (VFMs), TF-VPR systematically evaluates how far pretrained VFMs can be pushed for place recognition when used as-is, and provides a standardized protocol for fairly comparing arbitrary VFMs without fine-tuning. To support this, we unify major VPR datasets covering diverse real-world conditions and propose two lightweight, training-free modules: Training-Free Graph-Attention Graph Module (TF-GAM) and Training-Free Cross-Attention Module (TF-CAM). These plug-and-play modules enhance descriptor discriminability and retrieval robustness. Experiments show that TF-VPR exposes new challenges and reveals previously unexplored strengths of VFMs for training-free place recognition. Code and datasets are available at https://github.com/ddfs430/TF-VPR.
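To make the "matched only at test time" step concrete, the sketch below shows a generic single-stage training-free retrieval loop: global descriptors (assumed here to come from a frozen VFM, replaced by toy vectors) are L2-normalized, database images are ranked by cosine similarity, and retrieval is scored with Recall@N, the metric commonly used in VPR. This is an illustrative sketch, not the TF-VPR code: the function names `l2_normalize`, `retrieve_top_k`, and `recall_at_n` are hypothetical, the assumption that the benchmark reports exactly this metric is ours, and neither TF-GAM nor TF-CAM is reproduced.

```python
import math
from typing import List, Sequence, Set


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a descriptor to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def retrieve_top_k(query: Sequence[float], database: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return indices of the k database descriptors most similar to the query.

    No parameters are learned: descriptors are compared as-is (hypothetical
    stand-ins for frozen VFM outputs). Ties are broken by lower index.
    """
    q = l2_normalize(query)
    scored = []
    for idx, d in enumerate(database):
        dn = l2_normalize(d)
        scored.append((sum(a * b for a, b in zip(q, dn)), idx))
    # Sort by descending similarity, then ascending index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scored[:k]]


def recall_at_n(queries: Sequence[Sequence[float]],
                database: Sequence[Sequence[float]],
                positives: Sequence[Set[int]],
                n: int) -> float:
    """Fraction of queries with at least one true match among the top-n results."""
    hits = 0
    for qi, q in enumerate(queries):
        top = retrieve_top_k(q, database, n)
        if any(i in positives[qi] for i in top):
            hits += 1
    return hits / len(queries)


if __name__ == "__main__":
    # Toy 2-D descriptors standing in for real VFM features (assumption).
    database = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
    queries = [[0.9, 0.1], [0.1, 0.9]]
    positives = [{0}, {2}]  # ground-truth matching database indices per query
    print(recall_at_n(queries, database, positives, 1))  # 0.5
    print(recall_at_n(queries, database, positives, 2))  # 1.0
```

In the toy run, the second query's nearest neighbour is a non-matching image, so it is only recovered at N = 2. A two-stage pipeline would instead re-rank this top-k shortlist with local-feature or geometric verification before scoring.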
Wang et al. studied this question.