What question did this study set out to answer?

The aim is to enhance the accuracy of camera pose estimation through a new scene coordinate regression method.

March 28, 2026

Enhanced Visual Relocalization: A Cross-Modal Scene Coordinate Regression Approach

Key Points

Aimed to improve the accuracy of camera pose estimation with a new scene coordinate regression method.
Introduced a cross-modal feature detection-based scene coordinate regression network (CFDN).
Incorporated a randomization technique to combine image features with pixel positions, camera intrinsics, and ground-truth poses.
Developed a novel cross-modal feature detection loss with explicit 3D geometric constraints.
Employed contrastive learning alongside traditional 2D reprojection loss for refinement.
Achieved a relocalization accuracy of 97.2% on the indoor 7Scenes dataset.
Reached a relocalization accuracy of 99.9% on the 12Scenes dataset.
Reduced the average median pose error to 17 cm/0.2° on the outdoor Cambridge Landmarks dataset.
Outperformed existing baseline methods without requiring 3D models or depth maps.
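
Figures like those above are usually derived from per-frame pose errors. The sketch below shows one common way to compute a translation error in centimetres, a rotation error in degrees, and the share of frames under a pair of thresholds. It is an illustration only: the function names are hypothetical, and the thresholds behind the reported percentages are not stated in this summary, so `max_cm` and `max_deg` are left as required parameters.

```python
import math

def translation_error_cm(t_est, t_gt):
    """Euclidean distance between two camera positions given in metres."""
    return 100.0 * math.dist(t_est, t_gt)

def rotation_error_deg(R_est, R_gt):
    """Angle of the relative rotation between two 3x3 rotation matrices."""
    # trace(R_est^T @ R_gt) equals the element-wise product summed over all entries.
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def relocalization_accuracy(errors, max_cm, max_deg):
    """Percentage of (cm, deg) error pairs within both thresholds (assumed metric)."""
    hits = sum(1 for t_cm, r_deg in errors if t_cm <= max_cm and r_deg <= max_deg)
    return 100.0 * hits / len(errors)
```

A caller would collect one `(translation_error_cm, rotation_error_deg)` pair per test frame and pass the list to `relocalization_accuracy` together with whatever thresholds the evaluation protocol specifies.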

Abstract

Visual relocalization has applications across many domains. Among relocalization approaches, scene coordinate regression methods are particularly noteworthy because they bypass traditional intermediate steps and estimate the camera pose directly by regressing 2D–3D point correspondences. However, such models rely solely on reprojection constraints and must triangulate points implicitly. Without the guidance of a ground-truth 3D point cloud, their positioning accuracy is compromised. In this study, we address this challenge by incorporating a cross-modal feature detection loss into the network architecture. We introduce the cross-modal feature detection-based scene coordinate regression network (CFDN), a novel network that uses a randomization technique to blend image-derived features with the corresponding pixel positions, camera intrinsics, and ground-truth poses. This integration mitigates correlated gradients and thereby makes training significantly more efficient. The network culminates in a regression layer that maps 2D pixel coordinates to their corresponding 3D scene coordinates with high precision. Notably, we design a novel cross-modal feature detection loss that adds explicit 3D geometric constraints, based on contrastive learning, on top of the 2D reprojection loss to refine the accuracy of scene regression. Empirical results demonstrate that our method achieves state-of-the-art performance. Specifically, CFDN achieves relocalization accuracies of 97.2% and 99.9% on the indoor 7Scenes and 12Scenes datasets, respectively. On the outdoor Cambridge Landmarks dataset, it reduces the average median pose error to 17 cm/0.2°, outperforming existing baselines while maintaining a compact model footprint and without requiring 3D models or depth maps for supervision.
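
The 2D reprojection loss mentioned in the abstract can be illustrated with a short example. This is a minimal sketch of a generic pinhole reprojection error, not CFDN's actual loss: the function names, the world-to-camera pose convention, and the fixed `invalid_penalty` for points behind the camera are all assumptions made for illustration.

```python
import math

def project(point_world, pose_w2c, fx, fy, cx, cy):
    """Project a 3D world point into the image with a pinhole camera.

    pose_w2c is a 3x4 [R | t] world-to-camera matrix as nested lists (assumed convention).
    Returns (u, v, depth); u and v are None when the point is not in front of the camera.
    """
    x, y, z = point_world
    xc, yc, zc = (r[0] * x + r[1] * y + r[2] * z + r[3] for r in pose_w2c)
    if zc <= 0:
        return None, None, zc
    return fx * xc / zc + cx, fy * yc / zc + cy, zc

def reprojection_loss(pixels, scene_coords, pose_w2c, fx, fy, cx, cy,
                      invalid_penalty=100.0):
    """Mean pixel distance between observed pixels and reprojected scene coordinates."""
    total = 0.0
    for (u, v), point in zip(pixels, scene_coords):
        u_hat, v_hat, depth = project(point, pose_w2c, fx, fy, cx, cy)
        if depth <= 0:
            # Hypothetical handling: charge a constant for points behind the camera.
            total += invalid_penalty
        else:
            total += math.hypot(u_hat - u, v_hat - v)
    return total / len(pixels)
```

During training, the predicted scene coordinate for each pixel would be reprojected with the ground-truth pose and intrinsics, so the loss needs no 3D model or depth map. That is also why it constrains the geometry only implicitly, which is the limitation the abstract's additional 3D constraints are meant to address.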
