Visual relocalization is used across many domains. Among existing approaches, scene coordinate regression methods are particularly noteworthy: they bypass traditional intermediate steps and estimate camera pose directly by regressing 2D–3D point correspondences. However, such models rely solely on reprojection constraints and must triangulate points implicitly. Without the guidance of a ground-truth 3D point cloud, their positioning accuracy is compromised. In this study, we address this challenge by incorporating a cross-modal feature detection loss into our network. We introduce the cross-modal feature detection-based scene coordinate regression network (CFDN), a novel network that uses a randomization technique to blend image-derived features with their corresponding pixel positions, camera intrinsics, and ground-truth poses. This integration mitigates correlated gradients and thereby improves training efficiency. The network ends in a regression layer that maps 2D pixel coordinates to their corresponding 3D scene coordinates. The cross-modal feature detection loss adds explicit 3D geometric constraints, based on the idea of contrastive learning, on top of the 2D reprojection loss to refine the accuracy of scene coordinate regression. Empirical results demonstrate that our method achieves state-of-the-art performance. Specifically, CFDN achieves relocalization accuracies of 97.2% and 99.9% on the indoor 7Scenes and 12Scenes data sets, respectively. On the outdoor Cambridge Landmarks data set, it reduces the average median pose error to 17 cm/0.2°, outperforming existing baselines while maintaining a compact model footprint and requiring no 3D models or depth maps for supervision.
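To make the reprojection constraint concrete, the following is a minimal sketch of a per-pixel reprojection error for scene coordinate regression in general. It is not the CFDN implementation: the function name `reprojection_error`, the world-to-camera pose convention (`R`, `t`), and the depth clamp `min_depth` are assumptions chosen for illustration. The sketch projects predicted 3D scene coordinates into the image using the ground-truth pose and camera intrinsics and measures the pixel distance to the positions they were predicted for.

```python
import numpy as np

def reprojection_error(scene_xyz, pixels, K, R, t, min_depth=0.1):
    """Per-pixel reprojection error, in pixels (illustrative sketch).

    scene_xyz : (N, 3) predicted 3D scene coordinates in the world frame
    pixels    : (N, 2) pixel positions the predictions belong to
    K         : (3, 3) pinhole intrinsics with last row [0, 0, 1]
    R, t      : assumed world-to-camera rotation (3, 3) and translation (3,)
    min_depth : hypothetical clamp that avoids division by tiny or negative depth
    """
    cam = scene_xyz @ R.T + t                    # world frame -> camera frame
    depth = np.maximum(cam[:, 2], min_depth)     # clamp depth before dividing
    proj = cam @ K.T                             # apply intrinsics
    uv = proj[:, :2] / depth[:, None]            # perspective division
    return np.linalg.norm(uv - pixels, axis=1)   # Euclidean pixel distance

# Usage: two points in front of an identity-pose camera.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
points = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]])
exact_pixels = np.array([[320.0, 240.0], [570.0, 240.0]])
errors = reprojection_error(points, exact_pixels, K, R, t)
```

A loss of this kind constrains each prediction only along the viewing ray of a single image, which is why depth has to be resolved implicitly across views; an explicit 3D term such as the one described above is meant to supply the missing constraint.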
Ma et al. (Thu,) studied this question.