In this paper, we introduce a new localization pipeline that leverages both hand-crafted and learned features at two levels (point level and object level) while simultaneously recovering scene scale from the RGB input. The pipeline first estimates an initial camera pose by matching learned, globally consistent descriptors and then refines it in a pose optimization phase that combines the different feature types. To generate the learned descriptors, we propose a siamese Globally Consistent Feature Descriptor Network (GCFDNet), which takes a pair of images, Inertial Measurement Unit (IMU) data, and pose sequences as inputs and outputs both the image descriptors and the relative camera pose. GCFDNet has two key strengths. First, its spatial-to-temporal feature fusion module improves relative pose regression while also enabling accurate scene scale estimation. Second, we devise a loss function that balances descriptor similarity and distance, thereby improving the quality of descriptor learning. Using the initial camera poses from GCFDNet, we establish data associations across multiple frames and propose a combined Bundle Adjustment (BA) optimization framework that integrates hand-crafted features, learned descriptors, and semantic objects. We evaluate localization performance on diverse datasets, including EuRoC, ScanNet, 7-Scenes, TUM RGB-D, and Bonn. The results demonstrate state-of-the-art performance in both static and dynamic scenes. Additionally, we present ablation studies on GCFDNet and the combined BA process to further substantiate the efficacy of our approach.
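To make the idea of a loss that balances descriptor similarity and distance concrete, the following is a minimal illustrative sketch and not the GCFDNet loss itself, whose exact form is not given here. It assumes a cosine-similarity term and a contrastive Euclidean-distance term mixed by a scalar weight; the function name `descriptor_loss` and the parameters `margin` and `weight` are hypothetical.

```python
import numpy as np


def descriptor_loss(d_a, d_b, is_match, margin=1.0, weight=0.5):
    """Hypothetical loss mixing a similarity term and a distance term.

    d_a, d_b : (N, D) arrays of paired descriptors.
    is_match : (N,) array, 1 for matching pairs and 0 for non-matching pairs.
    margin   : assumed minimum distance wanted between non-matching pairs.
    weight   : assumed balance between the two terms (0 to 1).
    """
    d_a = np.asarray(d_a, dtype=float)
    d_b = np.asarray(d_b, dtype=float)
    is_match = np.asarray(is_match, dtype=float)

    eps = 1e-12  # guards against division by zero for all-zero descriptors
    norms = np.linalg.norm(d_a, axis=1) * np.linalg.norm(d_b, axis=1)
    cos = np.sum(d_a * d_b, axis=1) / (norms + eps)
    dist = np.linalg.norm(d_a - d_b, axis=1)

    # Similarity term: matching pairs are pulled toward cos = 1,
    # non-matching pairs are penalised only for positive similarity.
    sim_term = is_match * (1.0 - cos) + (1.0 - is_match) * np.maximum(cos, 0.0)

    # Distance term (standard contrastive form): matching pairs are pulled
    # together, non-matching pairs are pushed at least `margin` apart.
    dist_term = (is_match * dist ** 2
                 + (1.0 - is_match) * np.maximum(margin - dist, 0.0) ** 2)

    return float(np.mean(weight * sim_term + (1.0 - weight) * dist_term))
```

Under these assumptions, identical descriptors labelled as a match give a loss near zero, while orthogonal unit descriptors labelled as a match are penalised by both terms. The weight controls which term dominates.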
Wang et al. (Thu,) studied this question.