What question did this study set out to answer?

The aim is to develop a new camera localization pipeline that combines hand-crafted and learned features for accurate pose estimation and scene scale recovery.

June 7, 2026

Visual Camera Localization by Globally Consistent Descriptor Learning and Combined Bundle Adjustment

Key Points

Proposed a Siamese Globally Consistent Feature Descriptor Network (GCFDNet) that outputs image descriptors and the relative camera pose.
Implemented a combined Bundle Adjustment (BA) optimization framework integrating hand-crafted features, learned descriptors, and semantic objects.
Conducted experiments on EuRoC, ScanNet, 7 Scenes, TUM RGB-D, and Bonn.
Achieved state-of-the-art localization performance in both static and dynamic scenes, outperforming existing methods on the evaluated datasets.
GCFDNet's spatial-to-temporal feature fusion module improved relative pose regression and enabled scene scale estimation.

Abstract

In this paper, we introduce a new localization pipeline that leverages both hand-crafted and learned features at two distinct levels (point-level and object-level) while simultaneously recovering scene scale from the RGB input. The pipeline integrates a learned, globally consistent descriptor matching process for initial camera pose estimation, followed by a pose optimization phase that combines the different feature types. To generate learned descriptors, we propose a Siamese Globally Consistent Feature Descriptor Network (GCFDNet), which accepts a pair of images, Inertial Measurement Unit (IMU) data, and pose sequences as inputs, and produces both the image descriptors and the relative camera pose as outputs. The strengths of GCFDNet manifest in two key aspects. First, by incorporating a spatial-to-temporal feature fusion module, GCFDNet enhances relative pose regression while also enabling accurate scene scale estimation. Second, we devise a loss function that balances descriptor similarity and distance, thereby improving the quality of descriptor learning. Using the initial camera poses derived from GCFDNet, we establish data associations across multiple frames and then propose a combined Bundle Adjustment (BA) optimization framework that integrates hand-crafted features, learned descriptors, and semantic objects. To evaluate localization performance, we conduct experiments on diverse datasets, including EuRoC, ScanNet, 7 Scenes, TUM RGB-D, and Bonn. The results demonstrate state-of-the-art performance in both static and dynamic scenes, outperforming existing methods. Additionally, we present ablation studies on GCFDNet and the combined BA process to further substantiate the efficacy of our approach.
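
The abstract does not give the form of the loss that balances descriptor similarity and distance. As a rough illustration only, the sketch below combines a cosine-similarity term with a margin-based Euclidean distance term for a single descriptor pair. The function name `descriptor_pair_loss`, the `margin` and `weight` parameters, and the specific terms are all assumptions made for this example; they are not taken from the paper and should not be read as the GCFDNet loss.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two descriptor vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("descriptors must be non-zero")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def descriptor_pair_loss(desc_a, desc_b, is_match, margin=1.0, weight=0.5):
    """Hypothetical pair loss mixing a similarity term and a distance term.

    `weight` (assumed, in [0, 1]) trades off the two terms;
    `margin` (assumed) is the minimum distance wanted between non-matches.
    """
    sim = cosine_similarity(desc_a, desc_b)
    dist = euclidean_distance(desc_a, desc_b)
    if is_match:
        # Matching pairs: push similarity toward 1 and distance toward 0.
        sim_term = 1.0 - sim
        dist_term = dist ** 2
    else:
        # Non-matching pairs: penalize positive similarity and
        # distances that fall inside the margin.
        sim_term = max(0.0, sim)
        dist_term = max(0.0, margin - dist) ** 2
    return weight * sim_term + (1.0 - weight) * dist_term
```

Under these assumptions, an identical matching pair and a well-separated non-matching pair both incur zero loss, while a non-matching pair with identical descriptors is penalized by both terms.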
