What question did this study set out to answer?

The study asks how to estimate the geographic coordinates of a street-view query more accurately by aligning its features with those of geotagged aerial images.

April 26, 2026 | Open Access

Hierarchical feature alignment for cross-view geo-localization

Key Points

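Proposes GeoAlignNet (GANet), a hierarchical feature learning framework that combines a Spatial Structure Attention (SSA) module with a Local Representation Refinement (LRP) module.
SSA combines window-based attention with adaptive window partitioning and enhanced positional encoding to reduce spatial misalignment caused by strong viewpoint changes.
LRP uses depthwise separable convolutions and a multi-scale gating mechanism to refine fine-grained local features and improve the perception of geometric textures.
A hybrid objective enforces intra-class compactness and inter-class separability in the embedding space to support effective retrieval.
The hybrid objective is also intended to keep optimization stable under extreme cross-domain structural variations.
The authors report that GANet performs competitively against existing cross-view geo-localization methods.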

Abstract

Cross-view geo-localization aims to estimate the geographic coordinates of a street-view query by matching it against a database of geotagged aerial images. However, this task is fundamentally challenged by extreme viewpoint transformations and scale inconsistencies, which hinder the extraction of stable global structures and fine-grained local details. To address these limitations, we propose GeoAlignNet (GANet), a hierarchical feature learning framework designed to achieve robust cross-scale and cross-view representation alignment for retrieval-based geo-localization. GANet comprises two complementary components. First, the Spatial Structure Attention (SSA) module performs structure-aware aggregation by combining window-based attention with adaptive window partitioning and enhanced positional encoding, enabling the network to capture view-invariant spatial layouts and to alleviate the spatial misalignment induced by strong perspective changes. Second, the Local Representation Refinement (LRP) module adopts depthwise separable convolutions and a multi-scale gating mechanism to model fine-grained local features, improving the perception of geometric textures and yielding representations that remain stable under appearance variations and environmental noise. To further improve retrieval discriminability, we adopt a hybrid objective that jointly enforces intra-class compactness and inter-class separability in the embedding space, facilitating stable optimization under extreme cross-domain structural variations. Extensive experiments demonstrate that GANet achieves competitive performance compared with existing cross-view geo-localization methods, highlighting its effectiveness and generalization capability.
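
The abstract does not spell out the retrieval step or the exact form of the hybrid objective, so the following is only a minimal sketch under stated assumptions. It assumes embeddings are compared by cosine similarity, that the compactness term is the mean of `1 - cos` over matching street/aerial pairs, and that the separability term is a margin-based hinge over non-matching pairs. The function names `localize` and `hybrid_loss`, and the `margin` and `weight` parameters, are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def localize(query_emb, gallery):
    """Return the geotag of the aerial embedding most similar to the query.

    gallery is a list of (embedding, (lat, lon)) pairs.
    """
    _, best_tag = max(gallery, key=lambda item: cosine(query_emb, item[0]))
    return best_tag


def hybrid_loss(street_embs, aerial_embs, margin=0.3, weight=1.0):
    """Illustrative hybrid objective (an assumption, not the paper's loss).

    street_embs[i] and aerial_embs[i] depict the same location.
    Compactness pulls matching pairs together; separability pushes
    non-matching pairs at least `margin` below the matching similarity.
    """
    n = len(street_embs)
    compact, separate, pairs = 0.0, 0.0, 0
    for i in range(n):
        pos = cosine(street_embs[i], aerial_embs[i])
        compact += 1.0 - pos
        for j in range(n):
            if j != i:
                neg = cosine(street_embs[i], aerial_embs[j])
                separate += max(0.0, margin + neg - pos)
                pairs += 1
    compact /= n
    separate = separate / pairs if pairs else 0.0
    return compact + weight * separate
```

In this sketch, a street-view embedding is localized by calling `localize(query_emb, gallery)`, which returns the `(lat, lon)` tag of the closest aerial embedding. The loss is zero only when every matching pair is perfectly aligned and every non-matching pair is at least `margin` less similar, which is one simple way to read "intra-class compactness and inter-class separability".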
