Abstract

Cross-view geo-localization estimates the geographic coordinates of a street-view query by matching it against a database of geotagged aerial images. The task is fundamentally challenged by extreme viewpoint transformations and scale inconsistencies, which hinder the extraction of both stable global structures and fine-grained local details. To address these limitations, we propose GeoAlignNet (GANet), a hierarchical feature learning framework designed for robust cross-scale and cross-view representation alignment in retrieval-based geo-localization. GANet comprises two complementary components. First, the Spatial Structure Attention (SSA) module performs structure-aware aggregation by combining window-based attention with adaptive window partitioning and enhanced positional encoding, enabling the network to capture view-invariant spatial layouts and to alleviate the spatial misalignment induced by strong perspective changes. Second, the Local Representation Refinement (LRP) module uses depthwise separable convolutions and a multi-scale gating mechanism to model fine-grained local features, improving the perception of geometric textures and yielding representations that remain stable under appearance variations and environmental noise. To further improve retrieval discriminability, we adopt a hybrid objective that jointly enforces intra-class compactness and inter-class separability in the embedding space, which facilitates stable optimization under extreme cross-domain structural variations. Extensive experiments demonstrate that GANet achieves competitive performance compared with existing cross-view geo-localization methods, indicating its effectiveness and generalization capability.
Su et al. (Fri,) studied this question.
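To make the retrieval formulation in the abstract concrete, the following is a minimal sketch of how a street-view query embedding could be matched against a geotagged aerial database. It is not the GANet implementation: the encoder is omitted entirely, the embeddings are toy vectors, and the use of cosine similarity as the matching score is an assumption, since the abstract does not name the metric. The names `localize`, `cosine_similarity`, and `l2_normalize` are hypothetical helpers introduced only for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
GeoTag = Tuple[float, float]  # (latitude, longitude)


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length so dot products become cosine scores."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two embeddings (assumed matching score)."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def localize(
    query: Vector,
    aerial_embeddings: Sequence[Vector],
    geotags: Sequence[GeoTag],
    top_k: int = 1,
) -> List[Tuple[GeoTag, float]]:
    """Return the geotags of the top_k aerial images most similar to the query."""
    if len(aerial_embeddings) != len(geotags):
        raise ValueError("each aerial embedding needs exactly one geotag")
    scores = [cosine_similarity(query, e) for e in aerial_embeddings]
    # Rank database indices from most to least similar.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(geotags[i], scores[i]) for i in ranked[:top_k]]


# Toy data: three aerial embeddings with made-up geotags (hypothetical values).
aerial = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
tags = [(40.0, -74.0), (34.0, -118.0), (41.9, -87.6)]
street_query = [0.1, 0.9, 0.0]

best_tag, best_score = localize(street_query, aerial, tags)[0]
print(best_tag)  # geotag of the closest aerial image
```

In this formulation, the estimated location of the query is simply the geotag attached to the best-matching aerial image, which is why the quality of the learned embedding space determines localization accuracy.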