Cross-view geo-localization matches a ground-level photo to its corresponding location in large-scale satellite or aerial imagery, a capability useful for autonomous driving, drone navigation, and augmented reality in urban scenes. However, existing deep learning approaches built on hybrid CNN-Transformer architectures and complex geometric submodules incur a large computational overhead, which makes real-time deployment on resource-constrained devices difficult. To obtain a model that is lightweight, fast, and accurate, this paper proposes an efficient state-space model for cross-view geo-localization. The model makes three contributions. First, it replaces the conventional self-attention structure with a state-space vision backbone, lowering sequence-modeling complexity from quadratic to linear in the sequence length and greatly accelerating inference. Second, it devises a channel-group aggregation strategy with no learnable parameters, producing a comprehensive yet lightweight representation. Third, it introduces a dynamic difficulty-aware loss function that assigns varying weights to all negative samples within a batch according to their similarities, which greatly improves the efficiency of hard-negative mining and the quality of convergence. Results on the public benchmark datasets CVUSA and CVACT indicate that the method achieves high accuracy at low computational complexity, offering a feasible path toward lightweight designs for more powerful cross-view geo-localization models.
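The abstract does not give formulas for the aggregation step or the loss, so the following NumPy sketch is only one plausible reading, not the authors' implementation. It assumes that channel-group aggregation means averaging channels within equal-sized groups, and that the difficulty-aware loss is a soft-margin ranking loss whose negatives are weighted by a softmax over their similarities. The function names and the parameters `num_groups`, `alpha`, and `tau` are hypothetical, and the state-space backbone itself is not reproduced.

```python
import numpy as np


def channel_group_aggregate(tokens, num_groups):
    """Parameter-free descriptor from backbone tokens (assumed form).

    tokens: array of shape (L, C), one C-dimensional feature per token.
    Channels are split into `num_groups` equal groups and averaged within
    each group, giving an (L, num_groups) map that is flattened and
    L2-normalised. No learnable weights are involved.
    """
    length, channels = tokens.shape
    if channels % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    grouped = tokens.reshape(length, num_groups, channels // num_groups)
    pooled = grouped.mean(axis=2)            # (L, num_groups)
    descriptor = pooled.reshape(-1)          # (L * num_groups,)
    return descriptor / (np.linalg.norm(descriptor) + 1e-12)


def difficulty_weighted_loss(ground, satellite, alpha=10.0, tau=0.1):
    """Similarity-weighted soft-margin loss over in-batch negatives (assumed form).

    ground, satellite: arrays of shape (B, D) holding L2-normalised
    embeddings, where row i of each array is a matching pair.
    Each negative j for anchor i gets weight softmax_j(sim[i, j] / tau),
    so negatives that look more like the anchor contribute more.
    """
    sim = ground @ satellite.T               # (B, B) cosine similarities
    batch = sim.shape[0]
    if batch < 2:
        raise ValueError("at least two pairs are needed to form negatives")
    positive = np.diag(sim)[:, None]         # (B, 1) matching-pair similarity
    negative_mask = ~np.eye(batch, dtype=bool)

    # Softmax over negatives only; the positive is masked out with -inf.
    logits = np.where(negative_mask, sim / tau, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Soft-margin term log(1 + exp(alpha * (s_neg - s_pos))) per pair.
    per_pair = np.logaddexp(0.0, alpha * (sim - positive))
    return float((weights * per_pair).sum(axis=1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Four hypothetical image pairs, each with 16 tokens of 32 channels.
    ground = np.stack([channel_group_aggregate(rng.normal(size=(16, 32)), 4)
                       for _ in range(4)])
    satellite = np.stack([channel_group_aggregate(rng.normal(size=(16, 32)), 4)
                          for _ in range(4)])
    print(ground.shape, difficulty_weighted_loss(ground, satellite))
```

In this sketch the weights are recomputed from the current similarities at every step, so the emphasis shifts toward whichever negatives are hardest at that point in training. A smaller `tau` concentrates the weight on the few most similar negatives, while a larger `tau` approaches uniform weighting.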
Tao et al. (Mon,) studied this question.