For power-grid applications such as transmission corridor inspection, substation asset inspection, and post-disaster emergency repair, reliable UAV self-localization under GNSS-degraded or GNSS-denied conditions is critical to operational safety and accurate defect geotagging. Because oblique UAV images and nadir satellite images differ substantially in viewpoint, scale, and geometric structure, conventional RGB-based cross-view retrieval methods often suffer from unstable alignment and insufficient geometric modeling, particularly in scenes with repetitive textures and partial overlap. To address these challenges, we propose a cross-view visual geo-localization model that integrates RGB-D multimodal inputs with multi-scale attention enhancement. Specifically, MiDaS is used to estimate relative depth from UAV imagery, and the depth map is concatenated with the RGB channels to form a four-channel input; satellite images are padded with an additional all-zero channel to maintain dimensional consistency. A shared-weight ViTAdapter learns joint semantic–geometric representations, and a lightweight Efficient Multi-scale Attention (EMA) module is applied to the spatial feature maps to strengthen multi-scale spatial consistency. In addition, an IoU-weighted InfoNCE loss is employed during training to accommodate partially overlapping UAV–satellite pairs, thereby improving the robustness of feature alignment. Experiments on the GTA-UAV dataset under the cross-area protocol show stable performance across both retrieval and localization metrics: Recall@1, Recall@5, and Recall@10 reach 18.12%, 38.83%, and 49.47%, respectively; AP is 28.01; SDM@3 is 0.53; and the top-1 geodesic distance error Dis@1 is 1052.73 m. These results indicate that explicit geometric priors combined with multi-scale spatial enhancement can improve cross-view feature alignment, leading to more robust and accurate localization in challenging power inspection scenarios.
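As a rough illustration of two steps described above, the following NumPy sketch builds the four-channel inputs and computes one possible form of an IoU-weighted InfoNCE loss. It is a minimal sketch under stated assumptions, not the model's actual implementation: the function names (`make_uav_input`, `make_sat_input`, `iou_weighted_info_nce`), the min-max depth normalization, the temperature value of 0.07, the UAV-to-satellite direction of the loss, and the choice to weight each positive pair's loss term by its IoU are all illustrative. The depth map is assumed to have been estimated beforehand (for example by MiDaS).

```python
import numpy as np


def make_uav_input(rgb, depth):
    """Stack an RGB image (H, W, 3) and a relative-depth map (H, W) into (H, W, 4)."""
    d = depth.astype(np.float32)
    span = d.max() - d.min()
    # Assumption: rescale relative depth to [0, 1]; a constant map becomes all zeros.
    d = (d - d.min()) / span if span > 0 else np.zeros_like(d)
    return np.concatenate([rgb.astype(np.float32), d[..., None]], axis=-1)


def make_sat_input(rgb):
    """Pad a satellite RGB image (H, W, 3) with an all-zero fourth channel."""
    zeros = np.zeros(rgb.shape[:2] + (1,), dtype=np.float32)
    return np.concatenate([rgb.astype(np.float32), zeros], axis=-1)


def iou_weighted_info_nce(uav_emb, sat_emb, iou, temperature=0.07):
    """UAV-to-satellite InfoNCE in which pair i is weighted by its overlap iou[i].

    uav_emb, sat_emb: (N, D) arrays; row i of each forms a positive pair.
    iou: (N,) overlap ratios in (0, 1] between each UAV view and its satellite tile.
    """
    iou = np.asarray(iou, dtype=np.float64)
    # Cosine similarity between every UAV embedding and every satellite embedding.
    u = uav_emb / np.linalg.norm(uav_emb, axis=1, keepdims=True)
    s = sat_emb / np.linalg.norm(sat_emb, axis=1, keepdims=True)
    logits = u @ s.T / temperature
    # Numerically stable log-softmax over the satellite candidates of each UAV query.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_pair = -np.diag(log_prob)
    # Assumed weighting: low-overlap pairs contribute less to the averaged loss.
    return float((iou * per_pair).sum() / iou.sum())
```

With this weighting, a pair whose UAV view only partially overlaps the satellite tile still acts as a positive but pulls the embeddings together less strongly than a fully overlapping pair, which is the intuition behind accommodating partial matches.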
Wang et al. (Sat,) studied this question.