May 3, 2026 · Open Access

UAV Visual Localization via Multimodal Fusion and Multi-Scale Attention Enhancement

Key Points

This research aims to improve UAV localization in challenging environments where GNSS is unreliable.
Developed a cross-view visual geo-localization model integrating RGBD inputs and multi-scale attention.
Used MiDaS to estimate relative depth from UAV imagery and concatenated it with RGB to form a four-channel input.
Employed IoU-weighted InfoNCE loss to enhance feature alignment during training.
On the GTA-UAV dataset under the cross-area protocol, achieved Recall@1, Recall@5, and Recall@10 scores of 18.12%, 38.83%, and 49.47%, respectively.
Reported an Average Precision (AP) of 28.01 and a top-1 geodesic distance error (Dis@1) of 1052.73 m.
Demonstrated improved robustness and accuracy for UAV localization through geometric and spatial enhancements.
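
The Recall@K figures above count how often a correct satellite tile appears among the top K retrieved candidates for a UAV query. A minimal sketch of that metric follows, assuming each query comes with a ranked list of gallery IDs and a set of ground-truth matches. The function name `recall_at_k` and the toy IDs are hypothetical and are not taken from the paper's code.

```python
def recall_at_k(ranked_ids, positive_ids, k):
    """Percentage of queries with at least one true match in the top-k results.

    ranked_ids:   one list per query, gallery IDs sorted best match first.
    positive_ids: one set per query, the gallery IDs that count as correct.
    """
    hits = sum(
        1
        for ranked, positives in zip(ranked_ids, positive_ids)
        if any(gallery_id in positives for gallery_id in ranked[:k])
    )
    return 100.0 * hits / len(ranked_ids)


# Toy example (made-up IDs): three UAV queries, three ranked satellite tiles each.
ranked = [["s3", "s1", "s7"], ["s2", "s9", "s4"], ["s5", "s6", "s8"]]
positives = [{"s1"}, {"s2"}, {"s0"}]

r_at_1 = recall_at_k(ranked, positives, k=1)  # only the second query hits at rank 1
r_at_3 = recall_at_k(ranked, positives, k=3)  # first and second queries hit within top 3
```

Under this definition Recall@K cannot decrease as K grows, which is consistent with the reported ordering of Recall@1, Recall@5, and Recall@10.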

Abstract

For power-grid applications such as transmission corridor inspection, substation asset inspection, and post-disaster emergency repair, reliable UAV self-localization under GNSS-degraded or GNSS-denied conditions is critical to ensuring operational safety and accurate defect geotagging. Due to substantial discrepancies in viewpoint, scale, and geometric structure between oblique UAV images and nadir satellite images, conventional RGB-based cross-view retrieval methods often suffer from unstable alignment and insufficient geometric modeling, particularly in scenarios with repetitive textures and partial overlap. To address these challenges, we propose a cross-view visual geo-localization model that integrates RGBD multimodal inputs with multi-scale attention enhancement. Specifically, MiDaS is used to estimate relative depth from UAV imagery, which is concatenated with RGB to form a four-channel input, while satellite images are padded with an additional zero channel to maintain dimensional consistency. A shared-weight ViTAdapter is adopted to learn joint semantic–geometric representations, and a lightweight Efficient Multi-scale Attention (EMA) module is applied to the spatial feature maps to strengthen multi-scale spatial consistency. In addition, an IoU-weighted InfoNCE loss is employed to accommodate partial matching during training, thereby improving the robustness of feature alignment. Experiments on the GTA-UAV dataset under the cross-area protocol show stable performance across both retrieval and localization metrics: Recall@1, Recall@5, and Recall@10 reach 18.12%, 38.83%, and 49.47%, respectively; AP is 28.01 and SDM@3 is 0.53; and the top-1 geodesic distance error Dis@1 is 1052.73 m. These results indicate that explicit geometric priors combined with multi-scale spatial enhancement can effectively improve cross-view feature alignment, leading to enhanced robustness and accuracy for localization in challenging power inspection scenarios.
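
The abstract does not spell out how the IoU weighting enters the InfoNCE loss. The sketch below shows one plausible formulation, stated as an assumption: each UAV-to-satellite pair contributes a standard InfoNCE term, and that term is scaled by the pair's overlap IoU so that weakly overlapping (partially matching) pairs pull less strongly. The function name `iou_weighted_info_nce`, the temperature value, and the normalization by the weight sum are illustrative choices, and the paper's exact weighting scheme may differ.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def iou_weighted_info_nce(uav_feats, sat_feats, ious, temperature=0.07):
    """UAV-to-satellite InfoNCE where pair i is weighted by its overlap ious[i].

    uav_feats[i] and sat_feats[i] are assumed to form a positive pair; every
    other satellite feature in the batch serves as a negative for query i.
    """
    queries = [l2_normalize(v) for v in uav_feats]
    keys = [l2_normalize(v) for v in sat_feats]
    weight_sum = sum(ious)
    if weight_sum <= 0:
        raise ValueError("at least one pair must have a positive IoU")

    total = 0.0
    for i, query in enumerate(queries):
        # Cosine similarities scaled by the temperature.
        logits = [dot(query, key) / temperature for key in keys]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        # Negative log-likelihood of the true pair, scaled by its IoU.
        total += ious[i] * (log_denominator - logits[i])
    return total / weight_sum


# Toy batch (made-up features): correctly paired vs. deliberately swapped.
uav = [[1.0, 0.0], [0.0, 1.0]]
sat = [[1.0, 0.0], [0.0, 1.0]]
ious = [1.0, 0.5]

loss_aligned = iou_weighted_info_nce(uav, sat, ious)
loss_swapped = iou_weighted_info_nce(uav, sat[::-1], ious)
```

With all IoU values equal, this reduces to the ordinary one-directional InfoNCE averaged over the batch. A symmetric version would add the satellite-to-UAV direction in the same way.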
