What question did this study set out to answer?

The research aims to improve visual navigation for UAVs by integrating local and global feature representations using a Siamese-ViT approach.

May 15, 2026 | Open Access

Siamese-ViT: A Local–Global Feature Fusion Method for Real-Time Visual Navigation of UAVs in Real-World Environments

Key Points

Proposed a scene matching algorithm utilizing the Siamese-ViT model for feature extraction.
Employed K-means clustering for local feature aggregation and incremental principal component analysis for dimensionality reduction.
Validated the algorithm on the University-1652 dataset and real-world satellite-drone image pairs.
Achieved an average absolute positioning error of 6.2063 m for latitude and 6.7552 m for longitude.
Demonstrated superior performance in Recall and Average Precision compared to existing models.
Successfully conducted flight experiments capturing complex scenes at a flight altitude of 350 m.
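The per-axis figures above can be related to a single horizontal error. Combining the two averages directly gives sqrt(6.2063² + 6.7552²) ≈ 9.17 m, which is lower than the 10.1922 m horizontal error reported in the abstract. One reading consistent with that gap is that the horizontal error is computed per frame and then averaged, since the mean of per-frame distances is never smaller than the distance built from the per-axis means. The sketch below illustrates that reading only; the function name, the dictionary keys, and the assumption that errors are already expressed as north and east offsets in meters are hypothetical and not taken from the paper.

```python
import numpy as np


def positioning_errors(north_err_m, east_err_m):
    """Summarize per-frame position errors given as north/east offsets in meters.

    Assumption: this is one plausible way to compute the reported metrics,
    not the evaluation code used in the study.
    """
    north = np.asarray(north_err_m, dtype=float)
    east = np.asarray(east_err_m, dtype=float)
    return {
        # Mean absolute error along each axis
        "lat_mae_m": float(np.mean(np.abs(north))),
        "lon_mae_m": float(np.mean(np.abs(east))),
        # Mean of the per-frame Euclidean (horizontal) error
        "horizontal_mean_m": float(np.mean(np.hypot(north, east))),
    }
```

With two made-up frames whose offsets are (3, 4) and (-4, 3) meters, both per-axis averages are 3.5 m while the mean horizontal error is 5 m, so the horizontal figure exceeds the value obtained by combining the two axis averages.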

Abstract

Visual scene matching navigation (VSMN) for unmanned aerial vehicles (UAVs) offers high precision, high reliability, and autonomy. Its biggest challenges are the tension between local fine-grained information and global semantics, and limited generalization in real-world environments. Existing Transformer-based cross-view geolocation methods improve global context modeling, but they generally still demand large amounts of training data and computational resources, fuse local fine-grained information and global semantics insufficiently, and struggle to run in real time in complex real-world environments. To address these problems, we propose a scene matching and localization algorithm based on the Siamese-ViT. For feature extraction, the ViT model extracts global features and K-means clustering aggregates local features; the two are combined into a robust local–global feature representation vector. For feature matching, incremental principal component analysis (IPCA) reduces the dimensionality of the high-dimensional feature space, and a KD-tree is constructed for fast feature retrieval to improve matching efficiency. We validated our algorithm on the University-1652 dataset and on a dataset of real-world satellite-drone image pairs. The results show that our Siamese-ViT outperforms other models in both Recall and average precision (AP). We also conducted flight experiments in real-world environments, capturing drone images of complex scenes including farmland, urban buildings, and waterways. At a flight altitude of 350 m, our algorithm achieves an average absolute error of 6.2063 m in latitude and 6.7552 m in longitude, and a horizontal error of 10.1922 m. Our Siamese-ViT therefore demonstrates strong overall positioning accuracy.
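The matching pipeline described in the abstract (K-means aggregation of local features, fusion with a global feature, IPCA reduction, KD-tree retrieval) can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the abstract does not say how cluster results are turned into a vector, so concatenating norm-sorted cluster centers with the global token is an assumption, as are all function names, feature dimensions, and parameter values. Synthetic arrays stand in for ViT patch tokens and the global token.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import IncrementalPCA
from sklearn.neighbors import KDTree


def fuse_features(patch_tokens, global_token, n_clusters=4, seed=0):
    """Build one local-global descriptor for an image (assumed fusion scheme)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    centers = km.fit(patch_tokens).cluster_centers_
    # Sort centers by norm so the concatenation order does not depend on
    # the arbitrary cluster labels K-means assigns.
    order = np.argsort(np.linalg.norm(centers, axis=1))
    fused = np.concatenate([global_token, centers[order].ravel()])
    return fused / np.linalg.norm(fused)


def build_index(gallery, n_components=16, batch_size=32):
    """Reduce gallery descriptors with IPCA and index them in a KD-tree."""
    ipca = IncrementalPCA(n_components=n_components, batch_size=batch_size)
    reduced = ipca.fit_transform(gallery)
    return ipca, KDTree(reduced)


def query(ipca, tree, descriptor, k=1):
    """Return indices and distances of the k nearest gallery descriptors."""
    dist, idx = tree.query(ipca.transform(descriptor[None, :]), k=k)
    return idx[0], dist[0]


def make_demo_tile(rng, n_clusters=4, tokens_per_cluster=12, dim=32):
    """Synthetic stand-in for ViT outputs: (patch tokens, global token)."""
    centers = 5.0 * rng.normal(size=(n_clusters, dim))
    patches = np.repeat(centers, tokens_per_cluster, axis=0)
    patches = patches + 0.1 * rng.normal(size=patches.shape)
    return patches, rng.normal(size=dim)
```

In use, one would call `fuse_features` on every satellite tile to form the gallery, call `build_index` once offline, and then call `query` with the descriptor of each incoming drone frame; the index of the nearest tile maps back to a known geographic position. Keeping the IPCA fit and KD-tree construction offline is what leaves only a projection and a tree lookup on the real-time path.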

Cite This Study

Cheng et al. (Wed,) studied this question.

synapsesocial.com/papers/6a06b83de7dec685947aacaa https://doi.org/10.3390/rs18101556
