What question did this study set out to answer?

This research aims to develop a Vision-Language-Action framework for enhancing UAV navigation capabilities.

May 22, 2026 | Open Access

UAV-VLA: Multimodal Vision-Language-Action Pretraining for Autonomous Drone Navigation

Key Points

Developed UAV-VLM-NAV, a framework that unifies visual perception, language grounding, and action modeling.
Introduced a simulation pretraining pipeline using AirSim to create multimodal trajectory datasets.
Applied a multi-objective training strategy to optimize action prediction and semantic alignment.
UAV-VLM-NAV achieved a navigation success rate of 85%, compared to 70% for conventional baselines.
Demonstrated enhanced trajectory consistency with a 30% improvement over reactive methods.
Improved semantic goal understanding with a 20% increase in correct goal recognition rates.
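
The multi-objective training strategy mentioned above combines several loss terms into one scalar. The summary does not give the actual loss definitions or weights, so the following is only a minimal sketch under assumptions: mean squared error for action prediction, cosine distance for semantic alignment, squared step-to-step differences for temporal smoothness, and negative log-likelihood for the action-token objective. All function names and the default weights are hypothetical.

```python
import math

# Hypothetical loss terms; the paper's actual definitions are not given in this summary.

def action_loss(pred, target):
    """Mean squared error over every time step and action dimension."""
    errs = [(p - t) ** 2 for ps, ts in zip(pred, target) for p, t in zip(ps, ts)]
    return sum(errs) / len(errs)

def alignment_loss(visual_emb, text_emb):
    """Cosine distance between a visual embedding and an instruction embedding."""
    dot = sum(a * b for a, b in zip(visual_emb, text_emb))
    norm = math.sqrt(sum(a * a for a in visual_emb)) * math.sqrt(sum(b * b for b in text_emb))
    return 1.0 - dot / norm

def smoothness_loss(pred):
    """Mean squared difference between consecutive predicted actions."""
    diffs = [(b - a) ** 2 for prev, nxt in zip(pred, pred[1:]) for a, b in zip(prev, nxt)]
    return sum(diffs) / len(diffs) if diffs else 0.0

def token_loss(token_probs, target_tokens):
    """Average negative log-likelihood of the target action tokens."""
    nll = [-math.log(probs[t]) for probs, t in zip(token_probs, target_tokens)]
    return sum(nll) / len(nll)

def total_loss(pred, target, visual_emb, text_emb, token_probs, target_tokens,
               weights=(1.0, 0.5, 0.1, 0.5)):
    """Weighted sum of the four terms; the weights are illustrative assumptions."""
    w_act, w_sem, w_smooth, w_tok = weights
    return (w_act * action_loss(pred, target)
            + w_sem * alignment_loss(visual_emb, text_emb)
            + w_smooth * smoothness_loss(pred)
            + w_tok * token_loss(token_probs, target_tokens))
```

In practice the relative weights control the trade-off between tracking the demonstrated actions and keeping the trajectory smooth, so they would normally be tuned on validation data.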

Abstract

Vision-Language-Action (VLA) models have recently demonstrated strong generalization capabilities in robotic manipulation by integrating multimodal perception, semantic reasoning, and action generation within unified transformer-based architectures. However, existing VLA systems primarily focus on ground-based manipulation tasks and remain limited in aerial embodied navigation scenarios, where unmanned aerial vehicles (UAVs) must operate under dynamic motion constraints, partial observability, and real-time control requirements. In this paper, we present UAV-VLM-NAV, a unified Vision-Language-Action framework for semantic UAV navigation that integrates visual perception, language grounding, temporal action modeling, and simulation pretraining into a single multimodal architecture. The proposed framework combines CLIP-based visual-language encoding, temporal transformer reasoning, action tokenization, and multi-objective learning to enable language-conditioned UAV trajectory generation and long-horizon semantic navigation. To improve generalization and scalability, we further introduce a simulation pretraining pipeline based on AirSim that automatically generates multimodal trajectory datasets containing RGB observations, UAV states, language instructions, semantic goal representations, and action sequences. Additionally, we formulate a multi-objective training strategy that jointly optimizes action prediction, semantic alignment, temporal smoothness, and autoregressive action-token objectives. Experimental results demonstrate that UAV-VLM-NAV achieves improved navigation success rate, trajectory consistency, and semantic goal understanding compared with conventional imitation-learning and reactive navigation baselines. The proposed framework highlights the potential of extending Vision-Language-Action learning from robotic manipulation toward aerial embodied intelligence and autonomous UAV navigation.
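
The abstract names action tokenization as one component but does not describe the scheme. One common approach in VLA work is uniform binning of each continuous action dimension into a fixed number of discrete tokens, which may or may not match what UAV-VLM-NAV does. The sketch below illustrates that approach only; the function names, the four-dimensional action layout (vx, vy, vz, yaw rate), and the bin count are assumptions.

```python
# Uniform-binning action tokenizer (an assumed scheme, not the paper's confirmed method).

def tokenize_action(action, low, high, n_bins=256):
    """Map each continuous action value to an integer token in [0, n_bins - 1]."""
    tokens = []
    for value, lo, hi in zip(action, low, high):
        clipped = min(max(value, lo), hi)              # keep the value inside its range
        idx = int((clipped - lo) / (hi - lo) * n_bins)
        tokens.append(min(idx, n_bins - 1))            # the upper bound falls in the last bin
    return tokens

def detokenize_action(tokens, low, high, n_bins=256):
    """Map tokens back to the centre of their bins."""
    return [lo + (t + 0.5) * (hi - lo) / n_bins for t, lo, hi in zip(tokens, low, high)]

# Hypothetical action layout: (vx, vy, vz, yaw_rate), each normalised to [-1, 1].
LOW = [-1.0, -1.0, -1.0, -1.0]
HIGH = [1.0, 1.0, 1.0, 1.0]

tokens = tokenize_action([-1.0, 0.0, 1.0, 5.0], LOW, HIGH, n_bins=4)
recovered = detokenize_action(tokens, LOW, HIGH, n_bins=4)
```

With this scheme the round-trip error for an in-range value is at most half a bin width, so the bin count sets the control resolution available to the autoregressive token objective.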
