Vision-Language-Action (VLA) models have recently demonstrated strong generalization capabilities in robotic manipulation by integrating multimodal perception, semantic reasoning, and action generation within unified transformer-based architectures. However, existing VLA systems primarily focus on ground-based manipulation tasks and remain limited in aerial embodied navigation scenarios, where unmanned aerial vehicles (UAVs) must operate under dynamic motion constraints, partial observability, and real-time control requirements. In this paper, we present UAV-VLM-NAV, a unified Vision-Language-Action framework for semantic UAV navigation that integrates visual perception, language grounding, temporal action modeling, and simulation pretraining into a single multimodal architecture. The proposed framework combines CLIP-based visual-language encoding, temporal transformer reasoning, action tokenization, and multi-objective learning to enable language-conditioned UAV trajectory generation and long-horizon semantic navigation. To improve generalization and scalability, we further introduce a simulation pretraining pipeline based on AirSim that automatically generates multimodal trajectory datasets containing RGB observations, UAV states, language instructions, semantic goal representations, and action sequences. Additionally, we formulate a multi-objective training strategy that jointly optimizes action prediction, semantic alignment, temporal smoothness, and autoregressive action-token objectives. Experimental results demonstrate that UAV-VLM-NAV achieves improved navigation success rate, trajectory consistency, and semantic goal understanding compared with conventional imitation-learning and reactive navigation baselines. The proposed framework highlights the potential of extending Vision-Language-Action learning from robotic manipulation toward aerial embodied intelligence and autonomous UAV navigation.
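To make the action tokenization and multi-objective training ideas above more concrete, the following is a minimal sketch, not the authors' implementation. It assumes uniform binning of normalized action dimensions, a mean-squared-difference smoothness term, and a plain weighted sum of the four objectives. The function names, the bin count, the action range, and the loss weights are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def tokenize_action(action: Sequence[float], low: float = -1.0,
                    high: float = 1.0, n_bins: int = 256) -> List[int]:
    """Map each continuous action dimension to a bin index (assumed uniform binning)."""
    tokens = []
    for a in action:
        clipped = min(max(a, low), high)           # clamp to the assumed action range
        frac = (clipped - low) / (high - low)      # normalize to [0, 1]
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens


def detokenize_action(tokens: Sequence[int], low: float = -1.0,
                      high: float = 1.0, n_bins: int = 256) -> List[float]:
    """Recover the bin-center value for each token."""
    width = (high - low) / n_bins
    return [low + (t + 0.5) * width for t in tokens]


def smoothness_penalty(actions: Sequence[Sequence[float]]) -> float:
    """Mean squared difference between consecutive actions in a trajectory."""
    if len(actions) < 2:
        return 0.0
    total, count = 0.0, 0
    for prev, cur in zip(actions, actions[1:]):
        for p, c in zip(prev, cur):
            total += (c - p) ** 2
            count += 1
    return total / count


def total_loss(l_action: float, l_align: float, l_smooth: float, l_token: float,
               weights: Sequence[float] = (1.0, 0.5, 0.1, 0.5)) -> float:
    """Weighted sum of the four objectives; the default weights are placeholders."""
    terms = (l_action, l_align, l_smooth, l_token)
    return sum(w * t for w, t in zip(weights, terms))


if __name__ == "__main__":
    # Toy trajectory of 2-D normalized velocity commands (illustrative values only).
    trajectory = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.1]]
    print([tokenize_action(a) for a in trajectory])
    print(smoothness_penalty(trajectory))
```

In a real training loop, the individual loss terms would come from the model's regression head, the vision-language alignment objective, and the cross-entropy over action tokens. Here they are plain floats so that only the way the terms are combined is shown.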
Haotian Gu (Wed,) studied this question.