What question did this study set out to answer?

This research aims to construct a benchmark suite for evaluating aerospace embodied intelligence in UAVs.

June 4, 2026

AeroVerse: UAV-Agent Benchmark Suite for Simulating, Pre-training, Finetuning, and Evaluating Aerospace Embodied Foundation Models

Key Points

This research aims to construct a benchmark suite for evaluating aerospace embodied intelligence in UAVs.
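The authors developed AeroSimulator, a simulation platform with four realistic urban scenes for UAV flight.
They built two pre-training datasets: AerialAgent-Ego15k (real-world image-text) and CyberAgent-Ego500k (virtual image-text-pose).
They defined five downstream tasks and constructed a matching instruction dataset for fine-tuning on each.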
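SkyAgent outperforms existing methods by an average of 8.52% across four core tasks.
Benchmarking ten mainstream models revealed significant limitations of existing 2D/3D visual-language models on complex aerospace embodied tasks.
SkyAgent-Eval, an evaluation system based on GPT-4, scores performance on the downstream tasks.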

Abstract

Aerospace embodied intelligence aims to empower unmanned aerial vehicles (UAVs) and other aerospace platforms to achieve autonomous perception, cognition, and action, as well as egocentric active interaction with humans and the environment. The aerospace embodied foundation model serves as an effective means to realize the autonomous intelligence of UAVs and represents a necessary pathway toward aerospace embodied intelligence.

Background: Existing embodied foundation models primarily focus on ground-level intelligent agents in indoor scenarios, while UAV intelligent agents remain unexplored and lack systematic, standardized benchmark suites.

Aim: To address this gap, this study aims to construct a comprehensive benchmark suite, AeroVerse, to facilitate the simulation, pre-training, finetuning, and evaluation of aerospace embodied foundation models.

Innovations: We develop AeroSimulator, a simulation platform that encompasses four realistic urban scenes for UAV flight simulation. Additionally, we construct the first large-scale real-world image-text pre-training dataset from a first-person UAV perspective, AerialAgent-Ego15k, and create a virtual image-text-pose alignment dataset, CyberAgent-Ego500k, to facilitate the pre-training of the aerospace embodied foundation model. We clearly define five downstream tasks for the first time, i.e., aerospace embodied scene awareness, spatial reasoning, navigational exploration, task planning, and motion decision, and construct corresponding instruction datasets for fine-tuning. We also develop SkyAgent-Eval, a downstream task evaluation system based on GPT-4. Furthermore, we propose SkyAgent, the first UAV-agent large model integrating "perception-reasoning-navigating-planning", which incorporates an aerospace embodied chain-of-thought mechanism and a multitask curriculum learning strategy.

Results: By benchmarking ten mainstream models, our results reveal the significant limitations of existing 2D/3D visual-language models in complex aerospace embodied tasks and demonstrate the superior performance of SkyAgent, which outperforms existing methods by an average of 8.52% across four core tasks, underscoring the necessity and contribution of our work. The AeroVerse benchmark suite will be released to the community to promote exploration and development of aerospace embodied intelligence.
