Aerospace embodied intelligence aims to enable unmanned aerial vehicles (UAVs) and other aerospace platforms to perceive, reason, and act autonomously, and to interact actively and egocentrically with humans and the environment. Aerospace embodied foundation models are an effective means of realizing autonomous UAV intelligence and a necessary step toward aerospace embodied intelligence.
Background However, existing embodied foundation models primarily focus on ground-level intelligent agents in indoor scenarios, while UAV intelligent agents remain unexplored, and the field lacks systematic, standardized benchmark suites.
Aim To address this gap, this study aims to construct a comprehensive benchmark suite, AeroVerse, to facilitate the simulation, pre-training, fine-tuning, and evaluation of aerospace embodied foundation models.
Innovations We develop AeroSimulator, a simulation platform that encompasses four realistic urban scenes for UAV flight simulation. We construct AerialAgent-Ego15k, the first large-scale real-world image-text pre-training dataset from a first-person UAV perspective, and CyberAgent-Ego500k, a virtual image-text-pose alignment dataset, to facilitate the pre-training of aerospace embodied foundation models. We define, for the first time, five downstream tasks (aerospace embodied scene awareness, spatial reasoning, navigational exploration, task planning, and motion decision) and construct corresponding instruction datasets for fine-tuning. We also develop SkyAgent-Eval, a downstream task evaluation system based on GPT-4. Furthermore, we propose SkyAgent, the first UAV-agent large model integrating "perception-reasoning-navigating-planning", which incorporates an aerospace embodied chain-of-thought mechanism and a multitask curriculum learning strategy.
Results Benchmarking ten mainstream models reveals the significant limitations of existing 2D/3D visual-language models in complex aerospace embodied tasks and demonstrates the superior performance of SkyAgent, which outperforms existing methods by an average of 8.52% across four core tasks, underscoring the necessity and contribution of our work. The AeroVerse benchmark suite will be released to the community to promote the exploration and development of aerospace embodied intelligence.
Yao et al. (Thu,) studied this question.