What does this research mean for the field?

Coordination strategies, training recipes, and action representations developed for bimanual Vision-Language-Action (VLA) models are transferable to unmanned aerial systems. Novelty: synthesis. Consensus alignment: neutral.

What question did this study set out to answer?

To review and synthesize research on Vision–Language–Action models in bimanual manipulation and unmanned aerial robotics over the past nine years.

May 28, 2026. Open Access.

Vision–Language–Action (VLA) Models for Unmanned Aerial Robotics and Bimanual Manipulation: A Review

Key Points

Reviewed 183 contributions from 2017 to 2026.
Organized findings into seven dimensions: VLA architectures, training recipes, action representations, bimanual coordination, UAV navigation and control, language grounding, and cross-cutting concerns.
Identified strategies and adaptations for transferring methods between bimanual and unmanned aerial systems.
Found that coordination strategies from bimanual VLAs effectively apply to unmanned aerial systems.
Identified fourteen research directions that connect bimanual manipulation with unmanned aerial robotics.
Showed that action representations developed for one domain can enhance performance in the other.

Abstract

Vision–Language–Action (VLA) models unify visual perception, natural-language understanding, and action generation within a single foundation model, allowing a robot to follow instructions such as “fold the towel” or “fly to the red building” directly from camera images. Because VLAs inherit world knowledge from internet-scale pre-training, they have become the dominant framework for learning-based manipulation, with bimanual coordination serving as the most demanding testbed: two arms with 7+ degrees of freedom each must move in concert to fold, assemble, and reorient objects. Unmanned aerial robotics faces a structurally similar challenge: a drone must coordinate thrust, attitude, and, increasingly, gripper commands from visual observations under strict latency and payload constraints. This review covers 183 contributions spanning 2017–2026, organized along seven dimensions: VLA architectures, training recipes, action representations, bimanual coordination (2022–2026), unmanned aerial vehicle (UAV) navigation and control (2017–2026), language grounding, and cross-cutting concerns including memory and world models. We show that the coordination strategies, training recipes, and action representations developed for bimanual VLAs transfer to unmanned aerial systems, and we identify fourteen research directions across both domains.
