What type of study is this?

This is an experimental study.

October 8, 2025 | Open Access

VLA-MP: A Vision-Language-Action Framework for Multimodal Perception and Physics-Constrained Action Generation in Autonomous Driving

Key Points

VLA-MP outperforms recent VLA methods such as LMDrive and AD-H across the LangAuto benchmark series, evaluated on the LMDrive dataset and the CARLA simulator.
Best driving scores are 44.3, 63.5, and 78.4 on LangAuto, LangAuto-Short, and LangAuto-Tiny, respectively, with infraction scores of 0.89–0.95.
The system integrates multimodal BEV perception and vision-language alignment with physics-informed action generation.
Visualization and video results show the framework’s capacity to follow complex language-conditioned instructions and prioritize safety.

Abstract

Autonomous driving in complex real-world environments requires robust perception, reasoning, and physically feasible planning, which remain challenging for current end-to-end approaches. This paper introduces VLA-MP, a unified vision-language-action framework that integrates multimodal Bird’s-Eye View (BEV) perception, vision-language alignment, and a GRU-bicycle dynamics cascade adapter for physics-informed action generation. The system constructs structured environmental representations from RGB images and LiDAR, aligns scene features with natural language instructions through a cross-modal projector and large language model, and converts the model’s high-level semantic hidden-state outputs into executable and physically consistent trajectories. Experiments on the LMDrive dataset and CARLA simulator demonstrate that VLA-MP achieves high performance across the LangAuto benchmark series, with best driving scores of 44.3, 63.5, and 78.4 on LangAuto, LangAuto-Short, and LangAuto-Tiny, respectively, while maintaining high infraction scores of 0.89–0.95, outperforming recent VLA methods such as LMDrive and AD-H. Visualization and video results further validate the framework’s ability to follow complex language-conditioned instructions, adapt to dynamic environments, and prioritize safety. These findings highlight the potential of combining multimodal perception, language reasoning, and physics-aware adapters for robust and interpretable autonomous driving.
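
To illustrate what "physically consistent trajectories" can mean in practice, the sketch below rolls out a standard kinematic bicycle model from a sequence of acceleration and steering commands. It is not the paper's adapter: the GRU stage is omitted, and the names `State`, `bicycle_step`, and `rollout`, the 2.7 m wheelbase, and the 0.1 s time step are all assumptions chosen for illustration. In a cascade like the one the abstract describes, the control sequence would presumably come from a learned module rather than being hand-written.

```python
import math
from dataclasses import dataclass


@dataclass
class State:
    """Hypothetical vehicle state: position (m), heading (rad), speed (m/s)."""
    x: float
    y: float
    yaw: float
    v: float


def bicycle_step(s, accel, steer, dt=0.1, wheelbase=2.7):
    """Advance one step of the kinematic bicycle model (forward Euler).

    accel is longitudinal acceleration (m/s^2); steer is the front-wheel
    steering angle (rad). dt and wheelbase are assumed values.
    """
    x = s.x + s.v * math.cos(s.yaw) * dt
    y = s.y + s.v * math.sin(s.yaw) * dt
    yaw = s.yaw + (s.v / wheelbase) * math.tan(steer) * dt
    v = max(0.0, s.v + accel * dt)  # assumption: no reversing
    return State(x, y, yaw, v)


def rollout(s0, controls, dt=0.1, wheelbase=2.7):
    """Integrate a list of (accel, steer) pairs into a list of states."""
    traj = []
    s = s0
    for accel, steer in controls:
        s = bicycle_step(s, accel, steer, dt, wheelbase)
        traj.append(s)
    return traj


if __name__ == "__main__":
    # One second of gentle left steering at a constant 5 m/s.
    start = State(x=0.0, y=0.0, yaw=0.0, v=5.0)
    for state in rollout(start, [(0.0, 0.1)] * 10):
        print(f"x={state.x:.2f} y={state.y:.2f} yaw={state.yaw:.3f}")
```

Because every waypoint is produced by integrating the vehicle model, the resulting path cannot contain sideways jumps or turns tighter than the steering geometry allows, which is the general appeal of placing a dynamics model after a learned predictor.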

Cite This Study

Ge et al. (Sun,) studied this question.

synapsesocial.com/papers/68e5c1c36950a706b22b5c10 https://doi.org/10.3390/s25196163
