Recent advances in Large Vision–Language Models (LVLMs) have demonstrated strong cross-modal reasoning capabilities, offering new opportunities for decision-making in autonomous driving. However, existing end-to-end approaches still suffer from limited semantic consistency, weak task controllability, and insufficient interpretability. To address these challenges, we propose SemAlign-E2E (Semantic-Aligned End-to-End), a semantically aligned multimodal LVLM framework that unifies visual, LiDAR, and task-oriented textual inputs through cross-modal attention. This design enables end-to-end reasoning from scene understanding to high-level driving command generation. Beyond producing structured control instructions, the framework also provides natural-language explanations to enhance interpretability. We conduct extensive evaluations on the nuScenes dataset and the CARLA simulation platform. Experimental results show that SemAlign-E2E achieves substantial improvements in driving stability, safety, multi-task generalization, and semantic comprehension, consistently outperforming state-of-the-art baselines. Notably, the framework exhibits superior behavioral consistency and risk-aware decision-making in complex traffic scenarios. These findings highlight the potential of LVLM-driven semantic reasoning for autonomous driving and provide a scalable pathway toward future semantically enhanced end-to-end driving systems.
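The abstract does not detail how the cross-modal attention is built, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes scaled dot-product attention in which task-text tokens act as queries over a concatenated set of camera and LiDAR tokens. The function names, the toy 4-dimensional features, and the omission of learned projection matrices are all illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (no learned projections).

    Each query vector attends over all key vectors; the output is the
    attention-weighted sum of the value vectors.
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        w = softmax(scores)
        fused_vec = [
            sum(wj * v[i] for wj, v in zip(w, values))
            for i in range(len(values[0]))
        ]
        outputs.append(fused_vec)
        weights.append(w)
    return outputs, weights


# Hypothetical toy features: one task-text token, two camera tokens,
# and one LiDAR token, all already embedded in a shared 4-d space.
text_tokens = [[1.0, 0.0, 0.0, 0.0]]
camera_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
lidar_tokens = [[0.0, 0.0, 1.0, 0.0]]

# Sensor tokens from both modalities serve as keys and values.
sensor_tokens = camera_tokens + lidar_tokens
fused, attn = cross_attention(text_tokens, sensor_tokens, sensor_tokens)
```

In this toy setup the text token is most similar to the first camera token, so that token receives the largest attention weight. A real model would apply learned query, key, and value projections and multiple heads before fusion.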
Feng Peng
Shangju She
Zejian Deng
Machines
University of Hong Kong
Chinese University of Hong Kong
Wuhan University of Technology
Peng et al. studied this question.
www.synapsesocial.com/papers/69730f78c8125b09b0d1f3d4 — DOI: https://doi.org/10.3390/machines14010125