What question did this study set out to answer?

The study aims to analyze advancements in deep learning architectures and sensor fusion for autonomous vehicle perception.

May 24, 2026 | Open Access

Vision and Multimodal Perception for Autonomous Driving: Deep Learning Architectures, Tasks, and Sensor Fusion

Key Points

Literature review from 2020 to 2025 on deep learning architectures including CNNs and ViTs.
Comparative analysis of perception tasks and limitations related to multimodal sensor integration.
Assessment of challenges in adverse weather conditions and data management for autonomous systems.
Vision-transformer and multimodal fusion methodologies achieved higher accuracy in perception tasks, but at the cost of increased computational demands.
Identified challenges with sensor synchronization and the need for robust data management in dynamic environments.
Future directions include research in collaborative perception and lightweight model development.

Abstract

The rapid development of autonomous vehicles depends mainly on their ability to perceive their environment accurately, with artificial intelligence and computer vision at the core of that perception. Deep learning-based perception architectures have revolutionized the field of autonomous driving. However, because a single sensor cannot ensure reliability in complex scenarios, multimodal sensor fusion has become an essential part of modern deep learning architectures. Covering the literature from 2020 to 2025, we analyze the transition from traditional Convolutional Neural Networks (CNNs) to modern Vision Transformers (ViTs) and explore data fusion design methodologies at various processing levels. We also identify significant limitations related to adverse weather conditions, dynamic environments, computational resources, and the overall quality and management of data. The comparative analysis indicates that vision-transformer and multimodal fusion methodologies provide higher accuracy in perception tasks, but at the cost of increased computational requirements and sensor synchronization challenges. Finally, achieving full autonomy requires further research in collaborative perception, unsupervised domain adaptation, and the creation of lightweight models; these directions offer a roadmap for future developments.
