The rapid development of autonomous vehicles depends largely on their ability to perceive their environment accurately, with artificial intelligence and computer vision at the core of environmental perception. Deep learning-based perception architectures have revolutionized autonomous driving. However, because a single sensor cannot ensure reliability in complex scenarios, multimodal sensor fusion has become an essential part of modern deep learning architectures. Covering the literature from 2020 to 2025, we analyze the transition from traditional Convolutional Neural Networks (CNNs) to modern Vision Transformers (ViTs) and explore data fusion design methodologies at various processing levels. We also identify significant limitations related to adverse weather conditions and dynamic environments, computational resources, and the overall quality and management of data. The comparative analysis indicates that Vision Transformer and multimodal fusion methodologies provide higher accuracy in perception tasks, but at the cost of increased computational requirements and sensor synchronization challenges. We conclude that achieving full autonomy requires further research in areas such as collaborative perception, unsupervised domain adaptation, and the creation of lightweight models, which together outline a roadmap for future developments.
Nikolaidis et al. (Fri,) studied this question.