What question did this study set out to answer?

This research aims to enhance the movement of emergency vehicles in urban settings using a multimodal approach that combines vision and audio recognition.

April 27, 2026 | Open Access

Multimodal emergency vehicle prioritization through vision–audio fusion and attention-enhanced deep learning for smart traffic signal control

Key Points

The study developed a multimodal emergency-response framework that integrates camera-based object recognition with audio classification.
The framework uses a lightweight convolutional network for vision and a ResNet18 augmented with CBAM (Convolutional Block Attention Module) for audio detection.
Experiments compared the proposed system against unimodal detection methods.
The audio module achieved 100% precision, recall, and F1-score.
The vision module reached over 99% mAP@0.5 at approximately 33 FPS.
The integrated system substantially reduced intersection delays for emergency vehicles.
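
For readers less familiar with the reported metrics, the following is a minimal sketch of how precision, recall, and F1-score are conventionally derived from detection counts. The function name and the example counts are illustrative assumptions and are not taken from the paper.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from raw detection counts.

    tp: true positives  (EV present, detector fired)
    fp: false positives (no EV, detector fired)
    fn: false negatives (EV present, detector missed it)
    """
    # Guard each ratio against a zero denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only (not the paper's data).
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

A score of 100% on all three metrics, as reported for the audio module, corresponds to zero false positives and zero false negatives on the evaluation set.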

Abstract

Efficient movement of emergency vehicles (EVs) remains a major challenge in densely populated cities, where visual occlusion, low lighting, and high ambient noise often limit the effectiveness of single-sensor detection systems. To address these constraints, this study presents a multimodal emergency-response framework that integrates camera-based object recognition with an attention-driven audio classification model. The proposed architecture employs a lightweight convolutional vision detector coupled with a CBAM-augmented ResNet18 audio network, enabling complementary detection capabilities even in adverse traffic conditions. Fusion at the decision layer ensures robust identification of EV sirens and vehicle signatures with minimal latency. Experimental results demonstrate significant performance improvements: the audio module achieved 100% precision, recall, and F1-score, while the vision module attained more than 99% mAP@0.5 and sustained real-time processing speeds of approximately 33 FPS. Compared to unimodal systems, the proposed method achieved notably higher recall and precision, enabling more reliable activation of adaptive traffic signaling. The integrated system substantially reduces intersection delays for emergency vehicles and offers a scalable solution for modern intelligent transportation infrastructures. This intelligent traffic signal prioritization framework directly supports Sustainable Cities and Communities by enabling faster emergency response and reducing urban congestion-related emissions.
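
The abstract does not state the exact fusion rule, so the following is only a minimal sketch of what decision-level fusion can look like, assuming each module emits a confidence score in [0, 1] and the scores are combined by a weighted average. The names `ModalityOutput` and `fuse_decisions`, the equal weights, and the 0.5 threshold are all hypothetical choices for illustration, not the authors' method.

```python
from dataclasses import dataclass


@dataclass
class ModalityOutput:
    """Output of one detector (hypothetical structure, not from the paper)."""
    confidence: float       # estimated probability in [0, 1] that an EV is present
    available: bool = True  # False if the sensor is occluded or has failed


def fuse_decisions(
    vision: ModalityOutput,
    audio: ModalityOutput,
    w_vision: float = 0.5,   # assumed weight
    w_audio: float = 0.5,    # assumed weight
    threshold: float = 0.5,  # assumed decision threshold
) -> bool:
    """Return True if the fused evidence indicates an emergency vehicle.

    Each modality makes its own decision score first; only those scores are
    combined. An unavailable modality is skipped and the remaining weights
    are renormalized.
    """
    sources = [(vision, w_vision), (audio, w_audio)]
    active = [(m.confidence, w) for m, w in sources if m.available]
    total_weight = sum(w for _, w in active)
    if total_weight == 0:
        return False  # no usable evidence, so do not request priority
    fused = sum(c * w for c, w in active) / total_weight
    return fused >= threshold


# Camera occluded, siren clearly audible: audio alone carries the decision.
print(fuse_decisions(ModalityOutput(0.0, available=False), ModalityOutput(0.95)))
```

Combining per-module decisions rather than raw features keeps the two networks independent, which is one plausible reason a decision-layer design can tolerate the loss of a single sensor.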
