Abstract
Efficient movement of emergency vehicles (EVs) remains a major challenge in densely populated cities, where visual occlusion, low lighting, and high ambient noise often limit the effectiveness of single-sensor detection systems. To address these constraints, this study presents a multimodal emergency-response framework that integrates camera-based object recognition with an attention-driven audio classification model. The proposed architecture employs a lightweight convolutional vision detector coupled with a CBAM-augmented ResNet18 audio network, enabling complementary detection capabilities even in adverse traffic conditions. Fusion at the decision layer ensures robust identification of EV sirens and vehicle signatures with minimal latency. Experimental results demonstrate significant performance improvements: the audio module achieved 100% precision, recall, and F1-score, while the vision module attained more than 99% mAP@0.5 and sustained real-time processing speeds of approximately 33 FPS. Compared to unimodal systems, the proposed method achieved notably higher recall and precision, enabling more reliable activation of adaptive traffic signaling. The integrated system substantially reduces intersection delays for emergency vehicles and offers a scalable solution for modern intelligent transportation infrastructures. This intelligent traffic signal prioritization framework directly supports Sustainable Cities and Communities by enabling faster emergency response and reducing urban congestion-related emissions.
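The abstract says the two modalities are fused at the decision layer but does not state the fusion rule. As an illustration only, the sketch below assumes a simple weighted average of per-modality confidences with a fallback to whichever sensor is available. The names `ModalityScore` and `fuse_decisions`, the weight, and the threshold are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ModalityScore:
    """Confidence in [0, 1] that an emergency vehicle is present (hypothetical type)."""
    confidence: float
    available: bool = True  # False if the sensor produced no usable output


def fuse_decisions(vision: ModalityScore, audio: ModalityScore,
                   vision_weight: float = 0.5, threshold: float = 0.5) -> bool:
    """Assumed late-fusion rule: weighted average of the two confidences.

    If only one modality is available (e.g. the camera is occluded),
    its confidence alone is compared against the threshold.
    """
    if not vision.available and not audio.available:
        return False  # nothing to base a decision on
    if not vision.available:
        fused = audio.confidence
    elif not audio.available:
        fused = vision.confidence
    else:
        fused = (vision_weight * vision.confidence
                 + (1.0 - vision_weight) * audio.confidence)
    return fused >= threshold


if __name__ == "__main__":
    # Occluded camera view, but a clearly audible siren.
    print(fuse_decisions(ModalityScore(0.2), ModalityScore(0.9)))
```

Under this assumed rule, a strong siren score can compensate for a weak visual score, which is the kind of complementary behavior the abstract attributes to the multimodal design. The actual system may use a different rule, such as logical OR or learned weights.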
Rathod et al. (Sat,) studied this question.