What question did this study set out to answer?

The study aims to improve the recognition of rare and ambiguous hazards that existing autonomous driving perception models often miss.

February 5, 2026 | Open Access

Unseen Hazard Recognition in Autonomous Driving Using Vision–Language and Sensor-Based Temporal Models

Key Points

Used the COOOL benchmark dataset of 200 dashcam video sequences annotated with common and uncommon traffic hazards.
Integrated YOLO11x for object detection and the Hough Transform for lane estimation.
Employed GPT-4o for scene description and LSTM networks for temporal modeling.
Measured performance with metrics such as mAP, recall, and AUROC, including daylight and low-light conditions.
YOLO11x achieved an mAP@0.5 of 54.1% on common object categories.
Incorporating temporal modeling improved hazard recall to 71.8%.
Mean lateral deviation for lane estimation was 8.9 pixels in daylight and 13.4 pixels in low-light conditions.
The temporal anomaly detection module achieved an AUROC of 0.65, indicating limited but meaningful discrimination between nominal and anomalous driving situations.
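
An AUROC value such as the one reported for the temporal anomaly detection module can be read as the probability that a randomly chosen anomalous sequence receives a higher score than a randomly chosen nominal one. The following is a minimal sketch of that definition in plain Python. The function name `auroc` and the scores and labels are hypothetical and for illustration only; they are not taken from the study, and this is not the authors' evaluation code.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    labels: 1 = anomalous, 0 = nominal.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical anomaly scores for six sequences (illustrative only).
scores = [0.9, 0.4, 0.6, 0.3, 0.5, 0.2]
labels = [1, 1, 0, 0, 1, 0]
print(round(auroc(scores, labels), 3))  # 7 of 9 pairs ranked correctly -> 0.778
```

Under this reading, 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is why a value of 0.65 indicates only modest discrimination.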

Abstract

Autonomous driving (AD) systems remain vulnerable to rare, ambiguous, and out-of-label (OOL) hazards that are insufficiently represented in conventional training datasets. This work investigates perception robustness under such conditions using the Challenge of Out-Of-Label (COOOL) benchmark dataset, which consists of 200 dashcam video sequences annotated with both common and uncommon traffic hazards. We analyze the behavior of widely used perception components and present a multimodal pipeline that integrates YOLO11x for object detection, the Hough Transform for lane estimation, GPT-4o for scene description, and Long Short-Term Memory (LSTM) networks for temporal modeling. On the COOOL benchmark, YOLO11x achieves an mAP@0.5 of 54.1% on the common object categories, whereas the detection of rare and OOL hazards remains challenging, with a recall of 72.6%. Incorporating temporal risk modeling improves hazard recall to 71.8%, indicating a modest but consistent gain in recognizing uncommon events. The Hough Transform shows stable behavior for lane estimation in standard conditions, with a mean lateral deviation of 8.9 pixels in daylight scenes and 13.4 pixels under low-light conditions. The temporal anomaly detection module attains an AUROC of 0.65, reflecting limited but meaningful discrimination between nominal and anomalous driving situations. For interpretability, the GPT-4o scene description module generates context-aware textual explanations with an object coverage score of 0.72 and a factual consistency rate of 78%, as assessed through manual inspection. The end-to-end pipeline operates at approximately 10–12 frames per second on a single GPU, supporting near-real-time analysis and optimization.
Our results confirm that state-of-the-art perception models struggle with OOL hazards and that multimodal vision–language–temporal integration provides incremental improvements in robustness and interpretability when evaluated under standardized out-of-distribution conditions.
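
To illustrate the idea behind the Hough Transform used for lane estimation, the sketch below has each edge pixel vote for every line (rho, theta) that could pass through it; the most-voted bin is taken as the dominant line. This is a simplified plain-Python illustration, not the authors' implementation. The function name `hough_best_line`, its parameters, and the sample points are assumptions made for this example.

```python
import math
from collections import Counter


def hough_best_line(points, n_theta=180, rho_step=1.0):
    """Return (rho, theta_degrees, votes) for the strongest line through `points`.

    Lines are parameterized as rho = x*cos(theta) + y*sin(theta),
    with theta sampled at n_theta evenly spaced angles in [0, 180) degrees.
    """
    votes = Counter()
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            # Quantize rho so that nearly collinear points share a bin.
            votes[(round(rho / rho_step), t)] += 1
    (rho_bin, t), count = votes.most_common(1)[0]
    return rho_bin * rho_step, t * 180.0 / n_theta, count


# Hypothetical edge pixels lying on the vertical line x = 5 (illustrative only).
lane_pixels = [(5, y) for y in range(0, 100, 10)]
print(hough_best_line(lane_pixels))  # (5.0, 0.0, 10)
```

In practice, a library routine applied to an edge map (for example, after Canny edge detection) would replace this brute-force voting loop. The sketch only shows why collinear pixels reinforce a single accumulator bin, and why weak or noisy edges, as in low-light scenes, can degrade the estimate.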
