Autonomous driving (AD) systems remain vulnerable to rare, ambiguous, and out-of-label (OOL) hazards that are insufficiently represented in conventional training datasets. This work investigates perception robustness under such conditions using the Challenge of Out-Of-Label (COOOL) benchmark dataset, which consists of 200 dashcam video sequences annotated with both common and uncommon traffic hazards. We analyze the behavior of widely used perception components and present a multimodal pipeline that integrates YOLO11x for object detection, the Hough Transform for lane estimation, GPT-4o for scene description, and Long Short-Term Memory (LSTM) networks for temporal modeling. On the COOOL benchmark, YOLO11x achieves an mAP@0.5 of 54.1% on the common object categories, whereas the detection of rare and OOL hazards remains challenging, with a recall of 72.6%. Incorporating temporal risk modeling improves hazard recall to 71.8%, indicating a modest but consistent gain in recognizing uncommon events. For lane estimation, the Hough Transform behaves stably under standard conditions, with a mean lateral deviation of 8.9 pixels in daylight scenes and 13.4 pixels under low-light conditions. The temporal anomaly detection module attains an AUROC of 0.65, reflecting limited but meaningful discrimination between nominal and anomalous driving situations. For interpretability, the GPT-4o scene description module generates context-aware textual explanations with an object coverage score of 0.72 and a factual consistency rate of 78%, as assessed through manual inspection. The end-to-end pipeline operates at approximately 10–12 frames per second on a single GPU, supporting near-real-time analysis and optimization.
Our results confirm that state-of-the-art perception models struggle with OOL hazards and that multimodal vision–language–temporal integration provides incremental improvements in robustness and interpretability when evaluated under standardized out-of-distribution conditions.
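To make the reported AUROC easier to interpret, the following minimal sketch computes AUROC as the probability that a randomly chosen anomalous sample receives a higher score than a randomly chosen nominal one (the Mann-Whitney formulation, with ties counted as one half). It is an illustration only, not the evaluation code used in this work: the function name `auroc` and the scores and labels are hypothetical, and the pairwise loop is chosen for clarity rather than speed.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison of anomalous (1) and nominal (0) scores.

    Returns the fraction of (anomalous, nominal) pairs in which the
    anomalous sample has the higher score; ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one sample of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical anomaly scores for six clips (1 = anomalous, 0 = nominal).
scores = [0.9, 0.4, 0.35, 0.8, 0.2, 0.6]
labels = [1, 0, 1, 0, 0, 1]
example_auroc = auroc(scores, labels)
print(f"AUROC = {example_auroc:.3f}")
```

Under this reading, a value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so an AUROC of 0.65 means an anomalous situation is ranked above a nominal one in roughly 65% of such pairs.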
Mehmood et al. (Mon,) studied this question.