What question did this study set out to answer?

The aim is to develop a comprehensive framework for understanding and predicting near-crash scenarios in driving accidents using video data.

February 14, 2026

ADVersa: Abductive Driving Accident Video Understanding

Key Points

Developed the ADVersa framework for analyzing accident data
Created the MM-AU dataset with over 11,000 driving accident videos
Utilized an Abductive CLIP model for cross-modal semantic learning
Conducted extensive experiments to compare ADVersa with state-of-the-art methods
ADVersa outperformed existing methods in recovering historical near-crash video frames
Successfully predicted future near-crash video frames
Achieved accurate reasoning for textual causes and categories of accidents
Enabled effective normal-to-accident video synthesis and editing

Abstract

Understanding traffic accident scenes is a long-standing research problem in vision-based safe driving. It seeks to answer why accidents occur, how near-crash scenes develop, and what the key elements of an accident are. This research is challenging because accident data are scarce and fragmented and accident environments are complex. To study this problem, we present a framework for Abductive Driving accident Video understanding (ADVersa), which infers a plausible visual and textual explanation for the absent near-crash scenes. ADVersa covers three groups of tasks: 1) visual past recovery of near-crash scenes, 2) visual prediction of near-crash scenes, and 3) accident-cause-involved video synthesis. To support the study, we first contribute MM-AU, a novel dataset for Multi-Modal Accident video Understanding. MM-AU contains 11,727 in-the-wild driving accident videos with temporally aligned text descriptions, 2.23 million well-annotated object boxes, and 58,650 pairs of video-based accident cause texts. We then propose an Abductive CLIP model and a Contrastive Graph Video Pre-training (CGVP) model, which exploit relation-aware cross-modal semantic learning to drive spatially abductive and temporally abductive accident video diffusion. Extensive experiments verify the superiority of ADVersa over state-of-the-art approaches on different tasks, namely historical near-crash video frame recovery, crashing video frame prediction, textual accident cause and category reasoning, normal-to-accident video synthesis, and accident video editing. With these efforts, we hope this research can advance multimodal accident video understanding.

Cite This Study

Li et al. studied this question.

synapsesocial.com/papers/699010942ccff479cfe56ea9 https://doi.org/10.1109/tpami.2026.3663545
