Understanding traffic accident scenes is a long-standing research problem in vision-based safe driving. It seeks to answer why accidents occur, how near-crash scenes develop, and what the key elements of an accident are. The problem is challenging because accident data are scarce and fragmented and accident environments are complex. To study it, we present ADVersa, a framework for Abductive Driving accident Video understanding that infers plausible visual and textual explanations for absent near-crash scenes. ADVersa covers three groups of tasks: 1) visual recovery of past near-crash scenes, 2) visual prediction of near-crash scenes, and 3) accident-cause-involved video synthesis. To support the study, we first contribute MM-AU, a novel dataset for Multi-Modal Accident video Understanding. MM-AU contains 11,727 in-the-wild driving accident videos with temporally aligned text descriptions, 2.23 million well-annotated object boxes, and 58,650 pairs of video-based accident cause texts. We then propose an Abductive CLIP model and a Contrastive Graph Video Pre-training (CGVP) model, which exploit relation-aware cross-modal semantic learning to drive spatially and temporally abductive accident video diffusion. Extensive experiments verify the superiority of ADVersa over state-of-the-art approaches on different tasks, namely historical near-crash video frame recovery, crashing video frame prediction, textual accident cause and category reasoning, normal-to-accident video synthesis, and accident video editing. We hope this research can advance multimodal accident video understanding.
Leilei Li
Jianwu Fang
Junbin Xiao
IEEE Transactions on Pattern Analysis and Machine Intelligence
National University of Singapore
Nanyang Technological University
Xi'an Jiaotong University
Li et al. (Thu,) studied this question.
www.synapsesocial.com/papers/699010942ccff479cfe56ea9 — DOI: https://doi.org/10.1109/tpami.2026.3663545