Video understanding seeks to enable machines to interpret visual content at three levels: action, event, and story. Existing models are limited in high-level, long-term story understanding because of (1) their oversimplified treatment of temporal information and (2) the training bias introduced by action- and event-centric datasets. To address these limitations, we introduce SCVBench, a novel benchmark for story-centric video understanding. SCVBench evaluates large vision-language models (LVLMs) through an event ordering task that is decomposed into sub-questions leading to a final question, which quantitatively measures how well a model explores the dialogue history. We collected 1,253 final questions and 6,027 sub-question pairs from 925 videos, constructing continuous multi-turn dialogues. Experimental results show that while the closed-source GPT-4o outperforms other models, most open-source LVLMs struggle with story-centric video understanding. Additionally, our StoryCoT model significantly surpasses open-source LVLMs on SCVBench. SCVBench aims to advance research by comprehensively analyzing the temporal reasoning and comprehension capabilities of LVLMs. Code is available at https://github.com/yuanrr/SCVBench.
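The multi-turn protocol described in the abstract (sub-questions asked in order, then a final question answered with the accumulated dialogue history) can be sketched roughly as follows. This is a minimal illustration, not the SCVBench code: the `StoryItem` schema, the `Model` callable signature, the `evaluate` function, and the exact-match scoring rule are all assumptions made for this example. The actual data format and metrics are defined in the linked repository.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical schema: one benchmark item holds ordered sub-questions
# (each with a reference answer) followed by a single final question.
@dataclass
class StoryItem:
    video_id: str
    sub_questions: List[Tuple[str, str]]  # (question, reference answer)
    final_question: str
    final_answer: str

# Hypothetical model interface: (video_id, dialogue history, question) -> answer.
History = List[Tuple[str, str]]
Model = Callable[[str, History, str], str]


def _match(prediction: str, reference: str) -> bool:
    # Assumed scoring rule: case-insensitive exact match.
    return prediction.strip().lower() == reference.strip().lower()


def evaluate(items: List[StoryItem], model: Model) -> Dict[str, float]:
    """Ask sub-questions in order, carrying the dialogue history forward,
    then ask the final question with the full history available."""
    sub_correct = 0
    sub_total = 0
    final_correct = 0
    for item in items:
        history: History = []
        for question, reference in item.sub_questions:
            answer = model(item.video_id, history, question)
            sub_correct += int(_match(answer, reference))
            sub_total += 1
            # The model's own answer (not the reference) enters the history.
            history.append((question, answer))
        final = model(item.video_id, history, item.final_question)
        final_correct += int(_match(final, item.final_answer))
    return {
        "sub_accuracy": sub_correct / sub_total if sub_total else 0.0,
        "final_accuracy": final_correct / len(items) if items else 0.0,
    }
```

Feeding the model's own earlier answers back into the history, rather than the reference answers, is one possible design choice. It makes the final-question score sensitive to errors made in earlier turns, which is one way to probe how a model uses its dialogue history.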
Sisi You
Bowen Yuan
Bing-Kun Bao
Nanjing University of Posts and Telecommunications
Peng Cheng Laboratory
Tibetan Traditional Medical College
You et al. studied this question.
www.synapsesocial.com/papers/68d469d631b076d99fa670d5 — DOI: https://doi.org/10.24963/ijcai.2025/255