What type of study is this?

This is a quantitative study.

September 20, 2025

SCVBench: A Benchmark with Multi-turn Dialogues for Story-Centric Video Understanding

Key Points

SCVBench demonstrates the need for improved temporal reasoning in video understanding models.
The benchmark includes 1,253 final questions and 6,027 sub-questions based on 925 videos.
Experimental results show GPT-4o outperforms open-source LVLMs in story-centric video tasks.
SCVBench aims to advance research by providing a comprehensive analysis of LVLMs' dialogue exploration capabilities.

Abstract

Video understanding seeks to enable machines to interpret visual content across three levels: action, event, and story. Existing models are limited in their ability to perform high-level long-term story understanding, due to (1) the oversimplified treatment of temporal information and (2) the training bias introduced by action/event-centric datasets. To address this, we introduce SCVBench, a novel benchmark for story-centric video understanding. SCVBench evaluates LVLMs through an event ordering task decomposed into sub-questions leading to a final question, quantitatively measuring historical dialogue exploration. We collected 1,253 final questions and 6,027 sub-question pairs from 925 videos, constructing continuous multi-turn dialogues. Experimental results show that while closed-source GPT-4o outperforms other models, most open-source LVLMs struggle with story-centric video understanding. Additionally, our StoryCoT model significantly surpasses open-source LVLMs on SCVBench. SCVBench aims to advance research by comprehensively analyzing LVLMs' temporal reasoning and comprehension capabilities. Code can be accessed at https://github.com/yuanrr/SCVBench.
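The multi-turn setup described in the abstract (sub-questions leading to a final question, with earlier turns available as dialogue history) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Turn` and `Dialogue` classes, the `run_dialogue` helper, the exact-match scoring, and the toy model are hypothetical and do not reflect SCVBench's actual data format, metric, or code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Turn:
    question: str
    answer: str  # reference answer (hypothetical format)


@dataclass
class Dialogue:
    video_id: str
    sub_questions: List[Turn]
    final_question: Turn


History = List[Tuple[str, str]]  # (question, model answer) pairs from earlier turns
Model = Callable[[str, str, History], str]  # (video_id, question, history) -> answer


def _match(pred: str, ref: str) -> bool:
    # Exact match after normalization; an assumed metric, not the paper's.
    return pred.strip().lower() == ref.strip().lower()


def run_dialogue(model: Model, dialogue: Dialogue, use_history: bool = True) -> Tuple[int, bool]:
    """Ask each sub-question in order, then the final question.

    Returns (number of correct sub-answers, whether the final answer is correct).
    With use_history=False the model sees every question in isolation.
    """
    history: History = []
    sub_correct = 0
    for turn in dialogue.sub_questions:
        visible = list(history) if use_history else []
        pred = model(dialogue.video_id, turn.question, visible)
        sub_correct += int(_match(pred, turn.answer))
        history.append((turn.question, pred))
    visible = list(history) if use_history else []
    final_pred = model(dialogue.video_id, dialogue.final_question.question, visible)
    return sub_correct, _match(final_pred, dialogue.final_question.answer)


def toy_model(video_id: str, question: str, history: History) -> str:
    # Stand-in for an LVLM: it only orders events correctly when it can
    # see both earlier turns in the dialogue history.
    if question.startswith("Order"):
        return "A, B" if len(history) == 2 else "B, A"
    return {"What happens first?": "A", "What happens next?": "B"}[question]


dialogue = Dialogue(
    video_id="vid_001",  # hypothetical identifier
    sub_questions=[Turn("What happens first?", "A"), Turn("What happens next?", "B")],
    final_question=Turn("Order the events.", "A, B"),
)

with_history = run_dialogue(toy_model, dialogue, use_history=True)
without_history = run_dialogue(toy_model, dialogue, use_history=False)
print(with_history, without_history)
```

Comparing the final-question result with and without the history is one simple way to see how much a model relies on the preceding dialogue; the benchmark's own measure of dialogue exploration may be defined differently.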
