What question did this study set out to answer?

To address the weaknesses of video language models in temporal grounding for long-form videos.

June 11, 2026 · Open Access

ChronoQA: Benchmarking Temporal Grounding in Long-Form Video via Timestamp-Anchored Multiple-Choice QA

Key Points

Constructed ChronoQA with 423 long-form YouTube videos (10-90 minutes each)
Developed 897 timestamp-anchored multiple-choice QA pairs across five task categories
Applied multimodel blind filtering to reduce text-only solvability to 17.8%
ChronoQA enables evaluation of answer accuracy and temporal localization quality
Text-only solvability of 17.8% sits close to the 20% chance baseline, indicating that answering requires visual perception
Highlighted the inadequacy of existing benchmarks for temporal evidence supervision
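
The blind-filtering idea in the key points can be illustrated with a minimal sketch. It is not the authors' pipeline: the item fields (question, options, answer), the BlindModel interface, and the max_correct threshold are all hypothetical, and the five-option assumption is inferred only from the reported 20% chance baseline.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: a "blind" model sees only the question text and the
# answer options (never the video) and returns the index of its chosen option.
BlindModel = Callable[[str, Sequence[str]], int]


def blind_correct_count(item: Dict, models: List[BlindModel]) -> int:
    """Count how many text-only models pick the correct option for one item."""
    return sum(
        1
        for model in models
        if model(item["question"], item["options"]) == item["answer"]
    )


def blind_filter(
    items: List[Dict], models: List[BlindModel], max_correct: int = 0
) -> List[Dict]:
    """Keep items that at most `max_correct` text-only models answer correctly.

    The threshold is an assumption; the source does not state the rule used.
    """
    return [it for it in items if blind_correct_count(it, models) <= max_correct]


def text_only_solvability(items: List[Dict], model: BlindModel) -> float:
    """Accuracy of one text-only model on a set of items (0.0 if empty)."""
    if not items:
        return 0.0
    hits = sum(
        1
        for it in items
        if model(it["question"], it["options"]) == it["answer"]
    )
    return hits / len(items)


def chance_baseline(num_options: int) -> float:
    """Expected accuracy of uniform random guessing over `num_options`."""
    return 1.0 / num_options
```

Under these assumptions, a filtered benchmark is considered hard to solve from text alone when text_only_solvability stays near chance_baseline(5), which is 0.2 for five options.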

Abstract

The expanding use of large language models in intelligent workflows has triggered rapid advancement in Multimodal Large Language Models (MLLMs). Despite their success with short-form video, existing models still fall short when tasked with long-duration sequences. Temporal grounding, which involves localizing when an event occurs within a video, is a persistent weakness of video language models (VLMs), in part because existing training data provides answer supervision but lacks temporal evidence supervision. Without it, artificial intelligence cannot reliably assist with tasks such as medical surgery review, security monitoring, or analyzing long-form video data. Yet, existing benchmarks evaluate answer correctness without verifying whether a model identifies the correct supporting temporal evidence. We introduce ChronoQA, a multiple-choice QA benchmark that pairs every correct answer with a timestamp interval, providing an explicit grounding target that other video QA datasets omit, and enabling evaluation of answer accuracy and temporal localization quality. We construct ChronoQA from 423 long-form YouTube videos (10-90 minutes each), yielding 897 timestamp-anchored multiple-choice QA pairs across five task categories: Action Understanding, Needle-in-a-Haystack, Narrative Understanding, Causal Understanding, and Ordering. Through multimodel blind filtering, we constrain text-only solvability to 17.8%, near the 20% random baseline, ensuring that questions require visual perception rather than being solvable through linguistic bias or visual information leakage.
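
Because each correct answer is paired with a timestamp interval, answer accuracy and localization quality can be scored together. The sketch below uses temporal intersection-over-union (IoU), a common choice for interval overlap; the abstract does not name the metric ChronoQA uses, so the IoU scoring, the 0.5 threshold, and the record fields (answer, interval) are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Interval, gold: Interval) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    pred_start, pred_end = pred
    gold_start, gold_end = gold
    intersection = max(0.0, min(pred_end, gold_end) - max(pred_start, gold_start))
    union = (pred_end - pred_start) + (gold_end - gold_start) - intersection
    return intersection / union if union > 0 else 0.0


def score(
    predictions: List[Dict], references: List[Dict], iou_threshold: float = 0.5
) -> Dict[str, float]:
    """Report answer accuracy, mean temporal IoU, and grounded accuracy.

    Grounded accuracy counts an item only when the chosen option is correct
    AND the predicted interval overlaps the reference by at least the
    threshold. The threshold value is an assumption, not taken from the paper.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    n = len(references)
    if n == 0:
        return {"accuracy": 0.0, "mean_iou": 0.0, "grounded_accuracy": 0.0}

    correct = 0
    grounded = 0
    iou_sum = 0.0
    for pred, ref in zip(predictions, references):
        iou = temporal_iou(pred["interval"], ref["interval"])
        iou_sum += iou
        if pred["answer"] == ref["answer"]:
            correct += 1
            if iou >= iou_threshold:
                grounded += 1

    return {
        "accuracy": correct / n,
        "mean_iou": iou_sum / n,
        "grounded_accuracy": grounded / n,
    }
```

Reporting grounded accuracy alongside plain accuracy separates models that pick the right option from models that also point to the supporting moment in the video.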

Cite This Study

Oluwatumininu Oguntola studied this question.

synapsesocial.com/papers/6a2a508980c8f91e7f39d13b https://doi.org/10.17615/5p3g-3312
