The expanding use of large language models in intelligent workflows has driven rapid progress in Multimodal Large Language Models (MLLMs). Despite their success on short-form video, existing models still fall short on long-duration sequences. Temporal grounding, the task of localizing when an event occurs within a video, is a persistent weakness of video language models (VLMs), in part because existing training data supervises the answer but not the temporal evidence behind it. Without reliable temporal grounding, such models cannot dependably assist with tasks such as surgical video review, security monitoring, or the analysis of long-form video data. Moreover, existing benchmarks evaluate answer correctness without verifying whether a model identifies the correct supporting temporal evidence. We introduce ChronoQA, a multiple-choice QA benchmark that pairs every correct answer with a timestamp interval, providing an explicit grounding target that other video QA datasets omit and enabling evaluation of both answer accuracy and temporal localization quality. We construct ChronoQA from 423 long-form YouTube videos (10-90 minutes each), yielding 897 timestamp-anchored multiple-choice QA pairs across five task categories: Action Understanding, Needle-in-a-Haystack, Narrative Understanding, Causal Understanding, and Ordering. Through multi-model blind filtering, we constrain text-only solvability to 17.8%, near the 20% random baseline, so that questions require visual perception rather than being answerable through linguistic bias or visual information leakage.
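To make the idea of scoring answer accuracy and temporal localization side by side concrete, the following is a minimal sketch. It assumes temporal intersection-over-union (tIoU) as the localization measure and a 0.5 threshold for counting a prediction as grounded; the text above does not name the metric ChronoQA uses, so both choices, along with the names `temporal_iou` and `score_item`, are illustrative assumptions rather than the benchmark's actual protocol.

```python
from typing import Dict, Tuple, Union

# An interval is (start_seconds, end_seconds) within a video.
Interval = Tuple[float, float]


def temporal_iou(pred: Interval, gold: Interval) -> float:
    """Intersection-over-union of two time intervals (assumed metric)."""
    p_start, p_end = pred
    g_start, g_end = gold
    if p_end < p_start or g_end < g_start:
        raise ValueError("interval end must not precede its start")
    # Overlap is zero when the intervals are disjoint.
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0


def score_item(
    pred_choice: str,
    gold_choice: str,
    pred_interval: Interval,
    gold_interval: Interval,
    iou_threshold: float = 0.5,  # hypothetical threshold, not from the source
) -> Dict[str, Union[bool, float]]:
    """Score one multiple-choice item on both the answer and its evidence."""
    tiou = temporal_iou(pred_interval, gold_interval)
    return {
        "answer_correct": pred_choice == gold_choice,
        "tiou": tiou,
        "grounded": tiou >= iou_threshold,
    }
```

Under this sketch, a model that selects the right option but points to an unrelated part of the video would be marked correct yet ungrounded, which is the failure that answer-only evaluation cannot detect.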
Oluwatumininu Oguntola (Tue,) studied this question.