Video Question Answering (VideoQA) requires a deep understanding of dynamic video content, integrating spatial reasoning, temporal dependencies, and language comprehension. Existing methods often struggle with long or semantically complex videos because they lack question-guided keyframe weighting and question-aligned cross-modal description generation. To address these challenges, we propose ETR (Event-centric Temporal Reasoning), an adaptive framework for VideoQA. ETR introduces three key mechanisms: (i) a hierarchical weight adjustment selector that identifies questions requiring event-centric temporal reasoning; (ii) a T-Route that segments videos into semantically coherent events and dynamically adjusts keyframe weights according to question intent; and (iii) a question-conditioned prompting strategy that focuses on key objects to generate textual prompts aligned with the question’s semantics. This hierarchical and adaptive design balances visual and textual information, enhances temporal reasoning, and improves object-centric alignment. Experiments on two datasets demonstrate that ETR achieves competitive performance in fine-grained, question-aware VideoQA.
Pan et al. studied this question.
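To make the idea of question-guided keyframe weighting concrete, the following is a minimal sketch and not the ETR implementation. It assumes that the question and each keyframe have already been embedded into a shared vector space, and it weights frames by a softmax over cosine similarities. The function names (`cosine`, `keyframe_weights`) and the `temperature` parameter are hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def keyframe_weights(question_emb, frame_embs, temperature=0.1):
    """Assumed weighting scheme: softmax over question-frame cosine similarities.

    A lower temperature concentrates weight on the frames most similar
    to the question; a higher one spreads weight more evenly.
    """
    if not frame_embs:
        return []
    sims = [cosine(question_emb, f) for f in frame_embs]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(sims)
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


# Toy usage with 2-D embeddings (illustrative values only).
question = [1.0, 0.0]
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
weights = keyframe_weights(question, frames)
print(weights)
```

In this sketch the weights sum to one and the frame best aligned with the question receives the largest share. How ETR itself computes and applies keyframe weights within events is not specified in the abstract.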