What type of study is this?

This is an experimental study.

October 13, 2025 | Open Access

Bridging Vision Language Models and Symbolic Grounding for Video Question Answering

Key Points

SG-VLM improves causal reasoning in video question answering and makes that reasoning more interpretable.
Across benchmarks such as NExT-QA and iVQA, SG-VLM improves the temporal reasoning of the underlying VLMs.
By integrating scene graphs with vision language models, the framework grounds video understanding in structured object-relation representations.
The findings show that symbolic grounding has potential, but they also highlight its current limitations: gains over strong VLMs are limited.

Abstract

Video Question Answering (VQA) requires models to reason over spatial, temporal, and causal cues in videos. Recent vision language models (VLMs) achieve strong results but often rely on shallow correlations, leading to weak temporal grounding and limited interpretability. We study symbolic scene graphs (SGs) as intermediate grounding signals for VQA. SGs provide structured object-relation representations that complement VLMs' holistic reasoning. We introduce SG-VLM, a modular framework that integrates frozen VLMs with scene graph grounding via prompting and visual localization. Across three benchmarks (NExT-QA, iVQA, ActivityNet-QA) and multiple VLMs (QwenVL, InternVL), SG-VLM improves causal and temporal reasoning and outperforms prior baselines, though gains over strong VLMs are limited. These findings highlight both the promise and current limitations of symbolic grounding, and offer guidance for future hybrid VLM-symbolic approaches in video understanding.
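
To make the idea of grounding a frozen VLM "via prompting" more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a scene graph is available as a list of (subject, relation, object) triples tagged with a frame index, and it serializes them into a text prompt that could accompany a question. The names `Triple`, `serialize_scene_graph`, and `build_prompt`, as well as the prompt wording, are hypothetical; the abstract does not specify the actual format used by SG-VLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One scene-graph edge observed at a given frame (hypothetical schema)."""
    subject: str
    relation: str
    obj: str
    frame: int


def serialize_scene_graph(triples):
    """Render triples as one text line each, ordered by frame index."""
    ordered = sorted(triples, key=lambda t: t.frame)  # stable sort keeps ties in input order
    return "\n".join(
        f"[frame {t.frame}] {t.subject} {t.relation} {t.obj}" for t in ordered
    )


def build_prompt(question, triples):
    """Combine the serialized graph and the question into a single prompt string.

    The wording here is an assumption made for illustration only.
    """
    graph_text = serialize_scene_graph(triples)
    return (
        "Scene graph (subject relation object, ordered by frame):\n"
        f"{graph_text}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Example usage with made-up triples.
example_triples = [
    Triple("dog", "chases", "ball", 12),
    Triple("boy", "throws", "ball", 3),
]
example_prompt = build_prompt("Why did the dog run?", example_triples)
print(example_prompt)
```

Ordering the triples by frame is one simple way to expose temporal structure to a text-only prompt; the resulting string would be passed to the frozen VLM alongside the video frames, with no model weights updated.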
