September 10, 2025 | Open Access

MovieChat+: Question-aware Sparse Memory for Long Video Question Answering

Key Points

MovieChat+ achieves state-of-the-art performance in long video understanding, significantly improving prediction accuracy.
The method utilizes a hierarchical memory structure inspired by the Atkinson-Shiffrin memory model to manage long temporal connections.
By adopting a zero-shot approach with pre-trained large multi-modal models, the framework avoids the need for additional trainable modules.
The newly released MovieChat-1K benchmark provides valuable resources for advancing research in long video question answering.

Abstract

Recent work integrates video foundation models and large language models to build video understanding systems that overcome the limitations of specific vision tasks. Yet existing methods either employ complex spatial-temporal modules or rely heavily on additional perception models to extract temporal features, and they perform well only on short videos. For long videos, the computational complexity and memory costs associated with long-term temporal connections increase significantly, posing additional challenges. To overcome these challenges, we draw on the hierarchical memory structure of the Atkinson-Shiffrin memory model, using Transformer tokens as the carriers of memory, and propose MovieChat with a training-free memory consolidation mechanism that transfers dense frames from short-term memory into sparse tokens in long-term memory by temporally merging adjacent frames. We extend pre-trained large multi-modal models to long video understanding in a zero-shot manner, without additional trainable modules. In the new version, MovieChat+, we design an enhanced training-free memory consolidation mechanism based on vision-question matching to better anchor predictions to relevant visual content. MovieChat achieves state-of-the-art performance in long video understanding. We also release the MovieChat-1K benchmark, with 1K long videos, 2K temporal grounding labels, and 14K manual annotations. Resources are available at: https://github.com/rese1f/MovieChat.
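
The consolidation step described in the abstract (dense short-term frames merged into a sparse long-term memory) can be illustrated with a small sketch. This is not the authors' implementation: the abstract only states that adjacent frames are merged temporally, so the cosine-similarity criterion, the greedy merge order, the weighted averaging, and the names `cosine` and `consolidate` are all assumptions made for illustration. Frames are represented here as plain lists of floats rather than real visual tokens.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def consolidate(short_term, target_len):
    """Shrink a list of frame features to at most target_len entries.

    Assumed strategy (not confirmed by the source): repeatedly find the
    most similar pair of temporally adjacent frames and replace it with
    a weighted average, so that redundant neighbours collapse first.
    """
    if target_len < 1:
        raise ValueError("target_len must be at least 1")
    frames = [list(f) for f in short_term]
    weights = [1] * len(frames)  # how many original frames each entry covers
    while len(frames) > target_len:
        sims = [cosine(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
        i = max(range(len(sims)), key=sims.__getitem__)
        w1, w2 = weights[i], weights[i + 1]
        merged = [(w1 * x + w2 * y) / (w1 + w2)
                  for x, y in zip(frames[i], frames[i + 1])]
        frames[i:i + 2] = [merged]
        weights[i:i + 2] = [w1 + w2]
    return frames


# Four toy "frames": two near-identical scenes, each repeated once.
short_term_memory = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
long_term_memory = consolidate(short_term_memory, target_len=2)
print(long_term_memory)
```

In this toy setting the two repeated scenes collapse into one entry each, which is the intuition behind storing long videos as sparse memory. A question-aware variant in the spirit of MovieChat+ would presumably also take the question into account when deciding what to merge, but the abstract does not give enough detail to sketch that step faithfully.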

Cite This Study

Song et al. (September 10, 2025). MovieChat+: Question-aware Sparse Memory for Long Video Question Answering.

synapsesocial.com/papers/68c1824b9b7b07f3a060e956
https://doi.org/10.1109/tpami.2025.3604614
