Recent work integrates video foundation models with large language models to build video understanding systems that overcome the limitations of specific vision tasks. Yet existing methods either employ complex spatial-temporal modules or rely heavily on additional perception models to extract temporal features, and they perform well only on short videos. For long videos, the computational complexity and memory cost of long-term temporal connections increase significantly, posing additional challenges. Drawing on the hierarchical memory structure of the Atkinson-Shiffrin memory model, and using Transformer tokens as the carriers of memory, we propose MovieChat, which uses a training-free memory consolidation mechanism to overcome these challenges: it transfers dense frames in short-term memory into sparse tokens in long-term memory by temporally merging adjacent frames. We lift pre-trained large multi-modal models to long video understanding in a zero-shot manner, without additional trainable modules. In our new version, MovieChat+, we further design an enhanced training-free memory consolidation mechanism based on vision-question matching, which better anchors predictions to relevant visual content. MovieChat achieves state-of-the-art performance in long video understanding. We also release the MovieChat-1K benchmark, with 1K long videos, 2K temporal grounding labels, and 14K manual annotations. Resources are available at: https://github.com/rese1f/MovieChat.
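As an illustration of the idea of consolidating memory by temporally merging adjacent frames, the following is a minimal sketch and not the released implementation. It assumes each frame is represented by a single feature vector, that the most similar adjacent pair is chosen by cosine similarity, and that a merged pair is replaced by a weighted average; the abstract specifies none of these details, and the names `cosine` and `consolidate` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def consolidate(frames, target_len):
    """Greedily merge the most similar adjacent frames until target_len remain.

    Assumption: merging is a weighted average, where each entry's weight is
    the number of original frames it already represents.
    """
    if target_len < 1:
        raise ValueError("target_len must be at least 1")
    # Each entry is (feature vector, number of original frames it covers).
    merged = [(list(f), 1) for f in frames]
    while len(merged) > target_len:
        # Similarity of every temporally adjacent pair.
        sims = [
            cosine(merged[i][0], merged[i + 1][0])
            for i in range(len(merged) - 1)
        ]
        # Index of the most similar adjacent pair (first one on ties).
        i = max(range(len(sims)), key=sims.__getitem__)
        (a, wa), (b, wb) = merged[i], merged[i + 1]
        avg = [(x * wa + y * wb) / (wa + wb) for x, y in zip(a, b)]
        # Replace the pair with one entry, preserving temporal order.
        merged[i:i + 2] = [(avg, wa + wb)]
    return [f for f, _ in merged]


# Four dense "short-term" frames become two sparse "long-term" entries.
short_term = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
long_term = consolidate(short_term, target_len=2)
```

Because only adjacent entries are ever merged, the consolidated sequence keeps its temporal order while redundant neighbouring frames collapse into a single entry.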
Enxin Song
Wenhao Chai
Ye Tian
IEEE Transactions on Pattern Analysis and Machine Intelligence
University of Washington
University of Illinois Urbana-Champaign
University of Hong Kong
www.synapsesocial.com/papers/68c1824b9b7b07f3a060e956 — DOI: https://doi.org/10.1109/tpami.2025.3604614