May 31, 2026 | Open Access

Efficient Models for Long-Range Video Understanding

Md Mohaiminul Islam, University of North Carolina at Chapel Hill

Key Points

The research develops efficient models for long-range video understanding, targeting the complex temporal dependencies in videos that span minutes to hours.
ViS4mer combines Transformer encoders with structured state-space sequence (S4) decoders for efficient recognition in videos of several minutes.
TranS4mer detects movie scenes with a hybrid self-attention and state-space architecture, improving performance at lower computational cost.
BIMBA, a multimodal LLM for video question answering, uses a bidirectional selective-scan mechanism to compress video inputs 16x; Video ReCap generates descriptions at multiple temporal granularities.
ViS4mer achieved state-of-the-art performance on several long-form video benchmarks while being 2.6x faster and 8x more memory-efficient than Transformer baselines.
TranS4mer outperformed previous models across multiple benchmarks with reduced computation.
BIMBA produced compact video representations achieving top performance on long-range QA benchmarks.

Abstract

Modern video understanding models excel on short clips of a few seconds but remain limited for real-world applications such as YouTube videos, movies, and egocentric recordings, which often span minutes to hours. These scenarios demand reasoning over complex temporal dependencies, presenting both algorithmic and computational challenges. This dissertation introduces a series of efficient models for long-range video understanding, advancing the state of the art across five tasks of increasing complexity and at progressively greater temporal scales. I first present ViS4mer, an efficient recognition model for videos of several minutes that integrates Transformer encoders with structured state-space sequence (S4) decoders, combining short-range spatiotemporal modeling with scalable long-range reasoning. ViS4mer is 2.6x faster and 8x more memory-efficient than Transformer baselines while achieving state-of-the-art performance on several long-form video benchmarks. Extending this idea, I propose TranS4mer, a movie scene detection model that employs a hybrid self-attention and state-space architecture, surpassing prior methods across multiple benchmarks at substantially lower computation cost. To scale video understanding to hour-long videos, I introduce BIMBA, an efficient multimodal large language model for video question answering that employs a bidirectional selective-scan mechanism to compress video inputs 16x, producing compact representations that achieve state-of-the-art performance on diverse long-range video QA benchmarks. I then present Video ReCap, a recursive captioning framework that generates descriptions at multiple temporal granularities - from atomic actions in short clips to summaries of hour-long videos - with an architecture that is theoretically unbounded in the duration it can process. 
Finally, I introduce VidAssist, a framework for goal-oriented planning in instructional videos that adopts a Socratic approach, converting video into textual form and leveraging LLMs through a Propose-Assess-Search paradigm to outperform fully supervised methods without any task-specific training. Collectively, these five contributions advance long-range video understanding across a broad and complementary set of tasks, demonstrating that efficient architectures grounded in state-space modeling, selective compression, recursive generation, and LLM-guided reasoning can overcome the computational barriers that have historically constrained video AI to short temporal windows. The work presented in this dissertation establishes principled foundations for building video understanding systems that operate at the temporal scales of human experience.
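
The recursive captioning idea in the abstract can be illustrated with a minimal sketch. This is not the Video ReCap architecture: the function `recursive_captions`, the `group_size` parameter, and the `join_summarizer` stand-in are all hypothetical names introduced here, and a real system would replace the stand-in with a learned captioning model that also sees video features. The sketch only shows why recursion over lower-level captions places no fixed bound on input duration: each level summarizes small groups from the level below until one summary remains.

```python
from typing import Callable, List


def recursive_captions(
    clip_captions: List[str],
    summarize: Callable[[List[str]], str],
    group_size: int = 4,
) -> List[List[str]]:
    """Build caption levels bottom-up from short-clip captions.

    Level 0 holds the clip captions. Each higher level summarizes
    consecutive groups of `group_size` captions from the level below,
    and the loop stops once a level contains a single summary.
    `summarize` is a hypothetical stand-in for a captioning model.
    """
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    levels = [list(clip_captions)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append(
            [summarize(prev[i:i + group_size]) for i in range(0, len(prev), group_size)]
        )
    return levels


def join_summarizer(texts: List[str]) -> str:
    """Toy summarizer (assumption): joins its inputs instead of calling a model."""
    return " / ".join(texts)


# Eight clip captions collapse into two segment descriptions, then one summary.
captions = [f"clip {i}" for i in range(8)]
levels = recursive_captions(captions, join_summarizer, group_size=4)
for depth, level in enumerate(levels):
    print(depth, len(level))
```

Because every call to `summarize` receives at most `group_size` items, the cost per call stays constant as the video grows; only the number of levels increases, roughly logarithmically in the number of clips.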
