Modern video understanding models excel on short clips of a few seconds but remain limited in real-world applications such as YouTube videos, movies, and egocentric recordings, which often span minutes to hours. These scenarios demand reasoning over complex temporal dependencies, presenting both algorithmic and computational challenges. This dissertation introduces a series of efficient models for long-range video understanding, advancing the state of the art across five tasks of increasing complexity and at progressively greater temporal scales. I first present ViS4mer, an efficient recognition model for videos of several minutes that integrates Transformer encoders with structured state-space sequence (S4) decoders, combining short-range spatiotemporal modeling with scalable long-range reasoning. ViS4mer is 2.6x faster and 8x more memory-efficient than Transformer baselines while achieving state-of-the-art performance on several long-form video benchmarks. Extending this idea, I propose TranS4mer, a movie scene detection model that employs a hybrid self-attention and state-space architecture, surpassing prior methods across multiple benchmarks at substantially lower computational cost. To scale video understanding to hour-long videos, I introduce BIMBA, an efficient multimodal large language model for video question answering that employs a bidirectional selective-scan mechanism to compress video inputs 16x, producing compact representations that achieve state-of-the-art performance on diverse long-range video QA benchmarks. I then present Video ReCap, a recursive captioning framework that generates descriptions at multiple temporal granularities, from atomic actions in short clips to summaries of hour-long videos, with an architecture that is theoretically unbounded in the duration it can process. Finally, I introduce VidAssist, a framework for goal-oriented planning in instructional videos that adopts a Socratic approach, converting video into textual form and leveraging LLMs through a Propose-Assess-Search paradigm to outperform fully supervised methods without any task-specific training. Collectively, these five contributions advance long-range video understanding across a broad and complementary set of tasks, demonstrating that efficient architectures grounded in state-space modeling, selective compression, recursive generation, and LLM-guided reasoning can overcome the computational barriers that have historically constrained video AI to short temporal windows. The work presented in this dissertation establishes principled foundations for building video understanding systems that operate at the temporal scales of human experience.
Md Mohaiminul Islam
University of North Carolina at Chapel Hill
synapsesocial.com/papers/6a1bd1db5783ba022b6fd4d9 — DOI: https://doi.org/10.17615/th13-ky27