Abstract
The explosive proliferation of long-form video has exposed fundamental limitations in current AI systems, particularly in sustaining coherent temporal reasoning under stringent computational budgets. This survey synthesizes recent progress in long video understanding (LVU) with a deliberate focus on training-free methods, approaches that forgo retraining pretrained base models and instead construct dedicated LVU pipelines to maximally leverage their latent capabilities. We identify three core challenges that structure this landscape: (i) pervasive visual-token redundancy that inflates computation while contributing marginal information gain, (ii) constrained context windows that fragment temporal and semantic coherence, and (iii) the requirement for robust multi-modal reasoning across expansive temporal horizons. To address these challenges, existing solutions can be organized into three methodological paradigms. First, selection-based approaches target redundancy via semantically informed sampling and compression. Second, memory-enhanced architectures expand the effective temporal receptive field through hierarchical representations and streaming-aware mechanisms. Third, agent-based reasoning frameworks reconceptualize LVU as an active, iterative process driven by explicit goals and context-sensitive inference, enabling dynamic control over when and how video segments are processed. In parallel, we examine a growing specialization of evaluation protocols intended to assess LVU systems holistically. These protocols coalesce into three principal categories: general-purpose benchmarks that probe foundational, cross-task capabilities; reasoning-centric benchmarks that stress-test higher-order competencies, such as spatio-temporal abstraction, causal and physical reasoning, and compositional understanding; and egocentric benchmarks that emphasize first-person perception and action reasoning within dynamically evolving real-world environments.
Collectively, these benchmarks illuminate both the representational fidelity and the reasoning depth required for robust deployment. Our findings indicate a pivotal inflection in the field: ad hoc, engineering-driven heuristics are giving way to principled, end-to-end frameworks that integrate scalable computation with cognitively grounded reasoning. Rather than merely amplifying base-model capacity, this shift advances a coherent agenda centered on pragmatic, auditable, and resource-aware LVU pipelines atop pretrained models, thereby extending LVU to professional analytics, embodied human–AI interaction, and safety-critical decision making.
Jingren Liu
Yun Wang
Long Zhang
Vicinagearth.
Liu et al. (Thu,) studied this question.
www.synapsesocial.com/papers/68e97a43edb160cc8d84e74c — DOI: https://doi.org/10.1007/s44336-025-00017-w