Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in visual-language reasoning, yet long-video understanding remains a formidable challenge due to the need for coherent reasoning over ultra-long spatiotemporal dependencies. Existing methods struggle with the vast candidate space for relevant information in long videos, often failing to distinguish meaningful events from redundant content. We identify two critical and previously under-explored issues: absolute redundancy, where static visual content inflates token counts without adding narrative value, and relative redundancy, where task-irrelevant segments introduce noise that impairs reasoning. Compounding these issues is the weak spatiotemporal modeling in current MLLMs, which limits their ability to capture complex event dynamics. To address these multifaceted challenges, we introduce SELongVLM, a dynamically lenient-to-stringent selection long-video language model. SELongVLM integrates two coordinated branches: a Residual Token Pruner (RTP) that removes repetitive background tokens via inter-frame residual modeling, thus mitigating absolute redundancy while preserving motion cues, and a Semantic-aware Self-Correction Selector (SCSelector) that progressively refines query-relevant clip selection without frame-level annotations to reduce relative redundancy, guided by a stringent-to-lenient self-correcting mechanism during optimization. To ensure causal continuity and bolster spatiotemporal reasoning across disjoint clips, the framework further incorporates an action-aware operation for intra-clip dynamics and a temporal memory for cross-clip context, enabling robust spatiotemporal inference on long videos. Extensive experiments across eight benchmarks demonstrate that SELongVLM markedly outperforms existing models on both general and specialized long-video tasks.
Specifically, it achieves 65.5% on VideoMME and 69.8% on MLVU among the general benchmarks, and delivers strong performance on four specialized benchmarks, for example 39.2% on TOMATO for fine-grained temporal reasoning and 69.2% on EventBench for event-level understanding.
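The idea behind pruning by inter-frame residuals can be illustrated with a minimal sketch. This is not the paper's RTP implementation: the function name `prune_static_tokens`, the L2-norm criterion, the fixed `threshold`, and the comparison against the immediately preceding frame are all assumptions made for illustration. It only shows how tokens that barely change between frames (static background) could be dropped while changing tokens (motion) are kept.

```python
import math
from typing import List, Tuple

Token = List[float]
Frame = List[Token]  # one feature vector per spatial patch


def prune_static_tokens(frames: List[Frame], threshold: float) -> List[Tuple[int, int]]:
    """Return (frame_index, patch_index) pairs for the tokens to keep.

    Hypothetical rule (an assumption, not the paper's method): keep every token
    of the first frame; in later frames keep a token only if it differs from
    the same patch in the previous frame by more than `threshold` in L2 norm.
    """
    kept: List[Tuple[int, int]] = []
    for t, frame in enumerate(frames):
        for p, token in enumerate(frame):
            if t == 0:
                kept.append((t, p))  # the first frame has no predecessor
                continue
            residual = math.dist(token, frames[t - 1][p])
            if residual > threshold:
                kept.append((t, p))  # the patch changed: likely a motion cue
    return kept


# Toy video: 3 frames, 2 patches each; patch 0 is static, patch 1 moves.
video = [
    [[1.0, 1.0], [0.0, 0.0]],
    [[1.0, 1.0], [0.5, 0.0]],
    [[1.0, 1.0], [1.0, 0.0]],
]
kept = prune_static_tokens(video, threshold=0.1)
```

In this toy input the static patch survives only in the first frame, while the moving patch is kept in every frame, so four of the six tokens remain.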
Kecheng Zhang
Zongxin Yang
Mingfei Han
IEEE Transactions on Pattern Analysis and Machine Intelligence
University of Science and Technology of China
Dalian University of Technology
Dana-Farber/Harvard Cancer Center
Zhang et al. studied this question.
www.synapsesocial.com/papers/69b64ccdb42794e3e660def6 — DOI: https://doi.org/10.1109/tpami.2026.3673141