What question did this study set out to answer?

The aim is to improve long video understanding in MLLMs by addressing redundancy issues and enhancing spatiotemporal reasoning.

March 15, 2026

SELongVLM: Empowering Long Video Language Models with Self-Corrective Clip Selection

Key points

Introduces SELongVLM, a long video language model built around two coordinated selection branches.
Uses a Residual Token Pruner (RTP) to remove repetitive background tokens while preserving motion cues.
Uses a Semantic-aware Self-Correction Selector (SCSelector) to refine query-relevant clip selection without frame-level annotations.
Adds an action-aware operation for intra-clip dynamics and a temporal memory for cross-clip context.
Achieves 65.5% on VideoMME and 69.8% on MLVU among general benchmarks.
Delivers strong performance on four specialized benchmarks, including 39.2% on TOMATO (fine-grained temporal reasoning) and 69.2% on EventBench (event-level understanding).
Outperforms existing models on both general and specialized long-video tasks across eight benchmarks.

Abstract

Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in visual-language reasoning, yet long-video understanding remains a formidable challenge due to the need for coherent reasoning over ultra-long spatiotemporal dependencies. Existing methods struggle with the vast candidate space for relevant information in long videos, often failing to distinguish meaningful events from redundant content. We identify two critical and previously under-explored issues: absolute redundancy, where static visual content inflates token counts without adding narrative value, and relative redundancy, where task-irrelevant segments introduce noise that impairs reasoning. Compounding these issues is the weak spatiotemporal modeling in current MLLMs, which limits their ability to capture complex event dynamics. To address these multifaceted challenges, we introduce SELongVLM, a long video language model with dynamic lenient-to-stringent selection. SELongVLM integrates two coordinated branches. The first is a Residual Token Pruner (RTP) that removes repetitive background tokens via inter-frame residual modeling, thus mitigating absolute redundancy while preserving motion cues. The second is a Semantic-aware Self-Correction Selector (SCSelector) that progressively refines query-relevant clip selection without frame-level annotations to reduce relative redundancy, guided by a stringent-to-lenient self-correcting mechanism during optimization. To ensure causal continuity and bolster spatiotemporal reasoning across disjoint clips, the framework further incorporates an action-aware operation for intra-clip dynamics and a temporal memory for cross-clip context, enabling robust spatiotemporal inference on long videos. Extensive experiments across eight benchmarks demonstrate that SELongVLM markedly outperforms existing models on both general and specialized long-video tasks. Specifically, it achieves 65.5% on VideoMME and 69.8% on MLVU for general benchmarks, and delivers strong performance on four specialized benchmarks, for example 39.2% on TOMATO for fine-grained temporal reasoning and 69.2% on EventBench for event-level understanding.
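
To make the two kinds of redundancy concrete, here is a minimal Python sketch of the general ideas the abstract names: dropping tokens whose inter-frame residual is small (absolute redundancy) and keeping only the clips most similar to the query (relative redundancy). It is not the authors' implementation. The function names, the L2 residual with a fixed threshold, and the cosine-similarity top-k ranking are all assumptions made for illustration, and the learned parts of SELongVLM (self-correction during optimization, the action-aware operation, the temporal memory) are not modeled.

```python
import math


def prune_static_tokens(frames, threshold):
    """Hypothetical residual-based pruning (illustrative assumption).

    frames: list of frames; each frame is a list of token vectors.
    Keeps every token of the first frame, then keeps a later token only
    if its L2 distance to the same position in the previous frame
    exceeds `threshold`. Returns the kept (frame_index, token_index) pairs.
    """
    kept = [(0, i) for i in range(len(frames[0]))]
    for t in range(1, len(frames)):
        for i, (cur, prev) in enumerate(zip(frames[t], frames[t - 1])):
            residual = math.sqrt(sum((c - p) ** 2 for c, p in zip(cur, prev)))
            if residual > threshold:  # token changed, so it may carry motion
                kept.append((t, i))
    return kept


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_clips(clip_embeddings, query_embedding, k):
    """Hypothetical query-relevant clip selection (illustrative assumption).

    Ranks clips by cosine similarity to the query, keeps the top k,
    and returns their indices in temporal order.
    """
    scores = [cosine(c, query_embedding) for c in clip_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Token 0 is static background; token 1 moves once between frames 0 and 1.
    frames = [
        [[0.0, 0.0], [1.0, 1.0]],
        [[0.0, 0.0], [2.0, 1.0]],
        [[0.0, 0.0], [2.0, 1.0]],
    ]
    print(prune_static_tokens(frames, threshold=0.5))

    clips = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    print(select_clips(clips, query_embedding=[1.0, 0.0], k=2))
```

In this toy input the static background token survives only in the first frame, while the moving token is kept again when it changes. Returning the selected clip indices in temporal order is a design choice of the sketch, made so that any downstream reasoning sees events in the order they occurred.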
