What type of study is this?

This is a quantitative study.

October 16, 2025 · Open Access

Task-Aware KV Compression For Cost-Effective Long Video Understanding

Key Points

Video-X²L significantly improves long-video understanding performance while reducing computation cost.
The method employs bi-level KV compression with low-compression and high-compression KVs to balance detail and compactness.
Selective KV re-loading allows the model to utilize critical information efficiently during video processing.
Evaluation on VideoMME, MLVU, LongVideoBench, and VNBench shows that Video-X²L outperforms existing KV-compression methods.

Abstract

Long-video understanding (LVU) remains a severe challenge for existing multimodal large language models (MLLMs), primarily due to the prohibitive computational cost. Recent approaches have explored KV compression to mitigate this issue, but they often suffer from significant information loss at high compression ratios. In this paper, we introduce Video-X²L, which flexibly preserves critical video information for each LVU task. Video-X²L involves two key operations. The first is bi-level KV compression. During the MLLM's pre-filling stage, Video-X²L generates two types of compressed KVs: low-compression KVs (L-KVs) to capture fine-grained video details and high-compression KVs (H-KVs) to offer compact video representations. The second is selective KV re-loading. During the MLLM's decoding stage, Video-X²L selectively re-loads L-KVs for the most critical video chunks while using H-KVs for the other, less important ones. This allows the MLLM to fully utilize task-specific information while maintaining overall compactness. Video-X²L is simple yet effective: it requires no additional training and is directly compatible with existing KV-compressible MLLMs. We evaluate Video-X²L on a variety of popular LVU benchmarks, including VideoMME, MLVU, LongVideoBench, and VNBench. Our experimental results show that Video-X²L outperforms existing KV-compression methods by a large margin while substantially reducing computation cost.
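
The two operations described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how KVs are compressed or how chunk importance is scored, so the code below assumes mean pooling as a stand-in compressor and a dot-product relevance score against a query vector. All function names (`mean_pool`, `bi_level_compress`, `chunk_score`, `selective_reload`) and the ratio and `top_k` parameters are hypothetical, and each KV entry is simplified to a single plain vector rather than per-layer key/value tensors.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(tokens: Sequence[Vector], ratio: int) -> List[Vector]:
    """Assumed stand-in compressor: average every `ratio` consecutive vectors."""
    pooled = []
    for start in range(0, len(tokens), ratio):
        group = tokens[start:start + ratio]
        dim = len(group[0])
        pooled.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return pooled


def bi_level_compress(
    chunks: Sequence[Sequence[Vector]], low_ratio: int, high_ratio: int
) -> Tuple[List[List[Vector]], List[List[Vector]]]:
    """Pre-filling stage: build both L-KVs (detailed) and H-KVs (compact) per chunk."""
    l_kvs = [mean_pool(chunk, low_ratio) for chunk in chunks]
    h_kvs = [mean_pool(chunk, high_ratio) for chunk in chunks]
    return l_kvs, h_kvs


def chunk_score(query: Vector, kv: Sequence[Vector]) -> float:
    """Assumed relevance score: best dot product between the query and any entry."""
    return max(sum(q * x for q, x in zip(query, vec)) for vec in kv)


def selective_reload(
    query: Vector,
    l_kvs: Sequence[List[Vector]],
    h_kvs: Sequence[List[Vector]],
    top_k: int,
) -> Tuple[List[Vector], List[int]]:
    """Decoding stage: use L-KVs for the top_k most relevant chunks, H-KVs elsewhere."""
    scores = [chunk_score(query, h) for h in h_kvs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected = set(ranked[:top_k])
    cache: List[Vector] = []
    for i in range(len(h_kvs)):  # keep chunks in their original temporal order
        cache.extend(l_kvs[i] if i in selected else h_kvs[i])
    return cache, sorted(selected)


if __name__ == "__main__":
    # Three toy chunks of four 2-dimensional token vectors each.
    chunks = [
        [[1.0, 0.0]] * 4,
        [[0.0, 1.0]] * 4,
        [[0.5, 0.5]] * 4,
    ]
    l_kvs, h_kvs = bi_level_compress(chunks, low_ratio=2, high_ratio=4)
    cache, selected = selective_reload([0.0, 1.0], l_kvs, h_kvs, top_k=1)
    print("re-loaded chunks:", selected)
    print("cache entries:", len(cache))
```

In this toy setting, only the chunk that best matches the query contributes its more detailed L-KV entries, while the remaining chunks contribute a single H-KV entry each, so the assembled cache stays much smaller than it would be with L-KVs everywhere.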
