Transformer-based Large Language Models (LLMs) depend heavily on the KV cache to handle long context sequences efficiently. However, the size of the KV cache grows linearly with the input sequence length, placing increasing pressure on system memory, compute, and bandwidth, and raising latency during decoding. Although recent research has proposed various techniques to compress the KV cache, targeting either storage or computational efficiency, few methods achieve both simultaneously. Additionally, existing methods rely primarily on heuristics, lack comprehensive insight into token selection criteria, and often significantly compromise model accuracy under strict KV cache token budgets (e.g., keeping 512 tokens). Building upon our recent work, RocketKV, this paper introduces EMPIRIC, an oracle-based vision study that explicitly defines theoretical bounds for accuracy, computation, and storage in KV cache compression. By analyzing intrinsic patterns in KV cache attention heads, EMPIRIC provides novel insights into effective token pruning without accuracy degradation. This work clarifies the overlooked elements critical to KV cache compression during decoding and optimally balances computational efficiency, storage optimization, inference latency, and accuracy. We envision that EMPIRIC will guide future research toward scalable, efficient KV cache compression techniques that significantly improve inference performance for long-context LLMs.
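To make the linear-growth claim concrete, the following is a minimal sketch of the standard KV cache size calculation: one key and one value vector are stored per token, per layer, per KV head. The function name `kv_cache_bytes` and the default model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) are illustrative assumptions and are not taken from the paper.

```python
def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=8, head_dim=128,
                   bytes_per_elem=2, batch_size=1):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    All default values describe a hypothetical model shape chosen only
    for illustration; they are not taken from the paper.
    """
    # Factor of 2: one key tensor and one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return batch_size * seq_len * per_token


# Compare a 512-token budget against longer uncompressed contexts.
for n in (512, 4096, 131072):
    print(f"{n:>7} tokens -> {kv_cache_bytes(n) / 2**20:,.0f} MiB")
```

Under these assumed dimensions, each cached token costs 128 KiB, so a 131,072-token context occupies 16 GiB while a 512-token budget occupies 64 MiB. This counts storage only; it says nothing about which tokens can be dropped without hurting accuracy, which is the question the abstract addresses.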
Payman Behnam
Yaosheng Fu
Ritchie Zhao
ACM SIGOPS Operating Systems Review
Georgia Institute of Technology
Nvidia (United States)
Behnam et al. studied this question.
www.synapsesocial.com/papers/68c1b81f54b1d3bfb60ec622 — DOI: https://doi.org/10.1145/3759441.3759448