What question did this study set out to answer?

The aim is to address memory bottlenecks in long-context inference for large language models by optimizing key-value caching.

May 5, 2026 · Open Access

BladeRunnerSNT: Query-Aware Adaptive KV Cache Pruning for Efficient Long-Context Inference

Key Points

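Developed BladeRunnerSNT, which uses adaptive budget allocation to set the number of retained cache slots from a norm-derived importance ratio rather than from raw sequence length.
Implemented query-aware token protection, a fixed recency window that keeps the question and answer choices in the cache during generation.
Tested across three model families (Qwen2-7B, Mistral-7B-Instruct, Qwen2.5-14B) and four benchmarks (LongBench-v2, QuALITY, CNN/DailyMail, NIAH).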
BladeRunnerSNT retains approximately 203 slots versus 3,234 slots in a full cache, enhancing memory efficiency.
Consistently outperformed the H2O, SnapKV, and StreamingLLM baselines; on Mistral-7B, surpassed full KV attention by 10 percentage points on LongBench-v2.
Introduces less than 0.3% overhead in time-to-first-token while maintaining performance across different budgets.
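
As a quick check on the slot counts above, the following sketch computes the reduction ratio and a rough cache size. The layer count, KV-head count, head dimension, and 16-bit precision are illustrative assumptions for a Mistral-7B-like configuration, not figures from the paper, and the sketch assumes the same slots are retained in every layer and head.

```python
def kv_cache_bytes(slots, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Approximate KV cache size in bytes.

    Each slot stores one key and one value vector (hence the factor 2)
    per layer and per KV head. All defaults are assumed, not from the source.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * slots


FULL_SLOTS = 3234    # approximate full-cache size reported in the source
PRUNED_SLOTS = 203   # approximate slots retained by BladeRunnerSNT

ratio = FULL_SLOTS / PRUNED_SLOTS

print(f"slot reduction:    {ratio:.1f}x")
print(f"retained fraction: {PRUNED_SLOTS / FULL_SLOTS:.1%}")
print(f"full cache:        {kv_cache_bytes(FULL_SLOTS) / 2**20:.1f} MiB")
print(f"pruned cache:      {kv_cache_bytes(PRUNED_SLOTS) / 2**20:.1f} MiB")
```

The ratio comes out at roughly 15.9, consistent with the 16× reduction reported in the abstract. Because the size is a constant multiple of the slot count, the same function also shows why an unpruned cache grows linearly with sequence length.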

Abstract

Large language models (LLMs) face a critical memory bottleneck during long-context inference: the key-value (KV) cache grows linearly with sequence length, leading to excessive GPU memory usage and increased latency. Existing compression methods either evict tokens based solely on recency, losing semantically important content, or rely on global importance scores without protecting the query context that is most relevant for generation. We propose BladeRunnerSNT, a two‑mechanism framework that addresses both limitations. First, adaptive budget allocation dynamically sets the number of retained slots using a norm‑derived importance ratio, decoupling the cache size from raw sequence length. Second, query‑aware token protection reserves a fixed recency window that guarantees the presence of the question and answer choices, preventing the model from pruning the most critical tokens during generation. Across three model families (Qwen2‑7B, Mistral‑7B‑Instruct, Qwen2.5‑14B) and four benchmarks (LongBench‑v2, QuALITY, CNN/DailyMail, NIAH), BladeRunnerSNT consistently outperforms H2O, SnapKV, and StreamingLLM while retaining only ≈203 slots versus a full cache of ≈3,234 slots. On Mistral‑7B, it surpasses full KV attention by +10 percentage points on LongBench‑v2 with a 16× reduction in slots, suggesting that selective compression acts as a regulariser by suppressing irrelevant context noise. An ablation variant without the protection window confirms that query‑aware protection is the decisive component. BladeRunnerSNT introduces negligible overhead (less than 0.3% of time‑to‑first‑token) and offers robust performance across budgets and model sizes.
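
The two mechanisms described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formula for the norm-derived importance ratio or the scoring rule, so `adaptive_budget` (mean key norm divided by peak key norm) and the per-token `scores` are hypothetical stand-ins, and all function and parameter names are assumptions made for illustration rather than the authors' implementation.

```python
def adaptive_budget(key_norms, min_slots=64, max_slots=512):
    """Hypothetical budget rule (NOT the paper's formula).

    Maps a norm-derived ratio in [0, 1] onto [min_slots, max_slots],
    so the budget does not depend on the raw sequence length.
    """
    if not key_norms:
        return min_slots
    peak = max(key_norms)
    if peak <= 0:
        return min_slots
    ratio = (sum(key_norms) / len(key_norms)) / peak  # assumed importance ratio
    return min_slots + round(ratio * (max_slots - min_slots))


def select_slots(scores, budget, protect_window):
    """Return sorted indices of the cache slots to keep.

    The last `protect_window` positions (assumed to hold the question and
    answer choices) are always kept; the rest of the budget goes to the
    highest-scoring earlier tokens. If the budget is smaller than the
    window, the window is still kept in full.
    """
    n = len(scores)
    protected = list(range(max(0, n - protect_window), n))
    remaining = max(0, budget - len(protected))
    candidates = range(0, n - len(protected))
    top = sorted(candidates, key=lambda i: scores[i], reverse=True)[:remaining]
    return sorted(top) + protected


# Toy example: 8 tokens, the last 3 form the protected query window.
scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.05, 0.05, 0.05]
kept = select_slots(scores, budget=5, protect_window=3)
print(kept)  # [1, 3, 5, 6, 7]
```

In the toy example the three query tokens have the lowest scores, so a purely score-based eviction policy would drop them first; the protected window keeps them regardless, which is the failure mode the protection mechanism is meant to prevent.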
