Large language models (LLMs) face a critical memory bottleneck during long‑context inference: the key‑value (KV) cache grows linearly with sequence length, which inflates GPU memory usage and latency. Existing compression methods either evict tokens based solely on recency, losing semantically important content, or rely on global importance scores without protecting the query context that is most relevant for generation. We propose BladeRunnerSNT, a two‑mechanism framework that addresses both limitations. First, adaptive budget allocation dynamically sets the number of retained slots using a norm‑derived importance ratio, decoupling the cache size from raw sequence length. Second, query‑aware token protection reserves a fixed recency window that guarantees the presence of the question and answer choices, so the tokens most critical to generation are never pruned. Across three model families (Qwen2‑7B, Mistral‑7B‑Instruct, Qwen2.5‑14B) and four benchmarks (LongBench‑v2, QuALITY, CNN/DailyMail, NIAH), BladeRunnerSNT consistently outperforms H2O, SnapKV, and StreamingLLM while retaining only ≈203 slots versus a full cache of ≈3,234 slots. On Mistral‑7B, it surpasses full KV attention by +10 percentage points on LongBench‑v2 with a roughly 16× reduction in slots, suggesting that selective compression acts as a regulariser by suppressing irrelevant context noise. An ablation variant without the protection window confirms that query‑aware protection is the decisive component. BladeRunnerSNT introduces negligible overhead (less than 0.3% of time‑to‑first‑token) and offers robust performance across budgets and model sizes.
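The abstract does not give the exact scoring rule, so the following is only a minimal sketch of how the two mechanisms could fit together, not the authors' implementation. The function name `select_slots`, the use of per-token key norms as the importance signal, the "above the mean norm" rule for setting the budget, and the parameters `recency_window`, `ratio_threshold`, and `min_budget` are all assumptions made for illustration.

```python
def select_slots(key_norms, recency_window, ratio_threshold=1.0, min_budget=1):
    """Return the sorted indices of KV-cache slots to keep (illustrative only).

    key_norms: one non-negative score per cached token; treating a larger key
    norm as "more important" is an assumption, not a claim from the paper.
    """
    n = len(key_norms)

    # Mechanism 2 (protection): the last `recency_window` tokens, assumed to
    # hold the question and answer choices, are always retained.
    protected_start = max(0, n - recency_window)
    protected = list(range(protected_start, n))
    prefix = list(range(protected_start))
    if not prefix:
        return protected

    # Mechanism 1 (adaptive budget): the budget is the number of prefix tokens
    # whose norm exceeds a multiple of the mean norm, so it depends on the
    # score distribution rather than directly on sequence length.
    mean_norm = sum(key_norms[i] for i in prefix) / len(prefix)
    important = [i for i in prefix if key_norms[i] > ratio_threshold * mean_norm]
    budget = min(len(prefix), max(min_budget, len(important)))

    # Keep the `budget` highest-scoring prefix tokens plus the protected tail.
    top = sorted(prefix, key=lambda i: key_norms[i], reverse=True)[:budget]
    return sorted(top) + protected


if __name__ == "__main__":
    norms = [1.0, 5.0, 1.0, 1.0, 6.0, 1.0, 1.0, 1.0]
    print(select_slots(norms, recency_window=2))  # [1, 4, 6, 7]
```

In this toy run, two high-norm tokens in the prefix survive alongside the two protected tail tokens, so 4 of 8 slots are kept. The reported figures of ≈203 retained slots against ≈3,234 correspond to a ratio of about 15.9, consistent with the 16× reduction quoted above.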
Durhan Yazir
Durhan Yazir (Sun,) studied this question.
www.synapsesocial.com/papers/69f9890415588823dae17ecd — DOI: https://doi.org/10.5281/zenodo.20004265