What question did this study set out to answer?

To establish a new criterion for evaluating fidelity in context compression of large language models.

March 31, 2026 | Open Access

If the Model can tell, the Compression is not Transparent: measuring Logit-Fidelity in LLM context compression

Key Points

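Proposed a stricter fidelity criterion for LLM context compression: the compressed context must produce the same greedy-decoded token as the full context at every generation step.
Introduced the LFCM (Logit-Faithful Context Memory) benchmark, which measures flip rate: the fraction of generation steps where the argmax token changes under compression.
Evaluated KV-cache compression on Mistral-7B and Qwen-7B across five settings (synthetic, ShareGPT, MT-Bench, LongBench-v2, and HumanEval), totaling more than 300 conversations.
Used a teacher-forced, temperature-zero protocol to isolate the effect of compression from sampling noise.
Uniform eviction produced 10-25% flip rates across all datasets.
The sliding-window strategy (Recent) achieved a 0.9% flip rate at r = 0.9, with 30% of conversations fully transparent.
65% of flips affected content words, and in free autoregressive generation even a 4.6% teacher-forced flip rate yielded only 33% BLEU.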
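
To make the contrast between the two eviction families concrete, the sketch below selects which cache positions survive under each. It is an illustration only: the function names, the `keep_ratio` parameter, and the reading of "uniform" as evenly spaced positions are assumptions, not the paper's definitions, and `keep_ratio` is not claimed to be the paper's r.

```python
from typing import List


def _budget(n_positions: int, keep_ratio: float) -> int:
    """Number of cache positions to keep (hypothetical helper)."""
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in [0, 1]")
    return min(n_positions, int(n_positions * keep_ratio))


def recent_keep(n_positions: int, keep_ratio: float) -> List[int]:
    """Sliding-window style: keep only the most recent positions."""
    k = _budget(n_positions, keep_ratio)
    return list(range(n_positions - k, n_positions))


def uniform_keep(n_positions: int, keep_ratio: float) -> List[int]:
    """Uniform style (assumed here to mean evenly spaced positions)."""
    k = _budget(n_positions, keep_ratio)
    # Integer striding yields k distinct indices whenever k <= n_positions.
    return [(i * n_positions) // k for i in range(k)]


if __name__ == "__main__":
    print(recent_keep(10, 0.5))   # keeps the last five positions
    print(uniform_keep(10, 0.5))  # keeps every second position
```

The two functions keep the same number of positions and differ only in which ones, which is the variable the reported flip rates compare.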

Abstract

Context compression for large language models is commonly evaluated using downstream task accuracy, perplexity, or semantic similarity. These are indirect proxies and can mask behavioral divergence at the token level. We propose a stronger fidelity criterion: a compressed context should produce the same greedy-decoded token as the full context at every generation step. To measure this, we introduce the LFCM benchmark (Logit-Faithful Context Memory), which uses flip rate: the fraction of generation steps where the argmax token changes under compression. The benchmark uses a teacher-forced, temperature-zero protocol to isolate the effect of compression from sampling noise. We evaluate KV-cache compression on Mistral-7B and Qwen-7B across five settings: synthetic, ShareGPT, MT-Bench, LongBench-v2, and HumanEval, totaling more than 300 conversations. Our results show that uniform eviction produces 10-25% flip rates across all datasets, while a sliding-window strategy (Recent) achieves 0.9% at r = 0.9, with 30% of conversations fully transparent. A severity analysis shows that 65% of flips affect content words. In free autoregressive generation, even a 4.6% teacher-forced flip rate yields only 33% BLEU, confirming that small flip rates can cascade into substantial behavioral divergence. These results suggest that flip rate at temperature zero is an essential metric for strict behavioral fidelity, and that it is currently missing from existing compression benchmarks.
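
The flip-rate definition above can be sketched in a few lines. This is a minimal illustration, not the LFCM implementation: it assumes the per-step logits for the full and compressed contexts have already been collected under teacher forcing (both runs scored on the same reference continuation), and the names `flip_rate` and `is_transparent` are hypothetical.

```python
from typing import Sequence


def _argmax(row: Sequence[float]) -> int:
    """Index of the largest logit (first index wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def flip_rate(
    full_logits: Sequence[Sequence[float]],
    compressed_logits: Sequence[Sequence[float]],
) -> float:
    """Fraction of generation steps where the argmax token differs."""
    if len(full_logits) != len(compressed_logits):
        raise ValueError("both runs must cover the same generation steps")
    if not full_logits:
        raise ValueError("need at least one generation step")
    flips = sum(
        _argmax(f) != _argmax(c)
        for f, c in zip(full_logits, compressed_logits)
    )
    return flips / len(full_logits)


def is_transparent(full_logits, compressed_logits) -> bool:
    """A conversation is fully transparent when no step flips."""
    return flip_rate(full_logits, compressed_logits) == 0.0


if __name__ == "__main__":
    # Toy logits over a 3-token vocabulary for 4 generation steps.
    full = [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1], [0.0, 0.1, 3.0], [0.4, 0.5, 0.1]]
    comp = [[0.2, 1.8, 0.3], [0.3, 1.2, 0.1], [0.0, 0.2, 2.5], [0.6, 0.5, 0.1]]
    print(flip_rate(full, comp))  # steps 2 and 4 flip, so 2 of 4
```

Because only the argmax is compared, the metric ignores shifts in logit magnitude that leave the top token unchanged, which is why it pairs naturally with greedy, temperature-zero decoding.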
