Context compression for large language models is commonly evaluated using downstream task accuracy, perplexity, or semantic similarity. These are indirect proxies and can mask behavioral divergence at the token level. We propose a stronger fidelity criterion: a compressed context should produce the same greedy-decoded token as the full context at every generation step. To measure this, we introduce the LFCM benchmark (Logit-Faithful Context Memory), which uses flip rate: the fraction of generation steps where the argmax token changes under compression. The benchmark uses a teacher-forced, temperature-zero protocol to isolate the effect of compression from sampling noise. We evaluate KV-cache compression on Mistral-7B and Qwen-7B across five settings: synthetic, ShareGPT, MT-Bench, LongBench-v2, and HumanEval, totaling more than 300 conversations. Our results show that uniform eviction produces 10-25% flip rates across all datasets, while a sliding-window strategy (Recent) achieves 0.9% at r = 0.9, with 30% of conversations fully transparent. A severity analysis shows that 65% of flips affect content words. In free autoregressive generation, even a 4.6% teacher-forced flip rate yields only 33% BLEU, confirming that small flip rates can cascade into substantial behavioral divergence. These results suggest that flip rate at temperature zero is an essential metric for strict behavioral fidelity, and that it is currently missing from existing compression benchmarks.
Régis RIGAUD (Sat,) studied this question.
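The flip-rate metric described in the abstract can be illustrated with a minimal sketch. It assumes that, under teacher forcing, one vector of logits per generation step is available for both the full and the compressed context. The function names `argmax`, `flip_rate`, and `is_transparent` are hypothetical and not taken from the LFCM benchmark. Reading "fully transparent" as "zero flips across the conversation" is also an assumption.

```python
def argmax(row):
    """Index of the largest logit; ties resolve to the lowest index."""
    return max(range(len(row)), key=row.__getitem__)


def flip_rate(full_logits, compressed_logits):
    """Fraction of generation steps where the greedy (argmax) token differs.

    Both arguments are sequences of per-step logit vectors obtained under
    teacher forcing, so step t is conditioned on the same prefix in both runs.
    """
    if len(full_logits) != len(compressed_logits):
        raise ValueError("both runs must cover the same number of steps")
    if not full_logits:
        raise ValueError("at least one generation step is required")
    flips = sum(
        argmax(f) != argmax(c)
        for f, c in zip(full_logits, compressed_logits)
    )
    return flips / len(full_logits)


def is_transparent(full_logits, compressed_logits):
    """Assumed reading of 'fully transparent': no flips at any step."""
    return flip_rate(full_logits, compressed_logits) == 0.0


# Toy example: 4 steps over a 3-token vocabulary (made-up numbers).
full = [[2.0, 1.0, 0.0], [0.0, 3.0, 1.0], [1.0, 0.0, 5.0], [4.0, 2.0, 1.0]]
comp = [[2.0, 1.0, 0.0], [0.0, 1.0, 3.0], [1.0, 0.0, 5.0], [4.0, 2.0, 1.0]]

print(flip_rate(full, comp))       # one of four steps flips
print(is_transparent(full, comp))  # False, since a flip occurred
```

Comparing argmax indices rather than logit values matches the temperature-zero framing: at temperature zero only the top-ranked token matters, so any change in logits that leaves the argmax unchanged counts as faithful.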