We introduce LFCM (Logit-Faithful Context Memory), a benchmark that measures token-level behavioral fidelity under LLM context compression. LFCM quantifies the flip rate at temperature zero (τ = 0): the fraction of generation steps at which the greedy argmax token changes when the KV-cache is compressed. Using a teacher-forced, fully deterministic protocol (4-bit NF4 quantization, SDPA attention), we evaluate three KV-cache compression strategies (uniform eviction, H2O-approx, and recency-based, StreamingLLM-style eviction) across multiple models (Mistral-7B, Qwen-7B, Qwen-14B) and datasets, including both synthetic conversations and long naturalistic contexts (ShareGPT).

Our results reveal three key findings:

(1) Non-trivial divergence: all methods produce substantial flip rates (5–50%), even at high retention levels.
(2) Context-dependent inversion: on long real-world conversations, method rankings invert. Recency-based eviction produces up to 3× higher flip rates because it discards early, structurally important context.
(3) Cross-model robustness: the qualitative ranking of methods is preserved across model families and scales (7B → 14B).

This benchmark exposes a critical blind spot in current evaluation practice: standard metrics (accuracy, perplexity, semantic similarity) do not capture whether a compressed model preserves exact token-level behavior. LFCM provides a complementary, deterministic measure of strict behavioral transparency, which is essential for reproducible pipelines, persistent-memory agents, and safety-critical applications.

Version 2 updates: fixes a first-token evaluation bug, adds cross-scale validation (14B), and introduces the context-structure dependence finding.
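The flip rate described above can be illustrated with a minimal sketch. It assumes that, under teacher forcing, the full-cache and compressed-cache runs each yield one logit vector per generation step over the same token sequence. The function name `flip_rate`, the helper `greedy_token`, and the toy logits are hypothetical and are not taken from the benchmark's code.

```python
def greedy_token(logits):
    """Index of the largest logit (first index wins on ties), i.e. the greedy token."""
    return max(range(len(logits)), key=logits.__getitem__)


def flip_rate(full_logits, compressed_logits):
    """Fraction of steps whose greedy token differs between the two runs.

    Assumption: both arguments are sequences of per-step logit vectors,
    aligned step by step (teacher-forced on the same reference tokens).
    """
    if len(full_logits) != len(compressed_logits):
        raise ValueError("both runs must cover the same number of steps")
    if not full_logits:
        raise ValueError("at least one generation step is required")
    flips = sum(
        greedy_token(f) != greedy_token(c)
        for f, c in zip(full_logits, compressed_logits)
    )
    return flips / len(full_logits)


# Toy data: 4 steps over a 3-token vocabulary (illustrative values only).
full = [
    [2.0, 1.0, 0.0],
    [0.0, 3.0, 1.0],
    [1.0, 0.0, 2.0],
    [0.5, 0.4, 0.1],
]
compressed = [
    [2.0, 1.0, 0.0],   # unchanged
    [0.0, 1.0, 3.0],   # argmax moves from token 1 to token 2: a flip
    [1.0, 0.0, 2.0],   # unchanged
    [0.5, 0.45, 0.1],  # logits shift but argmax stays at token 0: no flip
]

rate = flip_rate(full, compressed)
print(rate)  # 0.25
```

The fourth step shows why the measure is strict yet insensitive to small logit drift: only a change in the argmax counts, so a run can have nonzero logit error and still register no flip at that step.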
Régis RIGAUD studied this question.