What question did this study set out to answer?

To assess, using a new benchmark, how well LLMs preserve token-level behavior under different context compression strategies.

April 13, 2026 | Open Access

If the Model can tell, the Compression is not Transparent: measuring Logit-Fidelity in LLM context compression

Key Points

Assessed token-level behavioral fidelity in LLMs under several context compression strategies using a new benchmark.
Developed the LFCM benchmark, which measures the flip rate at temperature zero (τ = 0): the fraction of generation steps where the greedy argmax token changes once the KV-cache is compressed.
Evaluated three KV-cache compression strategies: uniform eviction, H2O-approx, and recency-based (StreamingLLM-style).
Used a teacher-forced, fully deterministic evaluation protocol (4-bit NF4 quantization, SDPA attention) across multiple models and datasets.
All compression methods produced significant flip rates, ranging from 5% to 50%, even at high retention levels.
Method rankings inverted on long real-world conversations, where recency-based eviction produced up to 3× higher flip rates.
The qualitative ranking of compression strategies was preserved across model families and scales (7B → 14B).
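
The three eviction strategies named in the key points can be illustrated with a minimal Python sketch of which cache positions each one keeps. This is not the paper's implementation: the function names, the evenly spaced reading of "uniform eviction", the optional `n_sink` parameter, and the per-position `attn_mass` scores are all assumptions made for illustration.

```python
from typing import List, Sequence


def keep_uniform(seq_len: int, budget: int) -> List[int]:
    """Keep `budget` evenly spaced positions (one assumed reading of uniform eviction)."""
    budget = min(budget, seq_len)
    if budget <= 0:
        return []
    return [i * seq_len // budget for i in range(budget)]


def keep_recency(seq_len: int, budget: int, n_sink: int = 0) -> List[int]:
    """Keep the most recent positions, optionally preceded by a few initial
    'sink' positions (hypothetical parameter, StreamingLLM-style)."""
    budget = min(budget, seq_len)
    n_sink = min(n_sink, budget)
    n_recent = budget - n_sink
    head = list(range(n_sink))
    tail = list(range(max(n_sink, seq_len - n_recent), seq_len))
    return head + tail


def keep_heavy_hitters(attn_mass: Sequence[float], budget: int, n_recent: int) -> List[int]:
    """H2O-like rule: keep the last `n_recent` positions plus the older positions
    with the highest accumulated attention (`attn_mass` is an assumed input)."""
    seq_len = len(attn_mass)
    budget = min(budget, seq_len)
    n_recent = min(n_recent, budget)
    recent = list(range(seq_len - n_recent, seq_len))
    older = range(seq_len - n_recent)
    heavy = sorted(older, key=lambda i: attn_mass[i], reverse=True)[: budget - n_recent]
    return sorted(heavy + recent)


# Toy example: 10 cached positions, budget of 4.
mass = [0.9, 0.1, 0.05, 0.6, 0.1, 0.05, 0.3, 0.1, 0.2, 0.4]
print(keep_uniform(10, 4))
print(keep_recency(10, 4))
print(keep_heavy_hitters(mass, 4, 2))
```

In this toy setting the pure recency rule drops every early position, which is the kind of loss the abstract links to higher flip rates on long conversations.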

Abstract

We introduce LFCM (Logit-Faithful Context Memory), a benchmark designed to measure token-level behavioral fidelity under LLM context compression. LFCM quantifies the flip rate at temperature zero (τ = 0): the fraction of generation steps where the greedy argmax token changes when the KV-cache is compressed. Using a teacher-forced, fully deterministic protocol (4-bit NF4 quantization, SDPA attention), we evaluate three KV-cache compression strategies (uniform eviction, H2O-approx, and recency-based, StreamingLLM-style) across multiple models (Mistral-7B, Qwen-7B, Qwen-14B) and datasets, including both synthetic conversations and long naturalistic contexts (ShareGPT).

Our results reveal three key findings:

Non-trivial divergence: All methods produce significant flip rates (5–50%), even at high retention levels.
Context-dependent inversion: On long real-world conversations, method rankings invert. Recency-based eviction produces up to 3× higher flip rates due to the loss of early, structurally important context.
Cross-model robustness: The qualitative ranking of methods is preserved across model families and scales (7B → 14B).

This benchmark exposes a critical blind spot in current evaluation practices: standard metrics (accuracy, perplexity, semantic similarity) do not capture whether compressed models preserve exact token-level behavior. LFCM provides a complementary, deterministic measure of strict behavioral transparency, essential for reproducible pipelines, persistent-memory agents, and safety-critical applications.

Version 2 updates: Fixes a first-token evaluation bug, adds cross-scale validation (14B), and introduces the context-structure dependence finding.
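
The flip rate defined in the abstract can be sketched in a few lines of Python. This is a minimal illustration rather than the benchmark's code: it assumes the per-step logits from the uncompressed and compressed runs have already been collected under teacher forcing, and the names `flip_rate`, `full_logits`, and `compressed_logits` are hypothetical.

```python
from typing import Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value; ties resolve to the lowest index."""
    return max(range(len(values)), key=lambda i: values[i])


def flip_rate(
    full_logits: Sequence[Sequence[float]],
    compressed_logits: Sequence[Sequence[float]],
) -> float:
    """Fraction of generation steps whose greedy (argmax) token differs
    between the full-cache run and the compressed-cache run."""
    if len(full_logits) != len(compressed_logits):
        raise ValueError("both runs must cover the same generation steps")
    if not full_logits:
        raise ValueError("at least one generation step is required")
    flips = sum(
        argmax(f) != argmax(c) for f, c in zip(full_logits, compressed_logits)
    )
    return flips / len(full_logits)


# Toy example: 4 generation steps over a 3-token vocabulary.
full = [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1], [0.0, 0.1, 3.0], [0.9, 1.0, 0.2]]
compressed = [[0.2, 1.8, 0.3], [0.4, 0.9, 0.1], [0.0, 0.2, 2.5], [0.9, 1.0, 0.2]]
print(flip_rate(full, compressed))  # 0.25: only the second step changes its argmax
```

Because both runs are scored at the same teacher-forced positions, a single flip does not cascade into later steps, so each step contributes independently to the rate.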

Cite This Study

Régis RIGAUD studied this question.

synapsesocial.com/papers/69dc89823afacbeac03eb2a9
https://doi.org/10.5281/zenodo.19510356
