What does this research mean for the field?

LLM context compaction causes severe information loss primarily through attention dilution rather than poor compression quality, and evaluating this degradation requires multi-replicate benchmarks due to extreme run-to-run variance. Novelty: novel finding. Consensus alignment: neutral.

What question did this study set out to answer?

This study aims to understand how message compaction affects recall in large language models (LLMs) during extended conversations.

May 20, 2026 · Open Access

Lost in Compaction

Key Points

Calibrated a recall benchmark using 234 facts from LongMemEval across a 190K-token context.
Isolated compaction loss by compacting 5-98% of context and measuring recall degradation.
Compared four multi-pass compaction strategies across a 5M-token conversation at multiple checkpoints.
Compacting as little as 5% of the context led to a 7-percentage-point drop in recall.
At 50% compaction, recall in the compacted zone dropped to 0-7%, regardless of the model used.
The compaction strategies showed substantial run-to-run variance in recall, reinforcing the need for multiple replicates in evaluations.
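
The compact-then-re-pad step described in the key points can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the function name `compact_and_repad`, the choice to compact the oldest messages, the placement of the padding, and the use of word counts in place of token counts are all hypothetical.

```python
from typing import Callable


def compact_and_repad(
    messages: list[str],
    fraction: float,
    summarize: Callable[[list[str]], str],
    filler: str = "filler",
) -> list[str]:
    """Summarize the oldest `fraction` of messages, then pad back to the original size.

    Size is approximated by whitespace-separated word count (an assumption;
    a real benchmark would count tokens with the model's tokenizer).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")

    original_size = sum(len(m.split()) for m in messages)

    # Assumption: the compacted zone is the oldest part of the conversation.
    cut = round(len(messages) * fraction)
    summary = summarize(messages[:cut]) if cut else ""
    head = [summary] if summary else []
    kept = messages[cut:]

    # Re-pad with neutral filler so the total context size is unchanged.
    deficit = original_size - sum(len(m.split()) for m in head + kept)
    padding = [" ".join([filler] * deficit)] if deficit > 0 else []

    # Assumption: padding sits between the summary and the untouched messages.
    return head + padding + kept
```

Holding the total size constant is what lets recall changes be attributed to compaction rather than to a shorter context; `summarize` would be an LLM call in practice and is left as a caller-supplied function here.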

Abstract

Long-running LLM conversations inevitably exceed the context window, forcing systems to compact old messages through summarization. But how do we know whether our compaction benchmark accurately measures information loss? Before comparing strategies, we need to understand how reliably an LLM can recall facts from its own context, and what factors affect that recall. We present a three-phase study.

First, we calibrate a recall benchmark using 234 naturally-embedded facts from the LongMemEval dataset across a 190K-token context. We discover four measurement pitfalls that affect all compaction benchmarks: (1) a questions-per-prompt effect (we write Q for the number of questions in a single prompt) where Q=10 yields 11pp higher recall than Q=1 in static contexts, but the effect reverses under severe compaction, with Q=1 outperforming Q=5 by 9pp; (2) a category hierarchy where temporal reasoning (30% max) and preference recall (40% max) are near-impossible regardless of strategy; (3) a density saturation where recall plateaus beyond ~0.4 facts/kTok despite increasing evidence; and (4) a grep-LLM gap where keyword search finds 86% of facts but the LLM only recalls 25-79% of them: information is present in the context but ignored by attention.

Second, we isolate compaction loss by compacting 5-98% of a context, re-padding to the original size, and measuring recall degradation. Even 5% compaction costs 7 percentage points of recall. At 50%, the compacted zone is near-dead (0-7% recall) despite keywords surviving at 82-93% via grep. Critically, compaction damages even the untouched portion of the context: remaining-zone recall drops from 68% to 39% as compaction increases, an attention dilution effect caused by injecting noise (re-padding) into the context.

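
The grep-LLM gap and the compacted-zone versus remaining-zone comparison can be computed along these lines. This is a minimal sketch assuming each fact carries one search keyword and a boolean judge verdict; the `FactResult` record and the function names are hypothetical, not the paper's schema.

```python
from dataclasses import dataclass


@dataclass
class FactResult:
    """One benchmark fact (hypothetical record layout)."""
    keyword: str             # string a grep-style search would look for
    recalled: bool           # whether the LLM's answer was judged correct
    in_compacted_zone: bool  # whether the fact sat in the summarized span


def grep_rate(context: str, facts: list[FactResult]) -> float:
    """Fraction of facts whose keyword still appears verbatim in the context."""
    if not facts:
        return 0.0
    haystack = context.lower()
    return sum(f.keyword.lower() in haystack for f in facts) / len(facts)


def recall_rate(facts: list[FactResult]) -> float:
    """Fraction of facts the LLM actually answered correctly."""
    if not facts:
        return 0.0
    return sum(f.recalled for f in facts) / len(facts)


def zone_report(context: str, facts: list[FactResult]) -> dict[str, float]:
    """Compare keyword survival with LLM recall, split by zone."""
    compacted = [f for f in facts if f.in_compacted_zone]
    remaining = [f for f in facts if not f.in_compacted_zone]
    return {
        "grep_compacted": grep_rate(context, compacted),
        "recall_compacted": recall_rate(compacted),
        "recall_remaining": recall_rate(remaining),
    }
```

A large difference between `grep_compacted` and `recall_compacted` is the pattern the abstract describes: the keywords are still in the context, but the model does not retrieve them.
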
Cross-model validation with Claude Sonnet 4.6 (92.5% baseline recall, attenuated Lost-in-the-Middle profile) confirms that severe compaction destroys information regardless of model capability: even a stronger model with much shallower spatial bias drops to 21% at 98% compaction.

Third, we compare four multi-pass compaction strategies (Brutal, Incremental, Frozen, FrozenRanked) on a single 5M-token conversation evaluated at five mid-feed checkpoints (500K to 5M) with constant fact density and 4-6 replicates per cell. The strategy hierarchy is consistent in the means: FrozenRanked > Frozen > Incremental > Brutal. All strategies degrade severely with scale (Frozen drops from 14.9% at 500K to 3.0% at 5M). With replicates we also document substantial run-to-run variance: the compaction phase itself is non-deterministic at temperature zero, and recall measurements on identical conversations span up to a factor of 14x (e.g. S4 at 1M: 2.6%-35.9% across replicates). Single-shot benchmarks of compaction strategies are therefore unreliable; replicates are mandatory.

The bottleneck is attention capacity, not compression quality: keywords survive summarization but the LLM cannot retrieve them, and adding more preserved summaries dilutes attention rather than helping. All experiments use Anthropic Claude models (Haiku 4.5 and Sonnet 4.6). The methodological findings (Q-effect, judge-prompt sensitivity, run-to-run variance) are likely model-agnostic, but the absolute recall numbers and the strategy ordering should be re-validated on non-Claude models before generalising.
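
The replicate requirement can be made concrete with a small summary helper. This is a sketch only: `summarize_replicates` is a hypothetical name, and the choice of statistics (mean, sample standard deviation, max/min ratio) is an assumption rather than the paper's reporting procedure.

```python
from statistics import mean, stdev


def summarize_replicates(recalls: list[float]) -> dict[str, float]:
    """Summarize recall (in percent) across replicates of one benchmark cell."""
    if len(recalls) < 2:
        raise ValueError("need at least two replicates to estimate variance")
    lo, hi = min(recalls), max(recalls)
    return {
        "n": float(len(recalls)),
        "mean": mean(recalls),
        "stdev": stdev(recalls),  # sample standard deviation
        "min": lo,
        "max": hi,
        # Ratio between best and worst replicate; infinite if the worst is zero.
        "spread_ratio": hi / lo if lo > 0 else float("inf"),
    }
```

Applied to just the two extremes reported in the abstract for S4 at 1M, `summarize_replicates([2.6, 35.9])["spread_ratio"]` is about 13.8, which matches the stated factor of roughly 14x. The intermediate replicate values are not given in the abstract, so they are not reproduced here.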
