What question did this study set out to answer?

The study aims to formalize context selection in language models and how it impacts information retention.

June 27, 2026 · Open Access

Context Selection as Approximate Sufficiency: Le Cam Deficiency and Kolmogorov's Structure Function Against a Fixed Decoder

Key Points

Formalization of context selection using Le Cam deficiency and Kolmogorov's structure function.
Evaluation of retention curves across seven frozen decoders in four model families.
Documentation of a reproducible search protocol for validating findings.
Retention curves are flat for known content and steep for novel content across all tested models.
The measurements match the theory's prediction of class-separated compression curves.
Code and data are released separately as the LM Recall repository.

Abstract

Inference-time context selection — choosing the bounded subset of an unbounded information stream that a frozen language model is allowed to see — has a precise formal identity that the literature solving it (retrieval-augmented generation, prompt compression, KV-cache eviction, long-context memory) does not name. This paper gives it two names, one for each half of the problem. The outer problem, which content to admit, is the minimization of Le Cam deficiency against the model as a fixed decision rule (Blackwell 1951; Le Cam 1964). The inner problem, how to represent each admitted item, is the Vereshchagin–Vitányi conditional structure function relative to that fixed decoder (2004). The contribution is recognition and exact reframing, not new theorems. The paper states the formalism to referee precision, proves the excess-risk identity and a non-vanishing capacity floor against it, imports a finite-sample generalization guarantee for the implied selector, and locates the object among the adjacent frameworks it is routinely confused with (mismatched rate–distortion, the information bottleneck, compressed sensing). The inner object makes a falsifiable prediction — that content classes produce heterogeneous, class-separated compression curves — confirmed across seven frozen decoders in four model families, with a curve-aware allocator derived from the curves behaving as the theory requires. A dated, reproducible search protocol documents the citation gap. The measured retention curves match what the structure function predicts: known content flat, novel content steep, across seven decoders in four model families. Companion essays argue the foundational pair in long form; code and data are released separately as the LM Recall repository.
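
For orientation, the two objects named in the abstract have standard textbook forms, sketched below. The notation ($\mathcal{E}$, $\mathcal{F}$, $T$, $h_x$) is an assumption made for illustration and may differ from the paper's. In particular, the paper's inner object is a conditional variant taken relative to the fixed decoder, and its exact form is not reproduced here.

```latex
% Le Cam deficiency of experiment E = (P_\theta) relative to F = (Q_\theta):
% the smallest worst-case total-variation gap achievable by passing E
% through a Markov kernel T (normalization of the TV norm varies by author).
\delta(\mathcal{E}, \mathcal{F})
  \;=\; \inf_{T} \, \sup_{\theta \in \Theta} \,
        \bigl\| T P_\theta - Q_\theta \bigr\|_{\mathrm{TV}}

% Kolmogorov structure function of a string x at model-complexity budget \alpha:
% the log-size of the smallest finite set S containing x whose description
% length K(S) stays within the budget.
h_x(\alpha)
  \;=\; \min_{S \ni x} \bigl\{ \log_2 |S| \;:\; K(S) \le \alpha \bigr\}
```

In this standard form, $\delta(\mathcal{E}, \mathcal{F}) = 0$ means $\mathcal{E}$ is at least as informative as $\mathcal{F}$ in Blackwell's sense, so a small deficiency can be read as approximate sufficiency, which is the reading the title points to.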
