Q: What question did this study set out to answer?

This research establishes a mathematical framework to understand behaviors in high-dimensional language models.

May 31, 2026 · Open Access

The Mathematics of Large Language Models III: Structure — Concentration, Projection, and Selection Theorems

Q: What does this research mean for the field?

Useful behavior in high-dimensional stochastic systems like large language models emerges when training, architecture, data, and finite precision force observables through low-entropy structural projections, rendering residual fluctuations predictable. Novelty: methodological. Consensus alignment: establishes a new direction.

Miquel Noguer Alonso, Allen Institute for Artificial Intelligence

Key Points

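Develops three core selection theorems organized around the Constraint–Projection–Limit (CPL) principle.
Proposes seven empirical tests, each with explicit kill criteria.
Organizes the remaining results into three tiers: full theorems (Tier I), conditional architectural propositions (Tier II), and conjectures (Tier III).
Observable-entropy collapse law: a Lipschitz function class determinizes when its metric entropy is subcritical relative to the ambient concentration rate.
No-free-semantics theorem: on a normal Lévy family, uniformly Lipschitz logits cannot robustly separate two positive-mass semantic classes unless Lipschitz scale, depth, class rarity, or the concentration model itself changes.
Exact projection theorem: the excess next-token log loss from replacing the full context X with a structural code T(X) is exactly the conditional mutual information I(Y;X ∣ T).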

Abstract

We develop a mathematical research programme for language models organized around the Constraint–Projection–Limit (CPL) principle: useful behaviour in a high-dimensional stochastic system is selected when training, architecture, data, and finite precision force observables through low-entropy structural projections, after which concentration or stability renders the residual fluctuations predictable. The rigorous core consists of three selection statements. First, an observable-entropy collapse law shows that a Lipschitz function class determinizes when its metric entropy is subcritical relative to the ambient concentration rate. Second, a no-free-semantics theorem shows that robust separation of two positive-mass semantic classes on a normal Lévy family is impossible for uniformly Lipschitz logits unless Lipschitz scale, depth, class rarity, or the concentration model itself changes. Third, an exact projection theorem shows that the excess next-token log loss incurred by replacing the full context X with a structural code T(X) is precisely the conditional mutual information I(Y;X ∣ T). The projection step of CPL becomes an operational quantity rather than a metaphor. The remaining results are organized into three tiers. Tier I contains full theorems: concentration foundations and depth lower bounds, path-wise martingale variance, chain-rule and de Finetti diagnostics, Pinsker control of independence error, curvature comparison, G-graded spectra, detailed balance, Lipschitz Johnson–Lindenstrauss factorization, and cyclic position-encoding decomposition. Tier II gives conditional architectural propositions on biclustering, Kanerva collision bounds, softmax free energy, and non-reversible attention. Tier III states conjectures: transformer Lyapunov stability, feedback-channel bit budgets, factor-model rank ceilings, average-opinion bias, Fisher–Rao concentration, and the CPL universality target. Seven empirical tests with explicit kill criteria are proposed. 
The contribution is to isolate quantities that can be proved, measured, or falsified: Lipschitz scale, metric entropy, conditional mutual information, entropy production, Lyapunov spectra, effective rank, and path-wise variance.
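
The exact projection statement can be checked numerically on a small discrete example. The sketch below is an illustration only, not the paper's construction. It assumes that both predictors are Bayes-optimal for the information they see, so the expected log loss is H(Y ∣ X) with the full context and H(Y ∣ T) with the code. The excess is then H(Y ∣ T) − H(Y ∣ X), which equals I(Y;X ∣ T) because T is a function of X. The joint table `P_XY`, the code `structural_code`, and all function names are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical toy joint distribution p(x, y): four contexts, two next tokens.
# The numbers are illustrative assumptions, not data from the paper.
P_XY = {
    (0, 0): 0.20, (0, 1): 0.05,
    (1, 0): 0.10, (1, 1): 0.15,
    (2, 0): 0.05, (2, 1): 0.20,
    (3, 0): 0.15, (3, 1): 0.10,
}

def structural_code(x):
    """Hypothetical structural code T(x): merge contexts {0, 1} and {2, 3}."""
    return x // 2

def coarsen(p_xy, code):
    """Push the joint table p(x, y) forward to p(t, y) with t = code(x)."""
    p_ty = defaultdict(float)
    for (x, y), p in p_xy.items():
        p_ty[(code(x), y)] += p
    return dict(p_ty)

def conditional_entropy(p_zy):
    """H(Y | Z) in nats, i.e. the Bayes-optimal expected log loss given Z."""
    p_z = defaultdict(float)
    for (z, _y), p in p_zy.items():
        p_z[z] += p
    return -sum(p * math.log(p / p_z[z]) for (z, _y), p in p_zy.items() if p > 0)

def excess_log_loss(p_xy, code):
    """Extra expected log loss from predicting Y with T(X) instead of X."""
    return conditional_entropy(coarsen(p_xy, code)) - conditional_entropy(p_xy)

def conditional_mutual_information(p_xy, code):
    """I(Y; X | T) computed directly as E[log p(y|x) / p(y|t)], using T = T(X)."""
    p_x = defaultdict(float)
    p_t = defaultdict(float)
    for (x, _y), p in p_xy.items():
        p_x[x] += p
        p_t[code(x)] += p
    p_ty = coarsen(p_xy, code)
    total = 0.0
    for (x, y), p in p_xy.items():
        if p > 0:
            t = code(x)
            p_y_given_x = p / p_x[x]
            p_y_given_t = p_ty[(t, y)] / p_t[t]
            total += p * math.log(p_y_given_x / p_y_given_t)
    return total

excess = excess_log_loss(P_XY, structural_code)
cmi = conditional_mutual_information(P_XY, structural_code)
print(f"excess log loss = {excess:.6f} nats, I(Y;X|T) = {cmi:.6f} nats")
```

The two printed quantities agree up to floating-point error. Replacing `structural_code` with the identity map drives both to zero, the case where the code discards nothing relevant to the next token.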
