What question did this study set out to answer?

The study asks whether the power-law decay of attention weights in transformer LLMs can be predicted from first principles, with a closed-form exponent γ instead of a purely empirical fit.

May 22, 2026 | Open Access

Predicting How Transformers Attend: Analytic Power-Law Theory, Phase Transitions, and Practical Compression Tools

Key Points

Validated predictions on 30+ transformer models, from Pythia-70M to Qwen2.5-7B.
Used a five-axis decomposition of the residual variance (R² = 0.44 on n=23) and a controlled-θ pretraining pilot at θ ∈ {10⁴, 10⁵, 10⁶}.
Developed a browser-based diagnostic tool for formula implementation and data reproducibility.
Achieved a median MAE of 4.3% on the geometric centroid (n=9 non-anomalous subset, n=56 full panel).
Established a regime diagram for long-context attention use classified by γ values.
Validated higher-order predictions where power law outperformed exponential in 54/56 measurements.
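
The power-law versus exponential comparison mentioned in the last key point can be illustrated with a small sketch. This is not the paper's fitting procedure, which this page does not describe. It is a minimal example, assuming both models are fitted by ordinary least squares on log-weights and compared by sum of squared errors. The function names and the synthetic data are hypothetical.

```python
import math


def linear_fit_sse(xs, ys):
    """Ordinary least-squares line y = a + b*x; returns (slope, intercept, SSE)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def compare_decay_models(distances, weights):
    """Fit log A = c - gamma*log d (power law) and log A = c - rate*d (exponential)."""
    log_w = [math.log(w) for w in weights]
    log_d = [math.log(d) for d in distances]
    pl_slope, _, pl_sse = linear_fit_sse(log_d, log_w)
    ex_slope, _, ex_sse = linear_fit_sse([float(d) for d in distances], log_w)
    return {
        "gamma": -pl_slope,
        "power_law_sse": pl_sse,
        "rate": -ex_slope,
        "exponential_sse": ex_sse,
    }


# Synthetic attention profile A(d) = d^(-gamma); NOT data from the paper.
true_gamma = 0.75
distances = list(range(1, 513))
weights = [d ** -true_gamma for d in distances]

result = compare_decay_models(distances, weights)
print(result)
```

On real attention weights, the same comparison would be run per measurement, and the model with the lower error would be counted as the better fit for that measurement.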

Abstract

Companion (Part II): the follow-up paper Predicting How Transformers Attend, Part II (DOI 10.5281/zenodo.19960573, https://doi.org/10.5281/zenodo.19960573) extends this work with a six-axis γ-decomposition (including the learned-imprint axis ν = −1/(2π)), an NF4 precision-sensitivity rule, the Cardy entropy anomaly, a bimodal Hagedorn phase structure, and a Sage+Lean machine-verified algebraic backbone (15 D-SAGE identities).

Version 3 (corrected, 2026-05-20): Hagedorn heat-capacity coefficient corrected to C_V(γ=1, N) = (log N)²/12 in Thm. 7.1 (previously /4). Added prior-art citation to Qu, Ly landing-page figure updated. Spanish edition updated in parallel.

A first-principles explanation of the ubiquitous power-law decay of attention weights in transformer LLMs. The RoPE positional encoding imposes a log-distance constraint on the attention score; the maximum-entropy distribution compatible with that constraint is a power law A(d) ∝ d^(−γ) with closed-form exponent γ = (2θ − T_eval·√2) / (2θ + T_eval·√2) (the [1/1] Padé approximant of e^(−z)). Validated on 30+ models from Pythia-70M to Qwen2.5-7B, median MAE 4.3% (n=9 non-anomalous subset, n=56 full panel) on the geometric centroid; corpus / architecture / induction-head phase contribute the residual variance via a five-axis decomposition (R² = 0.44 on n=23).

Three operational consequences: a regime diagram (γ1) classifying long-context use, a closed-form KV-cache compression window D_f predicting the operating point that empirical methods (SnapKV, PyramidKV, BLASST) calibrate by sweep, and a closed-form NTK base scaling α_opt for zero-shot context extension, Pareto-dominant on n=4 Pythia models against the unscaled baseline, which collapses to chance retrieval at L > T_train. A controlled-θ pretraining pilot at θ ∈ {10⁴, 10⁵, 10⁶} confirms quantitative agreement (max 5.07% relative error vs Padé) under causal isolation.

Higher-order predictions empirically validated: power law beats exponential in 54/56 measurements; per-layer γ stability CV < 0.20 on 5/5 models. A free, browser-based diagnostic tool implementing every formula is available at https://karlesmarin.github.io/tafagent (Apache-2.0). Source and reproducibility data (343 JSON measurement files, 5.5 MB) are at https://github.com/karlesmarin/tafagent. Single-sentence position: "Attention is not learned arbitrarily; it follows a constrained scaling law that can be exploited for design, efficiency, and reasoning."
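
The closed-form exponent in the abstract can be evaluated directly. The sketch below is a minimal illustration, not the authors' tool. It assumes the Padé argument is z = T_eval·√2/θ, which is inferred here by matching the stated formula to the [1/1] Padé approximant (1 − z/2)/(1 + z/2) and is not stated on this page. The sample values of θ and T_eval are hypothetical.

```python
import math


def pade_gamma(theta: float, t_eval: float) -> float:
    """Exponent as written in the abstract: (2θ − T_eval·√2) / (2θ + T_eval·√2)."""
    scaled = t_eval * math.sqrt(2.0)
    return (2.0 * theta - scaled) / (2.0 * theta + scaled)


def pade_11_exp_neg(z: float) -> float:
    """[1/1] Padé approximant of e^(−z): (1 − z/2) / (1 + z/2)."""
    return (1.0 - z / 2.0) / (1.0 + z / 2.0)


# Hypothetical inputs chosen only for illustration.
theta = 10_000.0
t_eval = 2_048.0

# Assumed identification of the Padé argument (an inference, see lead-in).
z = t_eval * math.sqrt(2.0) / theta

gamma = pade_gamma(theta, t_eval)
print(f"z = {z:.4f}")
print(f"gamma (closed form)  = {gamma:.4f}")
print(f"gamma (Pade of e^-z) = {pade_11_exp_neg(z):.4f}")
print(f"exp(-z)              = {math.exp(-z):.4f}")
```

Under this assumption the two expressions for γ agree exactly, and the gap between γ and e^(−z) grows as z increases, since the Padé form is only accurate for small z.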
