What question did this study set out to answer?

The study extends the Thermodynamic Attention Framework (TAF) with empirical predictions about transformer attention, including how 4-bit inference shifts the attention-decay exponent γ relative to bf16.

May 3, 2026 · Open Access


Predicting How Transformers Attend, Part II: A Six-Axis Decomposition with the Learned Imprint ν = -1/(2π), Sink-Dominated Precision Boundaries, Bimodal Phase Structure, and Honest Revisions

CARLES MARÍN MUÑOZ

Key Points

Introduces a precision-direction rule for 4-bit NF4 inference: the R² of the bf16 power-law fit predicts the sign of the change in γ under quantization.
Supports the learned-imprint slope ν ≈ -1/(2π) with power-law fits and a random-initialization falsifier (Pythia 70M/410M/1B at random init).
Machine-verifies all 15 algebraic identities of the framework with two tools: Sage Groebner bases and Lean Mathlib4.
The architectural correlation reaches an in-sample R² of 0.30 but fails out of sample (median 70/30 hold-out R² ≈ -0.09).
To the author's knowledge, the first transformer-attention paper with end-to-end dual-tool machine verification of its algebraic content.
Seven claims from paper I or paper II drafts are withdrawn or demoted, leaving a leaner framework.
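
The gap between an in-sample R² and a negative hold-out R² can be illustrated with a minimal sketch. It assumes an ordinary least-squares fit with an intercept and repeated random 70/30 splits; the data are synthetic placeholders, not the paper's measurements, and the function names (`r2_score`, `fit_linear`, `holdout_r2`) are hypothetical helpers introduced only for this example.

```python
import numpy as np

def r2_score(y_true, y_pred):
    # R² relative to the mean of y_true; it is negative whenever the
    # model predicts worse than that constant mean.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_linear(x, y):
    # Ordinary least squares with an intercept column.
    A = np.column_stack([x, np.ones(len(x))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, x):
    A = np.column_stack([x, np.ones(len(x))])
    return A @ coef

def holdout_r2(x, y, n_splits=1000, test_frac=0.3, seed=0):
    # Median R² on the held-out part over repeated random splits.
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = max(2, int(round(test_frac * n)))
    scores = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        test, train = perm[:n_test], perm[n_test:]
        coef = fit_linear(x[train], y[train])
        scores.append(r2_score(y[test], predict_linear(coef, x[test])))
    return float(np.median(scores))

# Synthetic placeholder data: a weak linear signal buried in noise.
rng = np.random.default_rng(1)
x_demo = rng.integers(1, 33, size=33).astype(float)
y_demo = -0.012 * x_demo + rng.normal(0.0, 0.3, size=33)

in_sample = r2_score(y_demo, predict_linear(fit_linear(x_demo, y_demo), x_demo))
print("in-sample R²:", in_sample)
print("median hold-out R²:", holdout_r2(x_demo, y_demo))
```

With an intercept, the in-sample R² of a least-squares fit cannot be negative, while the hold-out score has no such floor. That is why a weak relation can look modestly useful in sample and still carry no predictive power.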

Abstract

Companion paper to "Predicting How Transformers Attend" (Marín 2026, Zenodo DOI 10.5281/zenodo.19826343), which introduced the Thermodynamic Attention Framework (TAF) and the closed-form predictor γ_Padé(θ, T) = (2θ - T√2) / (2θ + T√2) for the attention-decay exponent γ. This paper presents a phenomenological extension of TAF with five constructive contributions, ordered by empirical strength:

(1) A precision-direction rule for 4-bit NF4 inference on full multi-head attention: the R² of the bf16 power-law fit predicts the sign of Δγ_{4-bit − bf16}. Sign-correct on 5/5 paired bf16/4-bit measurements (DeepSeek-7B-base, DeepSeek-7B-chat, Pythia-2.8B, Pythia-1B, Llama-3-8B, Qwen2.5-7B-Instruct); a deployment heuristic for practitioners serving 4-bit inference.

(2) A learned-imprint axis with slope ν ≈ -1/(2π), supported by three convergent arguments and a random-init falsifier (Pythia 70M/410M/1B at random init, p = 0.44). Honest caveats: the bootstrap CI is wide (-0.260, -0.008) and a Pythia-70M trajectory across 9 checkpoints does not monotonically converge.

(3) An algebraic decomposition of the Cardy-like entropy anomaly: ΔH_Padé(γ) = log(z/2) + 2·arctanh(γ), linearising empirically with slope ≈ 5 across the panel.

(4) A bimodal phase structure of γ_text across the panel, with ~36% of measured LLMs sitting at γ ≥ 1 (Hagedorn zone), reframed as an industrial GQA-design correlate rather than a phase attractor.

(5) Machine-verified algebraic backbone: all 15 algebraic identities of the framework verified by both Sage Groebner basis and Lean Mathlib4, including a previously unstated quadratic identity D-SAGE-1: 2η² + η·γ_χ + 1 = 0. To our knowledge, this is the first transformer-attention paper with end-to-end dual-tool machine verification of its algebraic content.

A separate correlation finding sits below the constructive contributions: the architectural concentration relation γ_text ≈ γ_Padé − 0.012·n_kv reproduces an in-sample R² = 0.30 (vs R² = 0.02 for Padé alone) but fails out-of-sample (median 70/30 hold-out R² ≈ -0.09 over 1000 random splits; family-leave-out aggregate R² = -0.027). The n_kv coefficient is statistically significant (bootstrap CI excludes zero) and the relative improvement over Padé alone is robust (+0.20 R² family-LOO), but absolute predictive power is essentially nil. We report it as a cross-panel correlation structure, not a predictive law: a downgrade documented internally rather than discovered by reviewers.

A symmetric set of honest revisions accompanies the constructive material: seven claims from paper I or paper II drafts are withdrawn or demoted (Rc* ≈ 1.68 as a sharp boundary, γ = 1 - 1/φ as a code-tuning attractor, the Mittag-Leffler prefactor 1/Γ(1-γ), the universal soft-decay KV truncation rule, the "0.3% match" framing for ν, the d_horizon "law", which we show is algebraically ≡ T when γ matches Padé, and the κ·N_sem topological invariant). The framework emerges leaner and more honest.

The accompanying public dataset (karlexmarin/taf-attention-decay on HuggingFace, 79 records across 33 models, CC-BY-4.0) and the diagnostic tool (karlexmarin/taf-agent on HuggingFace Spaces) provide reproducible measurements and operational recipes. The audit methodology used to produce this paper has been spun off into a standalone domain-agnostic framework (Sócrates Audit & Discovery Framework, separate release) for use across physics, chemistry, biology, mathematics, ML, and social sciences. This release includes both the English version (73 pages) and the Spanish version (76 pages) of the paper.
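
The two closed forms quoted in the abstract can be evaluated directly. The sketch below is only an illustration: the abstract does not define θ, T, or z, so the numeric values are arbitrary placeholders, and the function names `gamma_pade` and `delta_h_pade` are hypothetical. It also checks a consequence that follows from the two formulas alone: since 2·arctanh(γ) = log((1+γ)/(1−γ)), substituting the Padé form gives ΔH = log(z/2) + log(√2·θ/T) whenever |γ| < 1.

```python
import math

def gamma_pade(theta: float, T: float) -> float:
    # Closed-form predictor as written in the abstract:
    # (2θ − T√2) / (2θ + T√2).
    s = T * math.sqrt(2.0)
    return (2.0 * theta - s) / (2.0 * theta + s)

def delta_h_pade(gamma: float, z: float) -> float:
    # log(z/2) + 2·arctanh(γ); arctanh is real only for |γ| < 1,
    # so values at or beyond ±1 are rejected here.
    if not -1.0 < gamma < 1.0:
        raise ValueError("arctanh is real-valued only for |gamma| < 1")
    return math.log(z / 2.0) + 2.0 * math.atanh(gamma)

# Arbitrary placeholder inputs (assumptions, not values from the paper).
theta, T, z = 10000.0, 2048.0, 4.0

g = gamma_pade(theta, T)
lhs = delta_h_pade(g, z)
# Same quantity via the simplified form derived in the lead-in.
rhs = math.log(z / 2.0) + math.log(math.sqrt(2.0) * theta / T)

print("gamma_pade:", g)
print("delta_h (arctanh form):", lhs)
print("delta_h (simplified form):", rhs)
```

The explicit domain check matters because the abstract reports a share of models at γ ≥ 1, where the arctanh form has no real value.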
