What question did this study set out to answer?

This research aims to test the prediction that per-head conformal weight in transformers correlates with attention valley depth.

April 19, 2026 · Open Access

A Pre-Registered Test of Boundary Conformal Field Theory in Transformer Attention: Per-head conformal weight predicts long-range attention valley depth in 6 of 7 decoder-only models

Key Points

Analyzed causal attention in decoder-only transformers using a pre-registered prediction framework.
Measured per-head conformal weight from short-context random-token attention.
Tested correlation between conformal weight and valley depth in seven transformer models.
Confirmed the pre-registered threshold (Spearman ρ ≥ 0.50, p ≤ 10⁻⁵) in six of the seven models.
Pythia-2.8B falsified the prediction at ρ = +0.46, just below the 0.50 threshold; a per-layer diagnostic localizes the failure to layers 22–27.
In Pythia-2.8B and GPT-Neo-2.7B, 88–94% of conformal heads preferred the full BCFT form over a bare power law.
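The correlation test summarized above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes per-head values are available as two equal-length lists (hypothetical names `delta` and `valley`), computes Spearman's ρ as the Pearson correlation of average ranks, and checks only the ρ ≥ 0.50 part of the pre-registered criterion stated in the abstract. The p ≤ 10⁻⁵ part is left out here; in practice it could be obtained from a statistics library such as `scipy.stats.spearmanr`.

```python
from math import sqrt

RHO_THRESHOLD = 0.50  # pre-registered threshold quoted in the abstract


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / sqrt(vx * vy)


def meets_rho_threshold(delta, valley, threshold=RHO_THRESHOLD):
    """Check only the rho part of the criterion (p-value not computed here)."""
    return spearman_rho(delta, valley) >= threshold
```

Because the statistic depends only on ranks, any monotone rescaling of Δ or of the valley measure leaves the result unchanged, which is presumably why a rank correlation suits a comparison across heads with very different attention scales.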

Abstract

Causal attention in trained transformers exhibits power-law decay in attention weight as a function of query–key separation, with a position-dependent enhancement near the start of the sequence. We previously interpreted this as the two-point function of a boundary conformal field theory (BCFT) on a strip whose left edge is the start of the sequence. The framework predicts that per-head conformal weight Δ, measured from short-context random-token attention, should positively rank-correlate across heads with a long-range "valley depth" measure related to the "lost-in-the-middle" phenomenon. We pre-registered the prediction Spearman ρ(Δ, valley) ≥ 0.50, p ≤ 10⁻⁵, and tested it on seven decoder-only transformers (Pythia-410m/1.4B/2.8B, GPT-Neo-2.7B, Qwen2-7B, OLMo-7B, Mistral-7B-v0.3). Six confirmed; Pythia-2.8B falsified at ρ = +0.46. A per-layer diagnostic localizes the falsification to layers 22–27 and shows that GPT-Neo-2.7B (the matched control: same parameter count, same training data, different training recipe) confirms with ρ = +0.96 across all 32 layers. Fitting the full BCFT functional form (3 parameters per head: C, Δ, λ) on Pythia-2.8B and GPT-Neo-2.7B reveals that 88–94% of conformal heads prefer BCFT over the bare power law, that λ is mostly positive and well structured, and that Δ_BCFT is closer to the SYK Δ = 1/4 prediction than Δ_PL. However, the pre-registered scalar ρ(Δ, valley) becomes weaker with the cleaner Δ_BCFT, while the joint (Δ, λ) → valley rank-regression explains substantially more variance. Two findings demand explanation: (i) ρ(λ, valley) is mostly negative across layers in both models, the opposite of the framework's prediction; (ii) GPT-Neo-2.7B exhibits an alternating-layer pattern with two distinct populations of heads by boundary structure. We discuss what this changes about the framework, what we would do differently in a follow-up pre-registration, and where the most informative remaining tests lie.
All code, raw per-head data, and the pre-registration document are in the public repository at Capacity-For-Evil/ariel.
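To make the "bare power law" baseline concrete, the sketch below fits a two-parameter power law to one head's attention weight as a function of query–key separation. It is an illustration under stated assumptions, not the procedure in the repository: it assumes the standard CFT two-point scaling w(d) ≈ C · d^(−2Δ) (the paper's exact normalization is not given on this page), uses ordinary least squares in log-log space, and ignores the boundary term governed by λ. The function name `fit_power_law` and its inputs are hypothetical.

```python
from math import exp, log


def fit_power_law(separations, weights):
    """Fit w(d) = C * d**(-2 * delta) by least squares on log w vs. log d.

    Assumes every separation and weight is strictly positive.
    Returns (C, delta).
    """
    if len(separations) != len(weights) or len(separations) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    xs = [log(d) for d in separations]
    ys = [log(w) for w in weights]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("separations must not all be equal")
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    # Under the assumed convention the log-log slope equals -2 * delta.
    return exp(intercept), -slope / 2
```

For example, synthetic weights generated as `0.3 * d ** -0.5` for separations 1 through 64 are recovered with Δ close to 0.25, the SYK value mentioned in the abstract. A log-log fit weights small and large separations equally in log space, so real attention data with noise at long range may call for a weighted or nonlinear fit instead.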

Authors

Ariel Umphrey

Mission Heritage Medical Group

Eldon Umphrey

Mission Heritage Medical Group
