What question did this study set out to answer?

The study aims to characterize the Lagrangian structure of pretrained transformers and to clarify what the cosine Semantic Tube Prediction (STP) loss does and does not measure.

June 3, 2026 | Open Access

Locally Conservative, Globally Not: Diagnosing the Lagrangian Structure of Pretrained Transformers

Key Points

Derived a closed-form identity for the cosine STP loss, verified it to machine precision across 1,314 consecutive triplets from GPT-2, and built a diagnostic on it.
Fitted a joint regression model on frozen checkpoints to assess layer-wise updates under a shared potential.
Used independent controls to validate findings against alternative architectural explanations.
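97.9% of triplets decelerated along the trajectory, and the tangential component of acceleration, to which STP loss is blind, was approximately twice the normal component on GPT-2.
A weight-tied scalar-potential flow, used as a positive control, achieved a median per-layer test R² of 0.957, against 0.54 for a scale- and data-matched 8M-parameter GPT-2-style decoder and 0.46 for pretrained GPT-2 small.
All three architectures passed the velocity-aware Jacobian-symmetry test at PCA-16, indicating local per-step conservativity.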
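
As an illustration of the diagnostic behind these points, the sketch below checks the identity L_STP = 1 − √(1 − |a⊥|² / |d₂|²) stated in the abstract on synthetic vectors rather than on GPT-2 hidden states. The conventions are assumptions, not taken from the paper: d₁ and d₂ are the two consecutive displacements of a triplet, a = d₂ − d₁ is the discrete acceleration, and "tangential" means along d₁. The function name `stp_decomposition` is hypothetical. Under these conventions the identity follows from |d₂|² = (d₂ · t)² + |a⊥|², where t is the unit tangent, and it takes the positive root, so it applies when the cosine is non-negative.

```python
import numpy as np

def stp_decomposition(h0, h1, h2):
    """Cosine STP loss and acceleration components for one triplet.

    Assumed conventions (not confirmed by the paper): d1 = h1 - h0,
    d2 = h2 - h1, a = d2 - d1, with the tangent taken along d1.
    """
    d1 = h1 - h0
    d2 = h2 - h1
    a = d2 - d1                                  # discrete acceleration
    t_hat = d1 / np.linalg.norm(d1)              # unit tangent
    a_par = float(a @ t_hat)                     # signed tangential component
    a_perp = float(np.linalg.norm(a - a_par * t_hat))  # normal magnitude
    d2_norm = float(np.linalg.norm(d2))
    cos = float(d1 @ d2) / (float(np.linalg.norm(d1)) * d2_norm)
    loss = 1.0 - cos                             # cosine STP loss
    # Closed form from the abstract; positive root, so valid for cos >= 0.
    closed_form = 1.0 - np.sqrt(max(1.0 - (a_perp / d2_norm) ** 2, 0.0))
    return loss, closed_form, a_par, a_perp

# Synthetic smooth triplets: a small acceleration keeps the cosine positive.
rng = np.random.default_rng(0)
max_gap = 0.0
for _ in range(1000):
    h0 = rng.normal(size=64)
    d1 = rng.normal(size=64)
    a = 0.5 * rng.normal(size=64)
    h1 = h0 + d1
    h2 = h1 + d1 + a
    loss, closed_form, a_par, a_perp = stp_decomposition(h0, h1, h2)
    max_gap = max(max_gap, abs(loss - closed_form))

print(f"largest |loss - closed form| over 1000 triplets: {max_gap:.2e}")
```

The loss depends only on `a_perp` and the length of d₂, which is why the tangential component `a_par` is invisible to it. In this sketch a negative `a_par` is read as deceleration along the trajectory; the paper's exact deceleration criterion may differ.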

Abstract

We give an exact mechanical interpretation of the cosine Semantic Tube Prediction (STP) loss of Huang, LeCun, and Balestriero (2026) and use it to construct a static architectural diagnostic for the global Lagrangian structure of pretrained transformers. The mathematical core is the closed-form identity L_STP = 1 − √(1 − |a⊥|² / |d₂|²), which relates the cosine STP loss between three consecutive hidden states to the magnitude of the component of acceleration normal to the trajectory; we prove the identity algebraically and verify it to machine precision across 1,314 consecutive triplets on GPT-2. Two structural consequences follow immediately: STP loss is blind to the tangential component of acceleration, which we measure to be approximately twice the normal component on GPT-2, and deceleration along the trajectory is the direct mechanical signature of a learned restoring force, with 97.9% of triplets decelerating and a permutation null at z < −11 establishing that sequential token order produces significantly smoother trajectories than random orderings. Both findings replicate on Pythia-160M. These descriptive results establish local hallmarks of damped Lagrangian flow on a bounded attractive potential as a property of trained attention transformers. The natural follow-up, whether the per-layer hidden-state updates globally derive from a single shared scalar potential, is empirically testable. We fit the joint regression Δh_ℓ ≈ α_ℓ · v_ℓ − β_ℓ · ∇V_ψ(h_ℓ) across all layers with a single learned V_ψ on frozen checkpoints, and obtain a three-way architectural separator on Tiny Shakespeare: a weight-tied scalar-potential autoregressive flow (used here as a positive control) attains median per-layer test R² = 0.957 with a uniform layer profile; a scale- and data-matched 8M-parameter GPT-2-style decoder reaches R² = 0.54 with monotonic decay; pretrained GPT-2 small reaches R² = 0.46 with a bathtub profile (middle-band mean 0.09).
Three independent controls rule out alternative explanations: an oracle fit using the positive control's own potential attains R² = 0.931 on the LayerNorm variant (and 1.0000 on the base architecture); a V_ψ capacity sweep over a 16× parameter band rules out expressivity limitation; and the separation reproduces under orthogonal token-direction coordinates. All three architectures pass a velocity-aware Jacobian-symmetry test at PCA-16, making local per-step conservativity universal while global shared-potential structure remains architectural. We therefore characterise the Lagrangian content of trained attention transformers as locally conservative but globally not derivable from a shared scalar potential. We provide both a sharper reading of what STP-style training objectives do and do not measure, and a reproducible architectural diagnostic that other globally-Lagrangian designs can be evaluated against.
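
To make the shape of the regression Δh_ℓ ≈ α_ℓ · v_ℓ − β_ℓ · ∇V_ψ(h_ℓ) concrete, here is a deliberately simplified sketch. It is not the paper's joint fit: instead of learning V_ψ, it fixes a known quadratic potential V(h) = ½|h|² (so ∇V(h) = h), generates synthetic layer updates from that potential, and recovers the two per-layer coefficients by ordinary least squares. The helper `fit_layer`, the choice v_ℓ = previous update, the R² computed over flattened entries, and all numeric settings are assumptions made for illustration.

```python
import numpy as np

def fit_layer(delta_h, v, grad_v):
    """Least-squares (alpha, beta) for delta_h ~ alpha * v - beta * grad_v.

    Returns the two coefficients and an R^2 computed over all flattened
    entries (an assumed convention, not necessarily the paper's).
    """
    X = np.stack([v.ravel(), -grad_v.ravel()], axis=1)  # two regressors
    y = delta_h.ravel()
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return float(coef[0]), float(coef[1]), float(r2)

# Synthetic "layers" driven by a single shared potential V(h) = 0.5 * |h|^2.
rng = np.random.default_rng(0)
n_samples, dim, n_layers = 200, 16, 4
alpha_true, beta_true = 0.8, 0.1

h = rng.normal(size=(n_samples, dim))
v = 0.1 * rng.normal(size=(n_samples, dim))   # assumed incoming velocity

results = []
for layer in range(n_layers):
    grad_v = h                                # gradient of the assumed V
    noise = 0.01 * rng.normal(size=h.shape)
    delta_h = alpha_true * v - beta_true * grad_v + noise
    results.append(fit_layer(delta_h, v, grad_v))
    v = delta_h                               # next velocity = this update
    h = h + delta_h

for layer, (alpha, beta, r2) in enumerate(results):
    print(f"layer {layer}: alpha={alpha:.3f} beta={beta:.3f} R2={r2:.3f}")
```

Because the synthetic updates really do derive from one shared potential, the per-layer fit is near-perfect by construction, which loosely mirrors the role of the positive control and oracle fit described above. A network whose updates do not derive from a shared potential would show lower R² under the same procedure.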
