We give an exact mechanical interpretation of the cosine Semantic Tube Prediction (STP) loss of Huang, LeCun, and Balestriero (2026) and use it to construct a static architectural diagnostic for the global Lagrangian structure of pretrained transformers. The mathematical core is the closed-form identity L_STP = 1 − √(1 − |a⊥|² / |d₂|²), which relates the cosine STP loss between three consecutive hidden states to the magnitude of the component of acceleration normal to the trajectory; we prove the identity algebraically and verify it to machine precision across 1,314 consecutive triplets on GPT-2. Two structural consequences follow immediately: STP loss is blind to the tangential component of acceleration, which we measure to be approximately twice the normal component on GPT-2, and deceleration along the trajectory is the direct mechanical signature of a learned restoring force, with 97.9% of triplets decelerating and a permutation null at z < −11 establishing that sequential token order produces significantly smoother trajectories than random orderings. Both findings replicate on Pythia-160M. These descriptive results establish local hallmarks of damped Lagrangian flow on a bounded attractive potential as a property of trained attention transformers. The natural follow-up, whether the per-layer hidden-state updates globally derive from a single shared scalar potential, is empirically testable. We fit the joint regression Δh_ℓ ≈ α_ℓ · v_ℓ − β_ℓ · ∇V_ψ(h_ℓ) across all layers with a single learned V_ψ on frozen checkpoints, and obtain a three-way architectural separator on Tiny Shakespeare: a weight-tied scalar-potential autoregressive flow (used here as a positive control) attains median per-layer test R² = 0.957 with a uniform layer profile; a scale- and data-matched 8M-parameter GPT-2-style decoder reaches R² = 0.54 with monotonic decay; pretrained GPT-2 small reaches R² = 0.46 with a bathtub profile (middle-band mean 0.09).
Three independent controls rule out alternative explanations: an oracle fit using the positive control's own potential attains R² = 0.931 on the LayerNorm variant (and 1.0000 on the base architecture); a V_ψ capacity sweep over a 16× parameter band rules out expressivity limitation; and the separation reproduces under orthogonal token-direction coordinates. All three architectures pass a velocity-aware Jacobian-symmetry test at PCA-16, making local per-step conservativity universal while global shared-potential structure remains architectural. We therefore characterise the Lagrangian content of trained attention transformers as locally conservative but globally not derivable from a shared scalar potential. We provide both a sharper reading of what STP-style training objectives do and do not measure, and a reproducible architectural diagnostic that other globally-Lagrangian designs can be evaluated against.
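The closed-form identity stated in the abstract can be checked numerically on synthetic triplets. The sketch below is illustrative only and rests on several assumptions not spelled out in the abstract: d₁ = h₁ − h₀ and d₂ = h₂ − h₁ are consecutive displacements, the acceleration is a = d₂ − d₁, "normal" means orthogonal to d₁, the cosine loss is 1 − cos(d₁, d₂), and the cosine is non-negative (the branch on which the square root applies). Under those assumptions |a⊥|² = |d₂|²(1 − cos²θ), which gives the identity. The function names `stp_identity_terms` and `max_identity_gap` are hypothetical, and the random data stands in for real hidden states.

```python
import math
import random


def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(u, v))


def stp_identity_terms(h0, h1, h2):
    """Return (cosine loss, closed-form value, |a_perp|, signed a_par) for one triplet.

    Assumed conventions: d1 = h1 - h0, d2 = h2 - h1, a = d2 - d1,
    and the tangent direction is the unit vector along d1.
    """
    d1 = [b - a for a, b in zip(h0, h1)]
    d2 = [b - a for a, b in zip(h1, h2)]
    n1 = math.sqrt(dot(d1, d1))
    n2 = math.sqrt(dot(d2, d2))
    tangent = [x / n1 for x in d1]
    accel = [y - x for x, y in zip(d1, d2)]
    a_par = dot(accel, tangent)  # negative value = slowing down along d1
    a_perp = [a - a_par * t for a, t in zip(accel, tangent)]
    a_perp_norm = math.sqrt(dot(a_perp, a_perp))
    cos_loss = 1.0 - dot(d1, d2) / (n1 * n2)
    # max(0.0, ...) guards against tiny negative values from rounding.
    closed = 1.0 - math.sqrt(max(0.0, 1.0 - a_perp_norm ** 2 / n2 ** 2))
    return cos_loss, closed, a_perp_norm, a_par


def max_identity_gap(n_triplets, dim, seed=0):
    """Largest |cosine loss - closed form| over random, mostly smooth triplets."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(n_triplets):
        h0 = [rng.gauss(0, 1) for _ in range(dim)]
        step = [rng.gauss(0, 1) for _ in range(dim)]
        h1 = [a + b for a, b in zip(h0, step)]
        # Second step repeats the first plus a small perturbation.
        h2 = [a + b + 0.3 * rng.gauss(0, 1) for a, b in zip(h1, step)]
        cos_loss, closed, _, _ = stp_identity_terms(h0, h1, h2)
        if cos_loss <= 1.0:  # keep only the non-negative-cosine branch
            worst = max(worst, abs(cos_loss - closed))
    return worst


if __name__ == "__main__":
    print("max identity gap:", max_identity_gap(1000, 64))
```

The signed tangential component `a_par` is returned alongside the normal magnitude so that a deceleration fraction could be tallied on real hidden states; whether this matches the exact deceleration criterion behind the reported 97.9% figure is an assumption.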
Dimitar Gueorguiev
synapsesocial.com/papers/6a1fc4e4dee9eb8c0dce6689 — DOI: https://doi.org/10.5281/zenodo.20496310