We give an exact mechanical interpretation of the cosine Semantic Tube Prediction (STP) loss of Huang, LeCun, and Balestriero (2026) and use it to construct a static architectural diagnostic for the global Lagrangian structure of pretrained transformers. The mathematical core is the closed-form identity L_STP = 1 − √(1 − |a⊥|² / |d₂|²), which relates the cosine STP loss between three consecutive hidden states to the magnitude of the component of acceleration normal to the trajectory; we prove the identity algebraically and verify it to machine precision across 1,314 consecutive triplets on GPT-2. Two structural consequences follow immediately. First, STP loss is blind to the tangential component of acceleration, which we measure to be approximately twice the normal component on GPT-2. Second, deceleration along the trajectory is the direct mechanical signature of a learned restoring force: 97.9% of triplets decelerate, and a permutation null at z < −11 establishes that sequential token order produces significantly smoother trajectories than random orderings. Both findings replicate on Pythia-160M. These descriptive results establish local hallmarks of damped Lagrangian flow on a bounded attractive potential as a property of trained attention transformers. The natural follow-up, whether the per-layer hidden-state updates globally derive from a single shared scalar potential, is empirically testable. We fit the joint regression Δh_ℓ ≈ α_ℓ · v_ℓ − β_ℓ · ∇V_ψ(h_ℓ) across all layers with a single learned V_ψ on frozen checkpoints, and obtain a three-way architectural separator on Tiny Shakespeare: a weight-tied scalar-potential autoregressive flow (used here as a positive control) attains median per-layer test R² = 0.957 with a uniform layer profile; a scale- and data-matched 8M-parameter GPT-2-style decoder reaches R² = 0.54 with monotonic decay; pretrained GPT-2 small reaches R² = 0.46 with a bathtub profile (middle-band mean 0.09).
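The closed-form identity stated above can be checked numerically on any triplet of vectors. The sketch below is illustrative only and rests on assumptions the abstract does not spell out: d₁ = h₁ − h₀ and d₂ = h₂ − h₁ are consecutive displacements, the acceleration is a = d₂ − d₁, a⊥ is the part of a orthogonal to d₁, and the cosine STP loss is 1 − cos(d₁, d₂). The function name `stp_decomposition` is hypothetical. Under these assumed definitions the two expressions agree whenever the angle between d₁ and d₂ is at most 90°.

```python
import numpy as np

def stp_decomposition(h0, h1, h2):
    """Compare the direct cosine loss with the closed form (assumed definitions)."""
    d1 = h1 - h0                      # first displacement (assumed definition)
    d2 = h2 - h1                      # second displacement (assumed definition)
    a = d2 - d1                       # discrete acceleration (assumed definition)

    t_hat = d1 / np.linalg.norm(d1)   # unit tangent along the first step
    a_par = np.dot(a, t_hat)          # signed tangential acceleration
    a_perp = a - a_par * t_hat        # component of a normal to the trajectory

    cos = np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
    loss_direct = 1.0 - cos

    ratio = np.dot(a_perp, a_perp) / np.dot(d2, d2)
    # Clamp at zero to guard against tiny negative values from rounding.
    loss_closed = 1.0 - np.sqrt(max(0.0, 1.0 - ratio))
    return loss_direct, loss_closed, a_par

# Small hand-checkable triplet in 2D: the second step is shorter and turned by 45 degrees.
direct, closed, a_par = stp_decomposition(
    np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([1.5, 0.5])
)
print(direct, closed, a_par)
```

In this sketch a negative `a_par` corresponds to deceleration along the trajectory, and the loss value is unaffected by changes to `a_par` that leave the direction of d₂ fixed, which is one way to read the statement that the loss is blind to tangential acceleration.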
Three independent controls rule out alternative explanations: an oracle fit using the positive control's own potential attains R² = 0.931 on the LayerNorm variant (and 1.0000 on the base architecture); a V_ψ capacity sweep over a 16× parameter band rules out expressivity limitation; and the separation reproduces under orthogonal token-direction coordinates. All three architectures pass a velocity-aware Jacobian-symmetry test at PCA-16, making local per-step conservativity universal while global shared-potential structure remains architectural. We therefore characterise the Lagrangian content of trained attention transformers as locally conservative but globally not derivable from a shared scalar potential. We provide both a sharper reading of what STP-style training objectives do and do not measure, and a reproducible architectural diagnostic that other globally-Lagrangian designs can be evaluated against.
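As a rough illustration of the oracle-style control, the per-layer part of the regression Δh_ℓ ≈ α_ℓ · v_ℓ − β_ℓ · ∇V(h_ℓ) reduces to a two-coefficient least-squares problem once the potential's gradient is known. The sketch below is a minimal version under stated assumptions, not the procedure used for the reported numbers: the function name `fit_layer` is hypothetical, the inputs are assumed to be arrays of shape (samples, dimension), and R² is computed in-sample over flattened coordinates rather than on a held-out test split.

```python
import numpy as np

def fit_layer(delta_h, v, grad_V):
    """Least-squares fit of delta_h ≈ alpha * v - beta * grad_V for one layer.

    All inputs are assumed to have shape (n_samples, dim).
    Returns (alpha, beta, r2), with r2 computed in-sample.
    """
    # Two regressors, one column each, over all flattened coordinates.
    X = np.stack([v.ravel(), -grad_V.ravel()], axis=1)
    y = delta_h.ravel()

    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    alpha, beta = coef

    resid = y - X @ coef
    ss_res = resid @ resid
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return alpha, beta, r2

# Synthetic example with a quadratic potential V(h) = 0.5 * |h|^2, so grad V(h) = h.
rng = np.random.default_rng(0)
h = rng.normal(size=(200, 8))
v = rng.normal(size=(200, 8))
delta_h = 0.7 * v - 0.2 * h + 1e-3 * rng.normal(size=(200, 8))
alpha, beta, r2 = fit_layer(delta_h, v, h)
print(alpha, beta, r2)
```

When V_ψ is learned jointly across layers, as in the main diagnostic, the gradient would come from automatic differentiation of a shared network instead of being supplied directly; that part is outside this sketch.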
Dimitar Gueorguiev studied this question.