Hidden-state trajectories in large language models trace smooth, low-curvature paths, yet the attention mechanism that produces them admits no intrinsic notion of distance. Here we prove that this gap is structural, not empirical. Our Conservative Obstruction Theorem shows that no scalar potential on token states can reproduce three defining properties of scaled dot-product attention—asymmetric coupling, coupling–content decoupling, and a normalised influence budget—regardless of dynamical order. Standard attention therefore cannot itself be the generator of an intrinsic conservative Riemannian dynamics on token states; recent metrics extracted from transformers are necessarily descriptive overlays, not laws of motion. We then give the positive construction: a second-order Lagrangian language model whose inference is a damped Euler–Lagrange flow, equipping semantic space with an intrinsic Jacobi metric. A complementary Attention Optimality Conjecture pins attention and this geometric alternative to opposite corners of one design lattice. On TinyStories the conservative architecture achieves constant-memory inference at a perplexity premium over attention, and a shared-potential diagnostic confirms the predicted separation between the two dynamical families.
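The three attention properties named in the abstract can be illustrated numerically. The following is a minimal NumPy sketch, not the paper's code: the helper `attention_weights`, the random projections `W_q`, `W_k`, `W_v`, and the toy sizes are all assumptions made for illustration. It also assumes one plausible reading of the terms: "asymmetric coupling" as a weight matrix with A[i, j] != A[j, i], "coupling–content decoupling" as weights that depend only on queries and keys while content comes from values, and "normalised influence budget" as softmax rows that sum to one.

```python
import numpy as np

def attention_weights(Q, K):
    """Row-wise softmax of scaled dot-product scores (hypothetical helper)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 4, 8                                  # toy sizes, assumed for illustration
X = rng.normal(size=(n, d))                  # token states
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

A = attention_weights(X @ W_q, X @ W_k)      # coupling: uses queries and keys only
out = A @ (X @ W_v)                          # content: uses values only

# Normalised influence budget: each row of A sums to 1.
row_sums = A.sum(axis=1)

# Asymmetric coupling: in general A[i, j] != A[j, i].
is_symmetric = np.allclose(A, A.T)

# Coupling-content decoupling: swapping the value projection changes the
# output but leaves the coupling matrix A untouched, since A never sees W_v.
W_v_alt = rng.normal(size=(d, d))
out_alt = A @ (X @ W_v_alt)
content_changed = not np.allclose(out, out_alt)
```

By contrast, a pairwise interaction derived from a single scalar potential that depends only on the distance between two states produces equal and opposite forces, which is one intuition for why the abstract treats asymmetry as an obstruction; the sketch does not attempt to reproduce the theorem itself.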
Dimitar Gueorguiev
Dimitar Gueorguiev (Wed,) studied this question.
synapsesocial.com/papers/6a2269a2763171746d548451 — DOI: https://doi.org/10.5281/zenodo.20531373