We present Semantic Simulation as a prescriptive Lagrangian framework: a theory whose primary purpose is to specify how to construct a language-model circuit whose inference dynamics is, by design, a damped Euler–Lagrange flow on a single learned scalar energy field, and hence to propose a concrete alternative design principle for efficient semantic inference. Meaning is modeled as the configuration and motion of discrete units (semantic aspects, properties, particles, and structures) in a metric space Σ equipped with a semantic energy field F, with inference realised as motion of a semantic particle through a bounded attractive scalar potential, parameterised in this paper as a Gaussian energy well for analytic tractability. The framework is built from four formal ingredients: (i) a signature matrix representation P = MX of semantic properties whose singular-value decomposition yields an information content H* and a feasibility ellipsoid S*; (ii) a semantic energy well drawn from the class of bounded attractive potentials, developed here as the Gaussian well V(x) = m·υ²(1 − e^(−κ²x²)), with υ = √(Eₜ/m) and κ = f/υ; (iii) pairwise Property-Attractive–Repulsive Forces (PARF) and Structure-Attractive–Repulsive Forces (SARF); and (iv) a complete Lagrangian L = T − V whose Euler–Lagrange equations reproduce damped second-order dynamics with a tanh braking term near bound states. As a specialisation, the framework provides a precise theoretical underpinning for Joint Embedding Predictive Architectures (JEPA), and in particular for the Semantic Tube Prediction (STP) regularizer of Huang, LeCun, and Balestriero (2026). We establish an exact algebraic identity STP = 1 − √(1 − |a⊥|²/|d₂|²) relating the STP loss to the normal component of hidden-state acceleration, with a conjectured mapping in which a last-layer hidden state represents a semantic property, a phrase a semantic particle, and a sentence a semantic structure.
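The identity above can be checked numerically on a toy triplet. The sketch below is illustrative only and rests on assumptions not stated in the abstract: that the STP loss is one minus the cosine between consecutive hidden-state differences d₁ = h₁ − h₀ and d₂ = h₂ − h₁, that the acceleration is a = d₂ − d₁, and that a⊥ is the part of a normal to d₁. Under those assumptions the two sides agree whenever the cosine is non-negative. The function name `stp_both_sides` is hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def stp_both_sides(h0, h1, h2):
    """Return (1 - cos(d1, d2), 1 - sqrt(1 - |a_perp|^2 / |d2|^2)).

    Assumed definitions (not taken from the paper): d1 and d2 are
    consecutive differences, a = d2 - d1, a_perp is normal to d1.
    """
    d1 = [b - a for a, b in zip(h0, h1)]
    d2 = [b - a for a, b in zip(h1, h2)]
    acc = [y - x for x, y in zip(d1, d2)]
    # Remove the component of the acceleration along d1.
    scale = dot(acc, d1) / dot(d1, d1)
    a_perp = [ai - scale * di for ai, di in zip(acc, d1)]
    cos = dot(d1, d2) / math.sqrt(dot(d1, d1) * dot(d2, d2))
    lhs = 1.0 - cos
    rhs = 1.0 - math.sqrt(max(0.0, 1.0 - dot(a_perp, a_perp) / dot(d2, d2)))
    return lhs, rhs

# Toy triplet: a straight step followed by a slightly bent step.
lhs, rhs = stp_both_sides([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.5, 0.0])
print(lhs, rhs)
```

Only the normal component of the acceleration enters the right-hand side, which is why a purely tangential speed-up or slow-down leaves the loss unchanged under these definitions.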
Descriptive validation on pretrained transformers. Verifying the identity on GPT-2 to machine precision across 1,314 consecutive triplets, we find that tangential acceleration, which is invisible to the STP loss, is approximately twice the normal component and that 97.9% of triplets decelerate, the signature of a property approaching a bound state under a restoring force and a damping field; the pattern replicates on Pythia-160M. A permutation null test (z < −11 for both components) establishes that sequential order produces significantly smoother trajectories than random orderings, quantifying the near-geodesic character of learned transformer representations. These are descriptive findings: trained attention transformers exhibit local hallmarks of the framework's dynamics.
Negative results on conservative retro-fitting. The stronger, prescriptive content of the framework, namely that the per-layer force field derives from a single shared scalar potential, is not realised by attention. We close the natural Lagrangian fitting menu on held-out GPT-2 trajectories: seven scalar-potential functional forms (harmonic, Gaussian, Morse, Lorentzian, log-saturation, Weibull, power), a linear Helmholtz position-coupled skew augmentation Ω·x, and a velocity-coupled electromagnetic-analogue gauge B(x)·ẋ at constant, affine-rank-1, and affine-rank-2 position dependence all tie the static-null floor on held-out data; the train-optimal gauge shrinkage collapses to zero.
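The permutation null test mentioned above can be illustrated on synthetic data. This is a minimal sketch, not the paper's procedure: the smoothness statistic (mean squared second difference), the toy trajectory, and the names `mean_sq_acceleration` and `permutation_z` are all assumptions chosen for illustration, whereas the abstract reports z-scores separately for the tangential and normal components.

```python
import random
import statistics

def mean_sq_acceleration(states):
    """Mean squared norm of the second differences h[t+2] - 2*h[t+1] + h[t]."""
    total = 0.0
    for h0, h1, h2 in zip(states, states[1:], states[2:]):
        total += sum((c - 2.0 * b + a) ** 2 for a, b, c in zip(h0, h1, h2))
    return total / (len(states) - 2)

def permutation_z(states, n_perm=200, seed=0):
    """z-score of the observed statistic against randomly reordered sequences."""
    rng = random.Random(seed)
    observed = mean_sq_acceleration(states)
    null = []
    for _ in range(n_perm):
        shuffled = states[:]
        rng.shuffle(shuffled)
        null.append(mean_sq_acceleration(shuffled))
    return (observed - statistics.mean(null)) / statistics.stdev(null)

# Toy smooth trajectory: a noisy straight line in 4 dimensions (assumed data).
noise = random.Random(1)
trajectory = [[0.1 * t + 0.01 * noise.gauss(0.0, 1.0) for _ in range(4)]
              for t in range(100)]
z = permutation_z(trajectory)
print(z)  # strongly negative when the true order is smoother than random orders
```

A strongly negative z means the sequential order yields smaller accelerations than almost every shuffled order, which is the sense in which the abstract calls the trajectories near-geodesic.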
Six architectural features of standard decoder blocks (asymmetric W_Q ≠ W_K^T, multi-head concatenation, causal mask over prefix history, LayerNorm, distinct W^(ℓ) per layer, and softmax) each independently obstruct conservativity and jointly render the per-layer force non-autonomous in layer index and token context; the attention circuit is therefore outside the class of autonomous, shared-potential Helmholtz (grad + curl) decompositions by design, though at each fixed layer and context it remains locally a Hopfield gradient plus a small skew correction.
Prescriptive contribution: conservative-by-construction language models and the shared-potential separator. In response we develop the prescriptive consequence of the framework: the scalar-potential language model (SPLM), a weight-tied autoregressive circuit whose inference is a damped Euler–Lagrange flow on a single learned scalar V_θ(ξ, h) with a causal cumulative-mean context pool ξ, no attention, no multi-head decomposition, no softmax, and no per-layer distinct parameters. The strict shared-potential diagnostic, which fits the per-layer update Δh_ℓ ≈ α_ℓ v_ℓ − β_ℓ ∇V_ψ(h_ℓ) jointly across all layers with a single learned V_ψ, produces a three-way architectural separator on the same Tiny Shakespeare corpus: SPLM at median per-layer test R² = 0.90 (uniform profile); a scale- and data-matched 8M-parameter GPT-2-style decoder at R² = 0.56 (monotonic decay); and pretrained GPT-2 small at R² = 0.45 (bathtub, middle-band mean 0.09). A V_ψ capacity sweep over a 16× parameter band shows the attention failure is structural, not expressivity-limited; an oracle fit using SPLM's own V_θ(ξ, h) attains R² = 1.0000 on every layer and recovers the learned integrator constants to four decimal places; and the separator reproduces under the orthogonal token-direction coordinate system, where only SPLM beats the velocity-only baseline by a non-trivial margin.
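To make the shape of the SPLM update concrete, the following sketch applies one weight-tied damped step repeatedly, with a causal cumulative-mean pool ξ. It is a toy under stated assumptions, not the paper's model: the quadratic potential V = ½|h − ξ|² stands in for the learned V_θ, and the step size, damping, and mass values, along with the names `cumulative_mean`, `grad_V`, `splm_layer`, and `run`, are hypothetical. Each step has the form Δh = α·v − β·∇V that the shared-potential diagnostic fits.

```python
def cumulative_mean(states):
    """xi[t] = mean of states[0..t], so position t sees only its causal prefix."""
    pooled, running = [], [0.0] * len(states[0])
    for t, h in enumerate(states, start=1):
        running = [r + x for r, x in zip(running, h)]
        pooled.append([r / t for r in running])
    return pooled

def grad_V(xi, h):
    """Gradient in h of the toy potential V = 0.5 * |h - xi|^2 (xi held fixed)."""
    return [x - c for x, c in zip(h, xi)]

def splm_layer(states, velocities, dt=0.1, gamma=0.5, mass=1.0):
    """One damped step: v <- (1 - gamma*dt)*v - (dt/mass)*grad V, then h <- h + dt*v."""
    xi = cumulative_mean(states)
    new_states, new_velocities = [], []
    for h, v, c in zip(states, velocities, xi):
        g = grad_V(c, h)
        v_next = [(1.0 - gamma * dt) * vi - dt * gi / mass for vi, gi in zip(v, g)]
        new_velocities.append(v_next)
        new_states.append([hi + dt * vi for hi, vi in zip(h, v_next)])
    return new_states, new_velocities

def run(states, n_layers=8):
    """Apply the same step (same constants, same potential) at every layer."""
    velocities = [[0.0] * len(states[0]) for _ in states]
    for _ in range(n_layers):
        states, velocities = splm_layer(states, velocities)
    return states

out = run([[1.0, 2.0], [3.0, -1.0]])
print(out)
```

In this toy the first token never moves, because its context pool equals its own state and the force vanishes, while later tokens drift toward the mean of their prefix.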
All three architectures pass a velocity-aware Jacobian-symmetry test at PCA-16, making local per-step conservativity universal while global shared-potential structure remains architectural. The framework is therefore prescriptive in a technically precise sense: trained attention transformers are locally conservative but globally not derivable from a shared scalar potential, whereas a Lagrangian-integrator architecture realises the Semantic Simulation dynamics by design.
v4 contribution: a causally-audited family of SPLM-based architectures. Every architecture reported in v4 ships with a startup causal-violation probe that tests, before any optimiser step, that no output position t' < t depends on the embedding at position t, both as a perturbation test (Δ = 0 at strict 1e-6 tolerance) and as a gradient-Jacobian test (∂logits_{t'}/∂embed_t ≡ 0 for t' < t). Seven SPLM-family architectures under the v4 umbrella are jointly audited and pass the probe by construction: vanilla SPLM, multi-channel-ξ SPLM (HiPPO-LegT, K-EMA, S4D), the layer-type Helmholtz hybrid (Q9d), the two-stage Variant A and Variant B hybrids, and the PARF-augmented SPLM (Q9c) with its two .detach() causal-reduction points (the inherited ξ-pool detach and the new pair-source h_s detach). The descriptive findings on pretrained GPT-2 and Pythia are independent of any SPLM integrator and stand unchanged.
Multi-seed language-model-quality ceiling for vanilla SPLM. Vanilla SPLM trains stably (no NaN divergences across the pilot configurations) and learns a non-trivial causal language model. In a 4000-step, 16.5M-parameter pilot on TinyStories (LayerNorm-after-step configuration, multi-channel cumulative-mean context ξ), we measure val perplexity 14.78 for the best multi-channel SPLM variant (4-channel exponential moving-average context, learnable decays). A scale- and budget-matched attention baseline on the same corpus reaches val_ppl ~8.
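The perturbation form of the causal-violation probe described above can be sketched on scalar toy models. This is an illustration under assumptions, not the shipped probe: `causal_pool` and `leaky_pool` are hypothetical stand-ins for a causal and a deliberately non-causal model, embeddings are reduced to single floats, and only the 1e-6 tolerance is taken from the text.

```python
def causal_pool(embeds):
    """Causal toy model: output t is the mean of embeddings 0..t."""
    out, running = [], 0.0
    for t, e in enumerate(embeds, start=1):
        running += e
        out.append(running / t)
    return out

def leaky_pool(embeds):
    """Deliberately non-causal toy model: every output sees the full-sequence mean."""
    mean = sum(embeds) / len(embeds)
    return [mean for _ in embeds]

def causal_violations(model, embeds, eps=1.0, tol=1e-6):
    """Perturb each position t and collect (t, t_prime) pairs, t_prime < t, whose output moved."""
    base = model(embeds)
    violations = []
    for t in range(len(embeds)):
        perturbed = list(embeds)
        perturbed[t] += eps
        out = model(perturbed)
        for t_prime in range(t):
            if abs(out[t_prime] - base[t_prime]) > tol:
                violations.append((t, t_prime))
    return violations

sequence = [0.3, -1.2, 0.7, 2.0, -0.5]
print(causal_violations(causal_pool, sequence))      # no pairs reported
print(len(causal_violations(leaky_pool, sequence)))  # every earlier position leaks
```

Running the probe before any optimiser step, as the text describes, means a leak is caught from the architecture alone rather than from trained weights.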
Vanilla SPLM is therefore not competitive on PPL with attention at this configuration: the gap is ~7 PPL, roughly 1.85×. The contribution at this arm is structural: the dynamics is implementable, trains cleanly, is exactly causal, and the gap is mechanistically attributable rather than mysterious.
Information-bottleneck programme: a mechanistic decomposition of the vanilla-SPLM-vs-attention gap. The natural follow-up question (why does vanilla SPLM trail attention on PPL by exactly this much, and which structural primitive in attention is responsible for each PPL point of the gap?) is testable once the vanilla SPLM ceiling above is fixed. We construct a ladder of multi-channel context summaries spanning the information-redundancy axis from K parallel exponential moving averages (K-EMA, the v2 default) through structured-orthogonal bases (HiPPO-LegT, learnable-Δt HiPPO, S4D with learnable diagonal complex A and B) and retrain each at an identical configuration. The binding constraint is not the channel summary's information content but the downstream MLP head's fit difficulty: the smooth, multi-scale, partially-redundant K-EMA bank supplies the inductive bias that V_θ, parameterised as an MLP, can exploit at this token budget.
v4 reframing: vanilla SPLM as the Lagrangian counterfactual; the SPLM family as gap-closing competitors. The v4 empirical picture splits the framework's prescriptive content into two complementary contributions. Vanilla SPLM is preserved as the maximally-structured Lagrangian counterfactual: attention's ~7-PPL advantage at matched scale measures the cost of the maximally-autonomous commitment (single shared V_θ, weight tying, no pair coupling).
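The K-EMA context summary named above, K parallel exponential moving averages over the causal prefix, can be sketched as follows. This is a simplified illustration under assumptions: tokens are single floats, the decay values are fixed rather than learnable, and the function name `k_ema_bank` is hypothetical.

```python
def k_ema_bank(tokens, decays):
    """Return, for each position, K exponential moving averages of the prefix.

    Each channel k follows s_k <- decay_k * s_k + (1 - decay_k) * x, so small
    decays track the most recent token and large decays keep a long memory.
    """
    state = [0.0] * len(decays)
    bank = []
    for x in tokens:
        state = [d * s + (1.0 - d) * x for d, s in zip(decays, state)]
        bank.append(list(state))
    return bank

# Four channels at assumed, fixed time scales.
bank = k_ema_bank([1.0, 1.0, 1.0], decays=[0.0, 0.5, 0.9, 0.99])
for row in bank:
    print(row)
```

Because every channel reads the same stream at a different time scale, the channels overlap heavily in content, which is the partial redundancy the text contrasts with structured-orthogonal bases such as HiPPO-LegT.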
The SPLM family as a whole (vanilla SPLM, the multi-channel-ξ variants, the hybrid Q9d / Variant A / Variant B, and the PARF-augmented Q9c) additionally contains gap-closing competitors that retain the prescribed structure while admitting either a finite, measurable budget of attention blocks (the hybrid arms) or the framework's own pair force law as the routing primitive (PARF). Five structural properties are SPLM-family-wide and absent from a vanilla attention decoder: (i) provable causality; (ii) approximate energy conservation under the second-order Euler–Lagrange flow on every S-block; (iii) mechanical interpretability of S-block token trajectories (position hₜ, velocity vₜ, learned per-token mass mₜ, explicit potential V_θ(ξₜ, hₜ)); (iv) the transferable mechanistic finding that the V_θ-fit-difficulty bottleneck applies to any downstream MLP consuming a multi-channel context summary, not just SPLM's; and (v) the basis-class hierarchy at LM scale, the first careful comparison of K-EMA / HiPPO-LegT / S4D as drivers of
Dimitar Gueorguiev
Dimitar Gueorguiev (Thu,) studied this question.
synapsesocial.com/papers/6a0567d2a550a87e60a200f0 — DOI: https://doi.org/10.5281/zenodo.20138055