We develop a theorem-based synthesis connecting nonlinear approximation, gradient-flow reachability, power-law spectral learning, compute-optimal training, and transformer attention. First, we sharpen the stability–expressivity interpretation of the deep ReLU approximation phase diagram: the fast phase uses growing depth and discontinuous target-to-weight selection, but does not invalidate parameter-count, norm, or pseudodimension bounds. Second, we introduce gradient-flow reachable-set capacity and prove two finite-sample results. A random class generated from a training sample admits a uniform risk bound when evaluated on an independent holdout sample, with complexity controlled by initialization entropy, flow stability, and function-space path length. In the fixed-spectrum regime, a Rademacher bound for the resolved residual ball scales as O(√(d_T/n)) under bounded spectral leverage, where d_T ≍ T^(1/κ) is the resolved-mode dimension. A Duhamel perturbation theorem further transfers frozen-kernel covers and empirical Rademacher bounds to nonlinear prediction dynamics up to an explicit integrated kernel-drift term Δ_T. We then replace a purely phenomenological compute ansatz by an exact controlled theorem: if λᵢ ≍ i^(−κ), the target energy satisfies aᵢ ≍ i^(−(1+β)), and an order-ν spectral filter has residual exp(−tλ^ν), then under a compute budget C = Pt the optimal error (the infimum of ℰ_ν(P, t) subject to Pt ≤ C) scales as C^(−β/(1+κν)), attained at P* ≍ C^(1/(1+κν)) and t* ≍ C^(κν/(1+κν)). Gradient flow corresponds to ν = 1; an ideal square-root accelerated filter corresponds to ν = 1/2. Finally, a high-temperature expansion of softmax attention yields an exact local covariance bridge: with keys used as values, the first nonconstant attention response is (β/√d)·Σ̂ₖ·q, where Σ̂ₖ is the empirical key covariance, so its eigenmodes are transmitted proportionally to their eigenvalues.
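The covariance bridge for attention can be checked numerically. The sketch below is illustrative only and is not taken from the paper's experiments. It assumes that β in the attention statement is the inverse temperature multiplying the scaled logits q·kⱼ/√d, that the constant term of the expansion is the mean key, and that Σ̂ₖ is the 1/n-normalized covariance of the keys. The function names `attention_output` and `first_order_prediction` are hypothetical.

```python
import numpy as np


def attention_output(q, K, beta):
    """Softmax attention for one query, with the keys K (n x d) also used as values.

    Assumption: beta is an inverse temperature applied to the logits q.k_j / sqrt(d).
    """
    d = K.shape[1]
    logits = beta * (K @ q) / np.sqrt(d)
    weights = np.exp(logits - logits.max())  # subtract the max for numerical stability
    weights /= weights.sum()
    return weights @ K


def first_order_prediction(q, K, beta):
    """Mean key plus the linear response (beta / sqrt(d)) * Sigma_K q."""
    n, d = K.shape
    k_bar = K.mean(axis=0)
    centered = K - k_bar
    sigma_k = centered.T @ centered / n  # empirical key covariance (1/n normalization)
    return k_bar + (beta / np.sqrt(d)) * (sigma_k @ q)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K = rng.exponential(size=(64, 8))  # skewed keys, so the second-order term is nonzero
    q = rng.normal(size=8)
    for beta in (0.04, 0.02, 0.01):
        gap = np.linalg.norm(
            attention_output(q, K, beta) - first_order_prediction(q, K, beta)
        )
        print(f"beta={beta:.2f}  remainder norm={gap:.3e}")
```

If the expansion holds as stated, halving β should shrink the remainder by roughly a factor of four, since the neglected terms start at order β².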
Eight fixed-seed, fully reproducible numerical experiments test the effective-dimension, early-stopping, optimization-time, fixed-compute, kernel-spectrum, attention-expansion, and controlled MLP comparisons. Endpoint-aware rank–density translation and an AM–GM correction to a proposed spectral-spread criterion complete the analysis. Proved statements, imported results, diagnostics, and open extensions are separated explicitly.
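The compute-optimal exponents stated above can be probed with a small grid search. The following is a minimal sketch, not the paper's experiment. It assumes a specific error model that the abstract does not spell out: the first P modes keep the filter residual exp(−tλᵢ^ν) on their energy aᵢ, modes beyond P are lost entirely, and the power laws λᵢ = i^(−κ) and aᵢ = i^(−(1+β)) hold exactly with unit constants. The names `optimal_allocation` and `fit_exponents` are hypothetical.

```python
import numpy as np


def optimal_allocation(C, kappa=2.0, beta=1.0, nu=1.0, p_max=1000, n_modes=200_000):
    """Grid-search the model size P (with t = C / P) that minimizes an assumed error.

    Assumed model: E(P, t) = sum_{i<=P} a_i exp(-t lam_i^nu) + sum_{i>P} a_i,
    with lam_i = i^(-kappa) and a_i = i^(-(1 + beta)).
    """
    i = np.arange(1, n_modes + 1, dtype=float)
    a = i ** (-(1.0 + beta))                  # target energy per mode
    lam_nu = i[:p_max] ** (-kappa * nu)       # lam_i^nu for the candidate modes
    P = np.arange(1, p_max + 1, dtype=float)
    t = C / P                                 # training time allowed by the budget
    decay = np.exp(-np.outer(t, lam_nu))      # decay[p, i] = exp(-t_p * lam_i^nu)
    mask = np.tril(np.ones((p_max, p_max)))   # keep only modes i <= P in each row
    resolved = (mask * decay) @ a[:p_max]     # residual energy in represented modes
    tail = a.sum() - np.cumsum(a)[:p_max]     # energy of modes beyond P (truncated sum)
    err = resolved + tail
    k = int(np.argmin(err))
    return P[k], t[k], err[k]


def fit_exponents(kappa=2.0, beta=1.0, nu=1.0):
    """Fit log-log slopes of the optimal error and optimal P against the budget C."""
    budgets = np.logspace(3, 7, 9)
    results = [optimal_allocation(C, kappa, beta, nu) for C in budgets]
    p_star = np.array([r[0] for r in results])
    e_star = np.array([r[2] for r in results])
    slope_err = np.polyfit(np.log(budgets), np.log(e_star), 1)[0]
    slope_p = np.polyfit(np.log(budgets), np.log(p_star), 1)[0]
    return slope_err, slope_p


if __name__ == "__main__":
    kappa, beta, nu = 2.0, 1.0, 1.0
    slope_err, slope_p = fit_exponents(kappa, beta, nu)
    print("fitted error slope:", slope_err, " predicted:", -beta / (1 + kappa * nu))
    print("fitted P* slope:   ", slope_p, " predicted:", 1 / (1 + kappa * nu))
```

The fitted slopes carry finite-size and grid-discretization error, so they should be compared with −β/(1+κν) and 1/(1+κν) only approximately. The heuristic behind the prediction is the balance P ≍ t^(1/(κν)), which with Pt = C gives P^(1+κν) ≍ C.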
Miquel Noguer Alonso (Thu,) studied this question.