What question did this study set out to answer?

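This study asks how nonlinear approximation, gradient-flow reachability, power-law spectral learning, compute-optimal training, and transformer attention can be connected in a single theorem-based framework, and what those connections imply for deep learning models, particularly transformers.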

June 20, 2026 · Open Access

Stability, Expressivity, and Spectra

Key Points

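Developed a theorem-based framework connecting nonlinear approximation, gradient-flow reachability, power-law spectral learning, compute-optimal training, and transformer attention.
Proved two finite-sample results for gradient-flow reachable-set capacity, including a uniform risk bound for a random class generated from a training sample.
Ran eight fixed-seed, fully reproducible numerical experiments covering effective dimension, early stopping, optimization time, fixed compute, kernel spectrum, attention expansion, and controlled MLP comparisons.
The uniform risk bound holds when evaluated on an independent holdout sample, with complexity controlled by initialization entropy, flow stability, and function-space path length.
In the fixed-spectrum regime, a Rademacher bound for the resolved residual ball scales as O(√(d_T/n)) under bounded spectral leverage, where d_T ≍ T^(1/κ) is the resolved-mode dimension.
Under a compute budget C = Pt, the optimal error scales as C^(−β/(1+κν)); gradient flow corresponds to ν = 1 and an ideal square-root accelerated filter to ν = 1/2.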
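
The compute-optimal exponents can be read as a balance between two error sources. One heuristic reading, which is an assumption here and not the study's proof: truncating to P modes leaves a tail of order P^(−β), and a filter run for time t resolves modes up to roughly t^(1/(κν)), leaving a residual of order t^(−β/(κν)). Equating the two under C = Pt gives the stated P* and t*. The sketch below simply evaluates those closed-form exponents. The function name `compute_optimal_allocation` is hypothetical, and every hidden constant in the ≍ relations is set to 1 for illustration.

```python
def compute_optimal_allocation(C, kappa, beta, nu):
    """Evaluate the stated scaling laws with all hidden constants set to 1.

    C     : compute budget, C = P * t
    kappa : eigenvalue decay exponent, lambda_i ~ i^(-kappa)
    beta  : target-energy exponent, a_i ~ i^(-(1 + beta))
    nu    : filter order (1 for gradient flow, 1/2 for the ideal
            square-root accelerated filter)
    """
    s = 1.0 + kappa * nu
    P_star = C ** (1.0 / s)            # P* ~ C^(1/(1 + kappa*nu))
    t_star = C ** (kappa * nu / s)     # t* ~ C^(kappa*nu/(1 + kappa*nu))
    error_exponent = -beta / s         # error ~ C^(-beta/(1 + kappa*nu))
    return P_star, t_star, error_exponent


# Illustrative values only: kappa = 2, beta = 1, C = 1e6.
for nu in (1.0, 0.5):
    P_star, t_star, exponent = compute_optimal_allocation(1e6, 2.0, 1.0, nu)
    print(f"nu={nu}: P*={P_star:.1f}, t*={t_star:.1f}, error exponent={exponent:.3f}")
```

For these illustrative values the error exponent moves from −1/3 at ν = 1 to −1/2 at ν = 1/2, which is the sense in which an accelerated filter improves the compute scaling.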

Abstract

We develop a theorem-based synthesis connecting nonlinear approximation, gradient-flow reachability, power-law spectral learning, compute-optimal training, and transformer attention. First, we sharpen the stability–expressivity interpretation of the deep ReLU approximation phase diagram: the fast phase uses growing depth and discontinuous target-to-weight selection, but does not invalidate parameter-count, norm, or pseudodimension bounds. Second, we introduce gradient-flow reachable-set capacity and prove two finite-sample results. A random class generated from a training sample admits a uniform risk bound when evaluated on an independent holdout sample, with complexity controlled by initialization entropy, flow stability, and function-space path length. In the fixed-spectrum regime, a Rademacher bound for the resolved residual ball scales as O(√(d_T/n)) under bounded spectral leverage, where d_T ≍ T^(1/κ) is the resolved-mode dimension. A Duhamel perturbation theorem further transfers frozen-kernel covers and empirical Rademacher bounds to nonlinear prediction dynamics up to an explicit integrated kernel-drift term Δ_T. We then replace a purely phenomenological compute ansatz by an exact controlled theorem: if the eigenvalues satisfy λᵢ ≍ i^(−κ), the target energy satisfies aᵢ ≍ i^(−(1+β)), and an order-ν spectral filter has residual exp(−tλ^ν), then under a compute budget C = Pt the optimal error (the infimum of ℰ_ν(P, t) subject to Pt ≤ C) scales as C^(−β/(1+κν)), attained at P* ≍ C^(1/(1+κν)) and t* ≍ C^(κν/(1+κν)). Gradient flow corresponds to ν = 1; an ideal square-root accelerated filter corresponds to ν = 1/2. Finally, a high-temperature expansion of softmax attention yields an exact local covariance bridge: with keys used as values, the first nonconstant attention response is (β/√d)·Σ̂ₖ·q, where Σ̂ₖ is the empirical key covariance, so its eigenmodes are transmitted proportionally to their eigenvalues.
Eight fixed-seed, fully reproducible numerical experiments test the effective-dimension, early-stopping, optimization-time, fixed-compute, kernel-spectrum, attention-expansion, and controlled MLP comparisons. Endpoint-aware rank–density translation and an AM–GM correction to a proposed spectral-spread criterion complete the analysis. Proved statements, imported results, diagnostics, and open extensions are separated explicitly.
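
The attention statement in the abstract can be checked numerically in a few lines. The following is a minimal sketch, not the study's experiment code: it assumes a single query, Gaussian random keys, the 1/n (biased) empirical covariance, and scores of the form β·(k·q)/√d. Here `beta` is the attention scale, not the target-energy exponent. The helper names `softmax_attention` and `first_order_response` are hypothetical. The constant term of the expansion is the mean key, and the first nonconstant term is (β/√d)·Σ̂ₖ·q.

```python
import numpy as np


def softmax_attention(q, K, beta):
    """Single-query softmax attention with the keys reused as values."""
    d = K.shape[1]
    scores = beta * (K @ q) / np.sqrt(d)
    scores = scores - scores.max()      # stabilise the exponentials
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights @ K


def first_order_response(q, K, beta):
    """Mean key plus the linear term (beta / sqrt(d)) * Sigma_K @ q."""
    n, d = K.shape
    mean_key = K.mean(axis=0)
    centered = K - mean_key
    sigma_k = centered.T @ centered / n  # biased (1/n) empirical covariance
    return mean_key + (beta / np.sqrt(d)) * (sigma_k @ q)


rng = np.random.default_rng(0)           # fixed seed for reproducibility
K = rng.normal(size=(64, 8))             # 64 keys in dimension 8 (assumed sizes)
q = rng.normal(size=8)

for beta in (1e-1, 1e-2, 1e-3):
    gap = np.linalg.norm(
        softmax_attention(q, K, beta) - first_order_response(q, K, beta)
    )
    print(f"beta={beta:g}  gap={gap:.3e}")
```

If the expansion holds, the gap should shrink roughly quadratically as `beta` decreases, since the neglected terms start at second order in the scores.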
