What question did this study set out to answer?

To understand why sequential generation systems produce noticeable temporal breaks despite smooth training data.

March 18, 2026 · Open Access

Why Independent Sequential Generators Tremble: A First-Principles Kinematic Analysis with Cross-Domain Validation

Key Points

Explains why sequential generation systems produce noticeable temporal discontinuities even when trained on smooth data.
Develops a first-principles kinematic framework built on four propositions with exact, zero-parameter predictions.
Derives closed-form scaling laws for jerk and boundary discontinuity as functions of the generation method.
Validates the predictions in robot manipulation (OpenVLA, Octo, π₀) and in human motion generation (HumanML3D).
With per-step independent generation, the velocity autocorrelation approaches a universal limit of -0.5.
Temporal ensembling reduces jerk as 12σ²/N².
Action chunking dilutes boundary discontinuity as 1/K, which gives smoother outputs.
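
The -0.5 limit in the key points can be reproduced with a short worked derivation. This is a sketch under an assumed error model, not the paper's own proof: the output is written as a smooth trajectory s_t plus an error e_t that is drawn independently at every step, and the symbols s_t, e_t, and u_t are introduced here for illustration only.

```latex
% Assumed model (illustrative, not taken from the paper):
% x_t = s_t + e_t, with s_t smooth and e_t i.i.d., E[e_t] = 0, Var(e_t) = \sigma^2.
% u_t is the error contribution to the finite-difference velocity.
\begin{align*}
u_t &= e_{t+1} - e_t, \\
\operatorname{Var}(u_t) &= \operatorname{Var}(e_{t+1}) + \operatorname{Var}(e_t) = 2\sigma^2, \\
\operatorname{Cov}(u_t, u_{t+1}) &= \operatorname{Cov}(e_{t+1} - e_t,\; e_{t+2} - e_{t+1}) = -\operatorname{Var}(e_{t+1}) = -\sigma^2, \\
\rho_u(1) &= \frac{\operatorname{Cov}(u_t, u_{t+1})}{\operatorname{Var}(u_t)} = \frac{-\sigma^2}{2\sigma^2} = -\tfrac{1}{2}.
\end{align*}
```

Under this model the measured velocity also contains the smooth term s_{t+1} - s_t, so -0.5 is reached only to the extent that the error term dominates the velocity variance, which is consistent with the result being stated as a limit.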

Abstract

Sequential generation systems that produce outputs independently at each timestep, including vision-language-action (VLA) models in robotics and per-frame motion generators in animation, exhibit pronounced temporal discontinuity even when trained on smooth demonstrations. We develop a first-principles kinematic framework that explains this phenomenon through four propositions with exact, zero-parameter predictions. Our key theoretical result is that per-step independent generation drives the velocity autocorrelation toward a universal limit of -0.5 and maximizes jerk among all same-energy error processes (the spectral worst case). We derive closed-form scaling laws: temporal ensemble reduces jerk as 12σ²/N², and action chunking dilutes boundary discontinuity as 1/K. We validate all four predictions in two domains: (i) robot manipulation, where three VLA families (OpenVLA, Octo, π₀) confirm the theory with R² > 0.99, including a controlled experiment isolating inference mechanism from model weights; and (ii) human motion generation, where a pre-registered experiment on HumanML3D yields Cohen's d = 9.0. Our framework provides design principles explaining why action chunking, diffusion, and temporal ensembling all improve smoothness, grounded in first principles rather than empirical tuning.
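
The two closed-form numbers in the abstract can be checked numerically with a small simulation. The sketch below rests on assumptions that are not stated in the abstract: per-step errors are i.i.d. Gaussian with standard deviation σ, a temporal ensemble is modeled as a uniform moving average over N consecutive errors, and jerk is the third finite difference with unit timestep. Under those assumptions the third difference of the averaged error has six nonzero coefficients (1, -2, 1, -1, 2, -1)/N for N ≥ 3, so its variance is 12σ²/N². The function names `lag1_autocorr` and `simulate` are hypothetical.

```python
import numpy as np


def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a 1-D array."""
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))


def simulate(n_steps=200_000, sigma=0.1, window=1, seed=0):
    """Return (velocity lag-1 autocorrelation, mean squared jerk) of the error process.

    window=1 models per-step independent generation; window=N models an
    assumed uniform moving-average temporal ensemble over N predictions.
    """
    rng = np.random.default_rng(seed)
    errors = rng.normal(0.0, sigma, size=n_steps)  # i.i.d. per-step errors
    if window > 1:
        kernel = np.ones(window) / window
        errors = np.convolve(errors, kernel, mode="valid")  # ensemble average
    velocity = np.diff(errors, n=1)  # first finite difference
    jerk = np.diff(errors, n=3)      # third finite difference
    return lag1_autocorr(velocity), float(np.mean(jerk ** 2))


if __name__ == "__main__":
    sigma = 0.1
    rho, _ = simulate(sigma=sigma, window=1)
    print(f"independent generation: velocity autocorrelation = {rho:.3f}")
    for n in (4, 8, 16):
        _, jerk_ms = simulate(sigma=sigma, window=n)
        predicted = 12 * sigma ** 2 / n ** 2
        print(f"N={n}: simulated jerk = {jerk_ms:.6f}, 12*sigma^2/N^2 = {predicted:.6f}")
```

The simulation covers only the error term, since a smooth underlying trajectory contributes little to finite differences at a fine timestep. The paper's own definitions of jerk and temporal ensembling may differ from the ones assumed here.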

Cite This Study

Woojin Jung studied this question.

synapsesocial.com/papers/69ba434a4e9516ffd37a465d https://doi.org/10.5281/zenodo.19050964
