Sequential generation systems that produce outputs independently at each timestep, including vision-language-action (VLA) models in robotics and per-frame motion generators in animation, exhibit pronounced temporal discontinuity even when trained on smooth demonstrations. We develop a first-principles kinematic framework that explains this phenomenon through four propositions with exact, zero-parameter predictions. Our key theoretical result is that per-step independent generation drives the velocity autocorrelation toward a universal limit of -0.5 and maximizes jerk among all same-energy error processes, making it the spectral worst case. We derive closed-form scaling laws: temporal ensembling reduces jerk as 12σ²/N², and action chunking dilutes boundary discontinuity as 1/K. We validate all four predictions in two domains: (i) robot manipulation, where three VLA families (OpenVLA, Octo, π₀) confirm the theory with R² > 0.99, including a controlled experiment that isolates the inference mechanism from the model weights; and (ii) human motion generation, where a pre-registered experiment on HumanML3D yields Cohen's d = 9.0. Our framework provides design principles that explain why action chunking, diffusion, and temporal ensembling all improve smoothness, grounded in first principles rather than empirical tuning.
Woojin Jung studied this question.
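The -0.5 limit quoted in the abstract can be illustrated with a small simulation. The sketch below is not the paper's derivation; it assumes that per-step output errors are i.i.d. zero-mean Gaussian with standard deviation `sigma`, and that velocity is a unit-timestep first difference. Under those assumptions the velocity error is v_t = e_{t+1} - e_t, so Cov(v_t, v_{t+1}) = -σ² and Var(v_t) = 2σ², which gives a lag-1 autocorrelation of -1/2 regardless of σ. The function name and parameters are hypothetical and chosen only for this example.

```python
import random


def velocity_lag1_autocorr(n_steps=200_000, sigma=1.0, seed=0):
    """Estimate the lag-1 autocorrelation of velocity errors.

    Assumption (for illustration only): errors are i.i.d. Gaussian
    with standard deviation `sigma`, one independent draw per timestep.
    """
    rng = random.Random(seed)
    # Independent per-step output errors e_t.
    errors = [rng.gauss(0.0, sigma) for _ in range(n_steps)]
    # Velocity error as a first difference with unit timestep.
    vel = [b - a for a, b in zip(errors, errors[1:])]
    mean = sum(vel) / len(vel)
    centered = [v - mean for v in vel]
    # Sample variance and lag-1 autocovariance of the velocity error.
    var = sum(c * c for c in centered) / len(centered)
    cov = sum(a * b for a, b in zip(centered, centered[1:])) / (len(centered) - 1)
    return cov / var


rho = velocity_lag1_autocorr()
print(f"lag-1 velocity autocorrelation: {rho:.3f}")
```

With the default settings the printed estimate should land close to -0.5, and changing `sigma` should leave it essentially unchanged, which is consistent with the abstract describing the prediction as zero-parameter.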