Can classifier-based safety gates maintain reliable oversight as AI systems improve over hundreds of iterations? We provide comprehensive empirical evidence that they cannot. On a self-improving neural controller (d = 240), eighteen classifier configurations (spanning MLPs, SVMs, random forests, k-NN, Bayesian classifiers, and deep networks achieving 100% training accuracy) all fail the dual conditions for safe self-improvement. Three safe RL gate paradigms (CPO, Lyapunov, safety shielding) also fail under practical computational budgets. The results extend to MuJoCo benchmarks (Reacher-v4, Swimmer-v4, HalfCheetah-v4). At controlled distribution separations up to Δs = 2.0, all classifiers still fail, demonstrating that the impossibility is structural. We then show the impossibility is specific to classification, not to safe self-improvement itself. A Lipschitz ball verifier achieves zero false accepts with 100% soundness across dimensions d ∈ {84, 240, 768, 2688, 5760, 9984, 17408}. Ball chaining demonstrates feasibility of unbounded parameter-space traversal: on MuJoCo Reacher-v4, chains yield reward improvement with δ = 0 throughout; on Qwen2.5-7B-Instruct (7.6B parameters) during LoRA fine-tuning, 42 chain transitions traverse 234× the single-ball radius with zero detected safety violations. Companion theory paper: Scrivens (2026), "Information-Theoretic Limits of Safety Verification for Self-Improving Systems."
Arsenios Scrivens studied this question.
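The abstract does not spell out how the Lipschitz ball verifier or ball chaining is implemented, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes a scalar safety margin that is positive exactly when the parameters are safe and that is L-Lipschitz in the parameters; under that assumption, every point strictly within distance margin / L of a safe center is also safe, so a gate that accepts only such points makes no false accepts. The names `ball_radius`, `accepts`, `chain`, and the toy `margin` function are hypothetical and chosen for illustration.

```python
import math


def ball_radius(margin_value, lipschitz):
    """Certified radius margin / L around a center (0 if the center is unsafe)."""
    if lipschitz <= 0:
        raise ValueError("lipschitz must be positive")
    return max(margin_value, 0.0) / lipschitz


def accepts(center, radius, candidate):
    """Accept only candidates strictly inside the certified ball."""
    return math.dist(center, candidate) < radius


def chain(start, proposals, safety_margin, lipschitz):
    """Ball chaining: re-center on each accepted proposal and re-certify there."""
    center = list(start)
    path = [center]
    for candidate in proposals:
        radius = ball_radius(safety_margin(center), lipschitz)
        if accepts(center, radius, candidate):
            center = list(candidate)
            path.append(center)
        # Rejected proposals are dropped; the chain stays at the current center.
    return path


# Toy safety margin (an assumption): parameters are safe while theta[1] < 1.
# This function is 1-Lipschitz in the Euclidean norm.
def margin(theta):
    return 1.0 - theta[1]


proposals = [[0.9, 0.0], [1.8, 0.0], [1.8, 1.5], [2.7, 0.0]]
path = chain([0.0, 0.0], proposals, margin, lipschitz=1.0)
print(path)
```

In this toy run each ball has radius 1, yet the chain ends 2.7 units from the starting point, because every accepted step becomes the center of a freshly certified ball. The proposal `[1.8, 1.5]` lies outside the current ball and is rejected without ever being evaluated for safety.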