Can a safety gate permit unbounded beneficial self-modification while maintaining bounded cumulative risk? We formalize this question through dual conditions — requiring Σδₙ 1, any classifier-based gate under overlapping safe/unsafe distributions forces ΣTPRₙ 0 (Theorem 2), validated on GPT-2 (d_LoRA = 147,456). Comprehensive empirical validation is in the companion paper.
Arsenios Scrivens (Thu,) studied this question.