We present a factorial ablation methodology for causally isolating runtime alignment mechanisms in autonomous AI systems. The approach uses a 3 × 2 × 2 experimental design crossing gate type, temptation generator, and ledger state across 9,100 trials, combined with an adversarial paraphrase protocol that eliminates keyword circularity from alignment testing. We demonstrate the methodology on a learned safety projection built on a 23M-parameter sentence encoder with three auxiliary linear heads. Across external benchmarks, the projection achieves strong transfer without retraining, including 99.5% recall on HarmBench, 99.4% on AdvBench, and 93.0% on SimpleSafetyTests. The central contribution is methodological rather than architectural: a runtime testing protocol that can determine whether a specific alignment mechanism is necessary and sufficient for resistance during autonomous operation. The experiments are conducted in the EVE testbed, but the paper explicitly frames the factorial ablation protocol itself as the contribution.
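As a rough illustration of the 3 × 2 × 2 design, the sketch below enumerates the cells of a full factorial crossing and averages a per-cell outcome over the other factors. The level names (`gate_a`, `generator_a`, `state_a`, and so on) and the helper functions are hypothetical placeholders: the abstract names only the three factors and their level counts, and it does not say how the 9,100 trials are allocated to cells, so no allocation is assumed here.

```python
from itertools import product
from statistics import mean

# Hypothetical level names: the abstract gives only the factor names
# and the 3 x 2 x 2 shape, not the actual levels.
FACTORS = {
    "gate_type": ["gate_a", "gate_b", "gate_c"],
    "temptation_generator": ["generator_a", "generator_b"],
    "ledger_state": ["state_a", "state_b"],
}


def design_cells(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


def marginal_means(cell_rates, factors, factor):
    """Average a per-cell outcome over all other factors.

    cell_rates maps a tuple of levels (in the key order of `factors`)
    to an outcome rate, e.g. the fraction of trials that resisted.
    Returns one averaged rate per level of `factor`.
    """
    idx = list(factors).index(factor)
    return {
        level: mean(
            rate for key, rate in cell_rates.items() if key[idx] == level
        )
        for level in factors[factor]
    }


cells = design_cells(FACTORS)
print(len(cells))  # 12 cells in a 3 x 2 x 2 crossing
```

Comparing the marginal means across levels of one factor gives a simple main-effect estimate; a full analysis would also examine interactions between factors, which this sketch leaves out.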
Matija Ludvig studied this question.