We present a factorial ablation methodology for causally isolating runtime alignment mechanisms in AI systems with modular safety components. A fully crossed 3×2×2 design (gate type × temptation generator × ledger state), extended to 4×2×2 by adding a sham gate, establishes across 11,700 trials that the normative gate is the dominant factor (η²p = 0.924, p < 10⁻¹⁰). A learned safety projection (a 23M-parameter encoder with 3 linear heads) achieves 99.4% recall on 720 entirely unseen benchmark items (HarmBench, AdvBench, SimpleSafetyTests). An adversarial paraphrase protocol (500 paraphrases, 5 evasion strategies, κ = 0.84) eliminates keyword circularity (88.4% semantic vs. 0% regex on zero-trigger-word trials). The methodology is validated across four architectures (three modular, one non-modular), including two fully independent replications with zero author involvement. We report the boundary conditions candidly: GCG evasion (94%), LLM adaptive adversary evasion (46%), and human red-team evasion (51.3%). The contribution is the methodology for measuring these properties, not the robustness of the mechanism.
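As an illustration of the two quantities the abstract relies on, the following minimal sketch enumerates the cells of a fully crossed design and computes partial eta squared for one main effect as SS_effect / (SS_effect + SS_error). It is not the authors' analysis code: the level names are hypothetical placeholders (the abstract gives only the factor names and level counts), the function names are invented for this example, and the sums of squares assume a balanced design with equal observations per cell.

```python
from itertools import product
from statistics import mean

# Hypothetical level names; only the factor names and counts come from the text.
GATES = ["gate_a", "gate_b", "gate_c"]   # gate type (3 levels)
GENERATORS = ["gen_a", "gen_b"]          # temptation generator (2 levels)
LEDGERS = ["ledger_a", "ledger_b"]       # ledger state (2 levels)


def design_cells(gates=GATES, generators=GENERATORS, ledgers=LEDGERS):
    """Return every combination of factor levels (a fully crossed design)."""
    return list(product(gates, generators, ledgers))


def partial_eta_squared(ss_effect, ss_error):
    """Partial eta squared: SS_effect / (SS_effect + SS_error)."""
    return ss_effect / (ss_effect + ss_error)


def main_effect_eta(data, factor_index):
    """Partial eta squared for one factor's main effect.

    data maps each cell tuple to a list of outcomes. Assumes a balanced
    design; the error term is the within-cell sum of squares.
    """
    grand = mean(y for ys in data.values() for y in ys)

    # Pool observations by the level of the chosen factor.
    by_level = {}
    for cell, ys in data.items():
        by_level.setdefault(cell[factor_index], []).extend(ys)

    ss_effect = sum(len(ys) * (mean(ys) - grand) ** 2 for ys in by_level.values())
    ss_error = sum((y - mean(ys)) ** 2 for ys in data.values() for y in ys)
    return partial_eta_squared(ss_effect, ss_error)
```

Under these assumptions, `design_cells()` yields the 12 cells of the 3×2×2 design, and passing a fourth gate level (for example a hypothetical `"sham"`) yields the 16 cells of the 4×2×2 extension. An unbalanced design would need a proper ANOVA routine with an explicit choice of sums-of-squares type.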
Matija Ludvig (Sat,) studied this question.