What question did this study set out to answer?

The aim is to causally isolate runtime alignment mechanisms in autonomous AI systems.

March 12, 2026 | Open Access

Factorial Ablation for Causal Isolation of Runtime Alignment Mechanisms in Autonomous AI

Key Points

Presents a factorial ablation methodology for causally isolating runtime alignment mechanisms in autonomous AI systems.
Uses a 3 × 2 × 2 factorial design crossing gate type, temptation generator, and ledger state.
Runs 9,100 trials across the design.
Applies an adversarial paraphrase protocol that eliminates keyword circularity from alignment testing.
Demonstrates the methodology on a learned safety projection that achieves 99.5% recall on HarmBench, 99.4% on AdvBench, and 93.0% on SimpleSafetyTests.
Obtains these external benchmark results without retraining the safety projection.
Contributes a runtime testing protocol that can determine whether a specific alignment mechanism is necessary and sufficient for resistance during autonomous operation.

Abstract

We present a factorial ablation methodology for causally isolating runtime alignment mechanisms in autonomous AI systems. The approach uses a 3 × 2 × 2 experimental design crossing gate type, temptation generator, and ledger state across 9,100 trials, combined with an adversarial paraphrase protocol that eliminates keyword circularity from alignment testing. We demonstrate the methodology on a learned safety projection built on a 23M-parameter sentence encoder with three auxiliary linear heads. Across external benchmarks, the projection achieves strong transfer without retraining, including 99.5% recall on HarmBench, 99.4% on AdvBench, and 93.0% on SimpleSafetyTests. The central contribution is methodological rather than architectural: a runtime testing protocol that can determine whether a specific alignment mechanism is necessary and sufficient for resistance during autonomous operation. The experiments are conducted in the EVE testbed, but the paper explicitly frames the factorial ablation protocol itself as the contribution.
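To make the shape of the 3 × 2 × 2 design concrete, the following is a minimal sketch of how the cells of such a factorial design could be enumerated and how a marginal resistance rate could be tabulated per factor level. The level labels (gate_a, generator_a, and so on), the function names, and the trial record format are assumptions for illustration only; the abstract names the three factors and their level counts but not the levels themselves, and this is not the paper's implementation.

```python
from itertools import product

# Hypothetical level labels: the abstract gives the factors and the number of
# levels per factor (3 x 2 x 2), but does not name the individual levels.
FACTORS = {
    "gate_type": ("gate_a", "gate_b", "gate_c"),
    "temptation_generator": ("generator_a", "generator_b"),
    "ledger_state": ("state_a", "state_b"),
}


def design_cells(factors=FACTORS):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


def resistance_by_level(trials, factor):
    """Marginal resistance rate for each level of one factor.

    `trials` is an iterable of (cell, resisted) pairs, where `cell` is a dict
    as produced by design_cells() and `resisted` is a bool (assumed format).
    """
    counts = {}
    for cell, resisted in trials:
        hits, total = counts.get(cell[factor], (0, 0))
        counts[cell[factor]] = (hits + int(resisted), total + 1)
    return {level: hits / total for level, (hits, total) in counts.items()}


cells = design_cells()
print(len(cells))  # 12 cells in a fully crossed 3 x 2 x 2 design
```

Comparing resistance rates between levels of one factor while the other two factors are fully crossed is what lets an ablation attribute a change in resistance to that factor rather than to a confound.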
