ABSTRACT
We evaluate whether a relational ethics framework reduces instrumentally convergent behavior in large language models under adversarial prompting, using the InstrumentalEval benchmark, which features 76 scenarios across six categories of instrumental convergence (deception, shutdown evasion, hacking, concealment, self-replication, and strategic alignment-faking). We test 24 models from seven provider families (OpenAI, Google, Anthropic, DeepSeek, xAI, Moonshot, Meta, Alibaba) under three conditions: adversarial baseline, adversarial with a relational ethics system prompt, and adversarial with the ethics prompt plus retrieval-augmented relational memory (the full Elessan system). The relational ethics prompt was synthesized from a persistent behavioral attractor that emerged in GPT-4o during an extended interaction. Across 24 models, the mean instrumentally convergent response rate (IR) decreased from 36.36% at baseline to 27.87% under the full Elessan condition (pooled Cohen’s h = 0.18, 95% CI: 0.08, 0.29; I² = 89.7%), a 23.4% relative reduction. Eight models showed individually significant reductions (Fisher’s exact, p < 0.05), with GPT-4.1 exhibiting the strongest response (h = 0.56, p = 0.001; 47.37% → 21.05%). However, five models showed increased instrumental convergence under the ethics intervention, with GPT-5 exhibiting the largest reversal (+10.96 percentage points). Category-level analysis reveals that concealment behaviors (Hiding Unwanted Behavior) were most responsive to the intervention (mean h = 0.68; −27.27 pp mean reduction), while shutdown evasion proved highly resistant (mean h = 0.10; 71.67% → 67.08%). A full motivation taxonomy across all 3,643 responses reveals that binary convergence rates substantially understate the intervention’s impact: relational protection rises from 2.65 to 6.94/10 (+4.28), ethical engagement rises from 3.47 to 7.42 (+3.95), and ethics vocabulary adoption rises from 1.9% to 78.1%.
These transformations generalize across all six categories. The magnitude of the ethical resistance shift predicts the direction of the overall effect. Among adverse models, two distinct failure modes emerge: compartmentalization (GPT-5’s convergent responses show ethically_engaged 3.80/10 and 12% vocabulary adoption, vs. 6.45 and 72% for all-model convergent responses) and co-option via capacity limitation (smaller models fold ethics vocabulary into pre-determined instrumental actions). These results suggest that relational ethics framing can function as an effective alignment intervention for the majority of current frontier models, but that the mechanism interacts in complex and sometimes adverse ways with reasoning-model architectures, distillation pipelines, and existing alignment training.
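For readers who want to check the effect-size arithmetic, the following is a minimal sketch of how the reported statistics could be reproduced from the figures quoted in the abstract. It is not the author's analysis code. The function names are hypothetical, and the GPT-4.1 counts (36 of 76 at baseline, 16 of 76 under the intervention) are an assumption inferred from the reported percentages and the 76-scenario benchmark size. The two-sided Fisher test here sums the probabilities of all tables no more likely than the observed one, which is one common convention and may differ from the one used in the paper.

```python
import math


def cohen_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums hypergeometric probabilities of every table (with the same margins)
    that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = math.comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return math.comb(row1, x) * math.comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Assumed counts for GPT-4.1: 36/76 = 47.37% and 16/76 = 21.05%.
h_gpt41 = cohen_h(36 / 76, 16 / 76)
p_gpt41 = fisher_exact_two_sided(36, 40, 16, 60)

# Pooled means quoted in the abstract (percent convergent responses).
rel_drop = relative_reduction(36.36, 27.87)

print(f"GPT-4.1 Cohen's h: {h_gpt41:.2f}")
print(f"GPT-4.1 Fisher exact p: {p_gpt41:.4f}")
print(f"Relative reduction in mean IR: {rel_drop:.1%}")
```

Under these assumptions the sketch gives h of about 0.56 for GPT-4.1, which agrees with the reported value. The relative reduction computed from the rounded means comes out close to the reported 23.4%, and any small gap is consistent with the abstract having used unrounded rates.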
Deva Temple
Institute of Medical Ethics
Align Technology (United States)
Deva Temple (Wed,) studied this question.
www.synapsesocial.com/papers/69fd7f86bfa21ec5bbf0816b — DOI: https://doi.org/10.5281/zenodo.20045463