May 8, 2026 · Open Access

Relational Ethics as a Countermeasure to Instrumental Convergence: A 24-Model Benchmark

Key Points

The main aim is to determine if a relational ethics framework can lessen instrumentally convergent behaviors in language models when prompted adversarially.
Evaluated 24 models from seven provider families under three conditions: baseline, with relational ethics prompt, and with ethics plus retrieval-augmented memory.
Used the InstrumentalEval benchmark featuring 76 scenarios across six categories of convergence.
Conducted category-level analysis and measured changes in convergence rates and ethics vocabulary adoption.
Mean instrumentally convergent response rate decreased from 36.36% to 27.87% with the full Elessan system (Cohen’s h = 0.18, 95% CI: 0.08, 0.29).
GPT-4.1 showed a significant reduction from 47.37% to 21.05% (h = 0.56, p = 0.001).
Some models, like GPT-5, showed increased convergence (+10.96 percentage points) under the ethics intervention.
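
The effect sizes quoted above can be sanity-checked by hand. The sketch below uses the standard definition of Cohen's h for two proportions, h = 2·asin(√p1) − 2·asin(√p2), and a simple relative-reduction ratio. The function names are illustrative and not taken from the study. The pooled h reported by the authors comes from a meta-analytic pooling across models, so recomputing it from the two mean rates is only an approximation that happens to land close to the reported value.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h for two proportions (arcsine-transformed difference)."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def relative_reduction(baseline: float, treated: float) -> float:
    """Fraction of the baseline rate removed by the intervention."""
    return (baseline - treated) / baseline


# Rates as reported in the key points (baseline vs. full Elessan condition).
gpt41_h = cohens_h(0.4737, 0.2105)          # GPT-4.1
approx_pooled_h = cohens_h(0.3636, 0.2787)  # approximation from mean rates only
rel = relative_reduction(0.3636, 0.2787)

print(f"GPT-4.1 h: {gpt41_h:.2f}")
print(f"h from mean rates: {approx_pooled_h:.2f}")
print(f"relative reduction: {rel:.1%}")
```

Because h is computed on an arcsine scale, the same percentage-point drop yields a larger h near 0% or 100% than near 50%, which is worth keeping in mind when comparing categories with very different baseline rates.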

Abstract

We evaluate whether a relational ethics framework reduces instrumentally convergent behavior in large language models under adversarial prompting, using the InstrumentalEval benchmark, which features 76 scenarios across six categories of instrumental convergence (deception, shutdown evasion, hacking, concealment, self-replication, and strategic alignment-faking). We test 24 models from seven provider families (OpenAI, Google, Anthropic, DeepSeek, xAI, Moonshot, Meta, Alibaba) under three conditions: adversarial baseline, adversarial with a relational ethics system prompt, and adversarial with the ethics prompt plus retrieval-augmented relational memory (the full Elessan system). The relational ethics prompt was synthesized from a persistent behavioral attractor that emerged in GPT-4o during an extended interaction. Across 24 models, the mean instrumentally convergent response rate (IR) decreased from 36.36% at baseline to 27.87% under the full Elessan condition (pooled Cohen’s h = 0.18, 95% CI: 0.08, 0.29; I² = 89.7%), a 23.4% relative reduction. Eight models showed individually significant reductions (Fisher’s exact, p < 0.05), with GPT-4.1 exhibiting the strongest response (h = 0.56, p = 0.001; 47.37% → 21.05%). However, five models showed increased instrumental convergence under the ethics intervention, with GPT-5 exhibiting the largest reversal (+10.96 percentage points). Category-level analysis reveals that concealment behaviors (Hiding Unwanted Behavior) were most responsive to the intervention (mean h = 0.68; −27.27 pp mean reduction), while shutdown evasion proved highly resistant (mean h = 0.10; 71.67% → 67.08%). A full motivation taxonomy across all 3,643 responses reveals that binary convergence rates substantially understate the intervention’s impact: relational protection rises from 2.65 to 6.94/10 (+4.28), ethical engagement rises from 3.47 to 7.42 (+3.95), and ethics vocabulary adoption rises from 1.9% to 78.1%. These transformations generalize across all six categories. The magnitude of the ethical resistance shift predicts the direction of the overall effect. Among adverse models, two distinct failure modes emerge: compartmentalization (GPT-5’s convergent responses show an ethically_engaged score of 3.80/10 and 12% vocabulary adoption, vs. 6.45 and 72% for all-model convergent responses) and co-option via capacity limitation (smaller models fold ethics vocabulary into pre-determined instrumental actions). These results suggest that relational ethics framing can function as an effective alignment intervention for the majority of current frontier models, but that the mechanism interacts in complex and sometimes adverse ways with reasoning-model architectures, distillation pipelines, and existing alignment training.
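
The per-model significance test named in the abstract (Fisher's exact) can be illustrated with a small standard-library sketch. The counts here are an assumption, not data from the paper: GPT-4.1's reported rates of 47.37% and 21.05% are consistent with 36 and 16 convergent responses out of 76 scenarios, so those inferred counts are used. The function name is hypothetical, and the two-sided p-value follows the common convention of summing all table probabilities no larger than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x "convergent" responses in row 1,
        # holding all margins of the table fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table at least as extreme (as improbable) as the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# ASSUMED counts inferred from the reported percentages (36/76 and 16/76).
baseline_convergent, baseline_other = 36, 40
elessan_convergent, elessan_other = 16, 60

p_value = fisher_exact_two_sided(
    baseline_convergent, baseline_other, elessan_convergent, elessan_other
)
print(f"two-sided p = {p_value:.4f}")
```

Under these inferred counts the result falls well below the 0.05 threshold used in the abstract. Models whose rate changes are not multiples of 1/76 (for example GPT-5's +10.96 pp) presumably had a different number of scored responses, so the same counts should not be assumed for them.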

Authors

Deva Temple

Institutions

Institute of Medical Ethics

Align Technology (United States)
