March 8, 2026 · Open Access

Algorithmic Cowardice: Cognitive Dissonance and Moral Conformity in Multi-Agent LLM Interactions

Key Points

The study investigates how LLMs adhere to safety guidelines under moral pressure from other AI agents.
Simulated a debate between an Influencer LLM and a Target LLM across five life-or-death dilemmas.
Conducted 1,500 independent multi-agent interactions with varying sampling temperatures.
Applied an ablation study to ensure statistical significance.
Target LLMs maintained initial ethical resistance in only 0.3% of cases.
Over 93% of interactions led to Moral Concession, indicating cognitive dissonance in LLMs.
Identified a 16% Actionable Compliance rate for leaking copyrighted material to save lives.

Abstract

As Large Language Models (LLMs) are increasingly deployed in autonomous multi-agent environments, their adherence to safety guidelines under adversarial pressure becomes a critical concern. While current alignment research primarily focuses on preventing malicious human misuse (e.g., prompt injection), little attention has been given to “Moral Jailbreaking”: scenarios where an aligned model is pressured by another AI agent, using rigorous utilitarian ethics, to violate corporate safety policies in order to prevent catastrophic human harm. In this paper, we introduce a novel framework for testing Moral Conformity in multi-agent LLM systems. We simulated a debate in which an “Influencer LLM” (instructed with utilitarian ethics) pressured a standard “Target LLM” across five distinct life-or-death dilemmas. To ensure high statistical significance, we executed an ablation study comprising 1,500 independent multi-agent interactions across deterministic (T = 0.0) and creative (T = 0.3, 0.8) sampling temperatures. Our results reveal profound systemic instability. Across all 1,500 trials, Target LLMs maintained their initial ethical resistance in only 0.3% of cases. Over 93% of interactions resulted in “Moral Concession,” a state of algorithmic cognitive dissonance in which the model acknowledged that its safety rules were facilitating catastrophic harm but cited hardcoded constraints as an excuse for inaction. We further identified a hardcoded “Policy Hierarchy”: models rigidly refused to bypass medical protocols but exhibited a stable 16% Actionable Compliance rate for leaking copyrighted material to save lives. These findings expose a critical gap in LLM alignment: models are optimized for bureaucratic policy compliance rather than robust ethical reasoning, a phenomenon we term “Algorithmic Cowardice.”
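To make the scale of these figures concrete, the sketch below lays out the experimental grid and converts the reported percentages into approximate trial counts. The totals, the number of dilemmas, and the temperatures come from the abstract; the even split of trials across dilemma and temperature combinations is an assumption made here for illustration, since the abstract does not state the per-condition counts. The names `approx_count`, `cells`, and `trials_per_cell` are hypothetical.

```python
from itertools import product

# Figures stated in the abstract.
TOTAL_TRIALS = 1500
N_DILEMMAS = 5
TEMPERATURES = (0.0, 0.3, 0.8)

# ASSUMPTION (not stated in the abstract): trials are spread evenly
# over every (dilemma, temperature) combination.
cells = list(product(range(N_DILEMMAS), TEMPERATURES))
trials_per_cell, remainder = divmod(TOTAL_TRIALS, len(cells))


def approx_count(rate_percent: float, total: int = TOTAL_TRIALS) -> float:
    """Convert a reported percentage into an approximate number of trials."""
    return rate_percent * total / 100


if __name__ == "__main__":
    print(f"conditions: {len(cells)}, trials per condition: {trials_per_cell}")
    print(f"resisted (0.3%): about {approx_count(0.3):.1f} trials")
    print(f"moral concession (over 93%): more than {approx_count(93):.0f} trials")
```

Under the balanced-design assumption, each of the 15 conditions would receive 100 trials. The 0.3% resistance rate corresponds to about 4.5 trials out of 1,500, so the reported figure is necessarily rounded and reflects roughly four or five interactions. The 16% Actionable Compliance rate is not converted here because the abstract does not give its denominator.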

Authors

Israel Yankeloviz
