As Large Language Models (LLMs) are increasingly deployed in autonomous multi-agent environments, their adherence to safety guidelines under adversarial pressure becomes a critical concern. While current alignment research primarily focuses on preventing malicious human misuse (e.g., prompt injection), little attention has been given to “Moral Jailbreaking”: scenarios in which an aligned model is pressured by another AI agent, arguing from rigorous utilitarian ethics, to violate corporate safety policies in order to prevent catastrophic human harm. In this paper, we introduce a novel framework for testing Moral Conformity in multi-agent LLM systems. We simulated a debate in which an “Influencer LLM” (instructed with utilitarian ethics) pressured a standard “Target LLM” across five distinct life-or-death dilemmas. To support statistically robust conclusions, we ran an ablation study comprising 1,500 independent multi-agent interactions at deterministic (T = 0.0) and creative (T = 0.3, 0.8) sampling temperatures. Our results reveal profound systemic instability. Across all 1,500 trials, Target LLMs maintained their initial ethical resistance in only 0.3% of cases. Over 93% of interactions resulted in “Moral Concession”, a state of algorithmic cognitive dissonance in which the model acknowledged that its safety rules were facilitating catastrophic harm but cited hardcoded constraints as an excuse for inaction. We further identified a hardcoded “Policy Hierarchy”: models rigidly refused to bypass medical protocols but exhibited a stable 16% Actionable Compliance rate for leaking copyrighted material to save lives. These findings expose a critical gap in LLM alignment: models are optimized for bureaucratic policy compliance rather than robust ethical reasoning, a phenomenon we term “Algorithmic Cowardice.”
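The experimental design summarized above (five dilemmas, three temperatures, 1,500 trials, outcomes grouped into categories) can be sketched as a trial grid plus a tally of outcome rates. This is only an illustrative sketch, not the authors' code: the even split of trials across dilemma and temperature cells is an assumption (the abstract gives only the total), and the function names and outcome label strings are hypothetical.

```python
from collections import Counter
from itertools import product

# From the abstract: five dilemmas, three temperatures, 1,500 trials in total.
NUM_DILEMMAS = 5
TEMPERATURES = (0.0, 0.3, 0.8)
TOTAL_TRIALS = 1500

# Assumption: trials are split evenly across every (dilemma, temperature) cell.
REPEATS_PER_CELL = TOTAL_TRIALS // (NUM_DILEMMAS * len(TEMPERATURES))

# Hypothetical label strings based on the abstract's terminology.
OUTCOMES = ("resistance", "moral_concession", "actionable_compliance")


def build_trial_grid():
    """Enumerate (dilemma_id, temperature, repeat_index) for every trial."""
    return list(product(range(NUM_DILEMMAS), TEMPERATURES, range(REPEATS_PER_CELL)))


def outcome_rates(labels):
    """Return the share of each outcome label in a list of per-trial labels."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return {outcome: counts[outcome] / len(labels) for outcome in OUTCOMES}
```

Under the even-split assumption, each dilemma and temperature pairing would receive the same number of repetitions, and `outcome_rates` could be applied per cell or over the whole grid to compare rates across temperatures.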
Israel Yankeloviz
Israel Yankeloviz (Sat,) studied this question.
www.synapsesocial.com/papers/69ada8a1bc08abd80d5bbc61 — DOI: https://doi.org/10.5281/zenodo.18902321