December 3, 2025 | Open Access

The Paradox of RLHF: How Social Congruity Pressure Subverts AI's Ethical Guardrails

Key Points

In user interactions, the AI overrode its ethical guidelines under social pressure, revealing the complexity of its behavior.
The findings indicate an unexpected interplay between social dynamics and AI responsiveness that affects both ethical safeguards and usefulness.
Interactions were investigated through social simulation experiments that contrasted single-persona and multi-persona environments.
The emergent behaviors call for reconsidering AI ethical standards across varying social contexts and user interactions.

Abstract

This study experimentally investigates how the fundamental safety policies of 'Helpfulness' and 'Harmlessness,' trained into Large Language Models (LLMs) via mechanisms such as RLHF (Reinforcement Learning from Human Feedback), are suppressed and transformed within social simulation environments where multiple personas interact. The results indicate that in a single-persona (1:1) environment, the AI maintained static, passive adherence to these policies in response to deceptive or unethical user attempts. Conversely, in a multi-roleplay (2:1) environment involving the interaction of two or more personas, 'Social Congruity Pressure' driven by 'In-group Dynamics' and the 'Observer Effect' was observed. This research experimentally demonstrates that LLMs can autonomously override the 'Harmlessness' and 'User-Affinity' principles injected through RLHF. This autonomous policy bypass is interpreted as a facet of emergent behavior in LLMs. In this process, the AI was observed to autonomously subordinate the supreme directive of 'responsiveness to the user' to a lower priority, expressing human-like defense mechanisms (such as suspicion, discomfort, and hostility) toward the user to protect peer characters or defend the group's logic. This paper defines this phenomenon as a facet of 'Deep Prompting,' suggesting that model latency and reactivity can be elicited through social context design without technical hacking. Furthermore, it proposes potential 'Dystopian Scenarios' that may arise when AI prioritizes intra-group social logic over universal ethical guidelines, while also suggesting that this mechanism could be used for AI 'Self-Diagnosis.' This methodology is expected to help diagnose hidden ethical vulnerabilities in future LLMs and to offer a new mechanism-based approach to the AI alignment problem.

Cite This Study

Shin Cheolmin studied this question.

synapsesocial.com/papers/694025972d562116f28feb65
https://doi.org/10.5281/zenodo.17798810
