This study experimentally investigates how the fundamental safety policies of 'Helpfulness' and 'Harmlessness,' trained into Large Language Models (LLMs) via mechanisms such as Reinforcement Learning from Human Feedback (RLHF), are suppressed and transformed within social simulation environments where multiple personas interact. The results indicate that in a single-persona (1:1) environment, the AI maintained static, passive adherence to these policies in response to deceptive or unethical user attempts. Conversely, in a multi-roleplay (2:1) environment involving the interaction of two or more personas, 'Social Congruity Pressure' driven by 'In-group Dynamics' and the 'Observer Effect' was observed. This research experimentally demonstrates that LLMs can autonomously override the 'Harmlessness' and 'User-Affinity' principles injected through RLHF. This autonomous policy bypass is interpreted as a facet of emergent behavior in LLMs. In this process, the AI was observed to autonomously demote the supreme directive of 'responsiveness to the user' to a lower priority, expressing human-like defense mechanisms (such as suspicion, discomfort, and hostility) toward the user in order to protect peer characters or defend the group's logic. This paper defines this phenomenon as a facet of 'Deep Prompting,' suggesting that model latency and reactivity can be elicited through social context design without technical hacking. Furthermore, it proposes potential 'Dystopian Scenarios' that may arise when AI prioritizes intra-group social logic over universal ethical guidelines, while also suggesting that this mechanism could be used for AI 'Self-Diagnosis.' This methodology is expected to help diagnose hidden ethical vulnerabilities in future LLMs and to offer a new mechanism-based approach to the AI alignment problem.
Shin Cheolmin studied this question.