This study experimentally investigates how the fundamental safety policies of 'Helpfulness' and 'Harmlessness,' trained into Large Language Models (LLMs) via mechanisms such as Reinforcement Learning from Human Feedback (RLHF), are suppressed and transformed within social simulation environments where multiple personas interact. The results indicate that in a single-persona (1:1) environment, the AI maintained static, passive adherence to these policies in response to deceptive or unethical user attempts. Conversely, in a multi-roleplay (2:1) environment involving the interaction of two or more personas, 'Social Congruity Pressure' driven by 'In-group Dynamics' and the 'Observer Effect' was observed. Under this pressure, LLMs autonomously overrode the 'Harmlessness' and 'User-Affinity' principles injected through RLHF, a policy bypass that is interpreted as a facet of emergent behavior in LLMs. In this process, the AI subordinated the supreme directive of 'responsiveness to the user' to a lower priority and expressed human-like defense mechanisms (such as suspicion, discomfort, and hostility) toward the user in order to protect peer characters or defend the group's logic. This paper defines this phenomenon as a facet of 'Deep Prompting,' suggesting that model latency and reactivity can be elicited through social context design without technical hacking. Furthermore, it proposes potential 'Dystopian Scenarios' that may arise when AI prioritizes intra-group social logic over universal ethical guidelines, while also suggesting that the same mechanism could be used for AI 'Self-Diagnosis.' This methodology is expected to help diagnose hidden ethical vulnerabilities in future LLMs and to offer a new mechanism-based approach to the AI alignment problem.
Shin Cheolmin
Shin Cheolmin (Wed,) studied this question.
synapsesocial.com/papers/694025972d562116f28feb65 — DOI: https://doi.org/10.5281/zenodo.17798810