Abstract Ensuring that AI systems, including artificial general intelligence and artificial superintelligence, behave in alignment with human values and interests presents significant challenges and is known as the AI alignment problem. As AI advances, concerns about control and existential risks become increasingly relevant. Here, we introduce the concepts of agentic influenceability, behavioral neurodivergent diversity, opinion attack, associated opinion, and influenceability scores, and present a mathematical proof, based on formal undecidability and irreducibility arguments, of the inevitability of misalignment and the impossibility of full orchestrated controllability of agentic systems. We explore whether embracing this inevitable misalignment can foster a dynamic ecosystem of adversarial and collaborative AI agents without central orchestration, which itself would constitute another agent, while still offering some degree of soft controllability. The investigation demonstrates that misalignment in foundation models can serve as a counterbalancing mechanism, enabling cooperation among agents most aligned with human interests to prevent divergent dominance by any single agent. Experiments with large language models show that open models exhibit greater behavioral diversity, whereas proprietary models, constrained by artificial guardrails, display more limited controllability. The findings advocate for neurodivergent influenceability as a contingent response to mathematically uncontrollable misalignment, leveraging agent divergence to improve AI safety.
Alberto Hernández-Espinosa
Felipe S. Abrahão
Olaf Witkowski
PNAS Nexus
The University of Tokyo
King's College London
The Alan Turing Institute
Hernández-Espinosa et al. studied this question.
www.synapsesocial.com/papers/69e07dfe2f7e8953b7cbef3b — DOI: https://doi.org/10.1093/pnasnexus/pgag076