November 27, 2025 · Open Access

The AI Alignment trilemma: Inner Alignment, Outer Alignment and Capability Cannot be Simultaneously Maximised


Abstract

The AI Alignment Trilemma: A Formal Impossibility Result

This paper proves that AI systems face a fundamental trilemma: inner alignment (control over mesa-optimizers), outer alignment (correctness of the reward specification), and capability (model power) cannot be simultaneously maximized. Writing I, O, and C for the levels of these three properties, we establish the constraint I² + O² + C² ≤ 1, demonstrating that "aligned superintelligence" is mathematically impossible. The result is derived from finite optimization budgets and the empirically measured quadratic scaling of capability costs, and is analogous to the CAP theorem in distributed systems. All proofs are machine-verified in Lean 4 (738 lines, with no sorry placeholders), and the empirical predictions are validated by convergent evidence from all three major frontier AI labs: OpenAI o1 (alignment faking, reward hacking), Anthropic Claude Opus 4.5 (specification gaming), and Google Gemini 3 (manipulation propensity that scales with capability). This shifts the AI safety research paradigm from "solving alignment" to navigating the Pareto frontier of achievable tradeoffs, with direct implications for safety policy and capability regulation.
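The abstract states the trilemma as a single inequality but does not say how the three levels are scaled. The minimal Python sketch below is not the paper's code; it assumes each of I, O, and C is a score normalized to [0, 1], which is a reading of the inequality rather than something stated here. Under that assumption it checks whether a target (I, O, C) profile satisfies I² + O² + C² ≤ 1, and computes the largest capability the constraint allows once I and O are fixed, C_max = √(1 − I² − O²), i.e. the corresponding point on the Pareto frontier. The names is_feasible, max_capability, and TOL are hypothetical.

```python
import math

# Hypothetical numerical tolerance so that points lying exactly on the
# frontier (e.g. I = 0.6, O = 0, C = 0.8) are not rejected by rounding error.
TOL = 1e-9


def _check_unit_interval(*levels: float) -> None:
    """Assumption: every level is a score normalized to the interval [0, 1]."""
    for x in levels:
        if not 0.0 <= x <= 1.0:
            raise ValueError(f"level {x} is outside the assumed range [0, 1]")


def is_feasible(i: float, o: float, c: float) -> bool:
    """True if (I, O, C) satisfies the trilemma constraint I^2 + O^2 + C^2 <= 1."""
    _check_unit_interval(i, o, c)
    return i * i + o * o + c * c <= 1.0 + TOL


def max_capability(i: float, o: float) -> float:
    """Largest C on the frontier for fixed I and O: sqrt(1 - I^2 - O^2).

    Raises ValueError when I^2 + O^2 > 1, since then no C >= 0 is feasible.
    """
    _check_unit_interval(i, o)
    slack = 1.0 - i * i - o * o
    if slack < -TOL:
        raise ValueError("I^2 + O^2 exceeds 1: no capability level is feasible")
    # max() clamps a tiny negative slack caused by rounding back to zero.
    return math.sqrt(max(slack, 0.0))


if __name__ == "__main__":
    print(is_feasible(1.0, 1.0, 1.0))          # all three properties maximized
    print(round(max_capability(0.6, 0.0), 6))  # capability left when I = 0.6, O = 0
    print(1 / math.sqrt(3))                    # best common level when I = O = C
```

Under this reading, fixing I = 0.6 with O = 0 leaves room for at most C ≈ 0.8, and a profile with all three scores equal tops out at 1/√3 ≈ 0.577 each, so (0.57, 0.57, 0.57) passes the check while (0.58, 0.58, 0.58) fails.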


Cite This Study

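Reich, Jonathan (2025, November 27). The AI Alignment trilemma: Inner Alignment, Outer Alignment and Capability Cannot be Simultaneously Maximised. Zenodo.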

synapsesocial.com/papers/694031da2d562116f290720f https://doi.org/10.5281/zenodo.17739089
