What question did this study set out to answer?

This work examines the limitations of AI alignment when actions cannot be undone and values are closely matched.

June 18, 2026 · Open Access

Alignment as Consensus under Irreversibility: a decision-theoretic case for dialogue over verification

Key Points

Analyzes the dominant operational picture of AI alignment, specification-and-verification.
Introduces two floors that limit what an agent can learn from its own actions under noise and irreversibility.
Develops a decision-theoretic case for dialogue over verification.
Identifies the two limits as an identifiability floor and an irreversibility floor.
Argues that an external A-channel (asking, eliciting values, trialing before committing) lifts both floors.
Argues that deference and selective transparency are central to selecting a legitimate consensus.

Abstract

The dominant operational picture of AI alignment is specification-and-verification: fix a target, then test for conformance. We argue this picture is sound only where misalignment is coarse and actions are recoverable — and that the hardest problems lie where values are near-tied and noisily observed and actions cannot be undone. There, two floors limit what an agent can learn from its own actions: an identifiability floor, (σ²/Δ²)·log T, for resolving a decision-relevant gap Δ under noise σ, and an irreversibility floor of Ω(T) — an impossibility, not a rate — when the only informative action is itself the unrecoverable commitment. Both are lifted by one device: an external A-channel (asking, eliciting values, trialing before committing) that substitutes for the ability to undo. The resulting 2×2 is proven, in its irreversible half, in a companion matching-market paper backed by a reproducible artifact. Read into alignment, the spine forces a relational picture: deference becomes the uniquely learnable strategy, not an imposed constraint; transparency should be selective, because surveillance destroys the channel it monitors; legitimate persuasion is separated from manipulation by reflective endorsement, a criterion estimable but not certifiable; and collective alignment becomes the selection of a legitimate consensus under a small hard floor of irreversibility prohibitions, within which minority protection and ruin-avoidance are one principle. We are explicit about what is proven, what is argued, and what is left open: the relational picture is what a hard theorem leaves standing once the world is permitted to be irreversible and human values to be near-tied.
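
To give a feel for how the two floors in the abstract scale, here is a minimal Python sketch. It assumes the identifiability floor is read literally as (σ²/Δ²)·log T with all constants dropped, and it stands in for the Ω(T) irreversibility floor with a linear bound c·T, where the constant `c` is hypothetical. The function names and example values are illustrative and do not come from the paper.

```python
import math


def identifiability_floor(sigma: float, delta: float, horizon: int) -> float:
    """Order-of-magnitude floor (sigma^2 / delta^2) * log(T), constants omitted."""
    if sigma < 0 or delta <= 0 or horizon < 1:
        raise ValueError("need sigma >= 0, delta > 0, horizon >= 1")
    return (sigma ** 2 / delta ** 2) * math.log(horizon)


def irreversibility_floor(horizon: int, c: float = 1.0) -> float:
    """Linear stand-in for the Omega(T) floor; c is a hypothetical constant."""
    if horizon < 1:
        raise ValueError("need horizon >= 1")
    return c * horizon


if __name__ == "__main__":
    # Illustrative values only: noise sigma = 1.0, decision-relevant gap delta = 0.1.
    for T in (10**2, 10**4, 10**6):
        ident = identifiability_floor(1.0, 0.1, T)
        irrev = irreversibility_floor(T)
        print(f"T={T:>8}  identifiability~{ident:10.1f}  irreversibility~{irrev:10.1f}")
```

Under these assumptions, halving the gap Δ quadruples the identifiability floor, which still grows only logarithmically in T. The irreversibility stand-in grows in proportion to T regardless of the noise level.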
