May 13, 2026 · Open Access

The Containment Paradox: Intelligence-Asymmetry and the Limits of Unamplified Supervisory AI Safety

Key Points

Examines the limitations of current AI supervisory mechanisms under conditions of intelligence asymmetry
Analyzes the intelligence-parity assumption underlying much of alignment research
Proposes a framework addressing pre-parity and post-parity regimes
Explores implications of cognitive levels on supervisory containment
Highlights that effective containment requires accurate predictions
Suggests that a less intelligent system cannot, in general, adequately model a strictly more intelligent one
Positions an existing four-paper framework as a response to the challenges of the pre-parity regime

Abstract

Much of current alignment research operates under an assumption it rarely examines: that the system doing the containing is at least as intelligent as the system being contained. Red-teaming, proof-checking of alignment-relevant properties, interpretability, and sandboxing all inherit this structure, because their correctness depends on an overseer who can model what the contained system is doing. Non-predictive mechanisms, such as cryptographic commitments, hardware capability caps, and training-time myopia, do not inherit it in the same way, and whether they escape the asymmetry is treated as an open problem in Section 8. The supervisory toolkit works today because the human institutions behind it can, in principle, anticipate the failure modes of the systems they constrain. This paper argues that the parity assumption defines a regime, the pre-parity regime, whose boundary has barely been surveyed, and whose crossing would dissolve most current supervisory approaches rather than stress them. The argument proceeds in three moves. Durable supervisory containment requires prediction. Prediction requires a workable model of the constrained system. And a model adequate to a strictly more intelligent system cannot, in general, be built by the less intelligent one. The claim is scope-bounded: it applies to supervisory mechanisms whose correctness rests on an overseer's capacity at a fixed cognitive level I_H; it does not apply to proposals such as iterated amplification or debate where these preserve alignment, property fidelity, and absence of deceptive mesa-optimization through the amplification chain. Whether any actual amplification scheme achieves all three preservation properties is treated as the central open problem. Subject to that boundedness, the conjecture is that no supervisory apparatus whose effective cognitive level remains at I_H can sustain containment over a system of cognitive level I_A > I_H on relevant axes indefinitely.
The paper positions an existing four-paper framework (Understanding Before Ethics, Cognitive Understanding Architecture, Beyond the Stochastic Veil, and the Understanding-Aligned Intelligence Framework) as a coherent response to the pre-parity regime, and sketches what a post-parity alignment orientation would have to contain: structurally internal constraint rather than perimeter-based control, diagnostic evidence rather than behavioral certainty, and institutional oversight whose structure does not assume the contained system can be evaluated on the axes where it strictly dominates. A companion document (Framework Integration Roadmap) records the specific revisions the four framework papers will need in light of the conjecture.
