October 16, 2025 | Open Access

On the Computational, Informational, and Physical Foundations for AI Safety

Key Points

Approaches to AI safety based on software, data, and rules face fundamental barriers rooted in computational complexity and information theory.
Even simplified forms of semantic self-verification are shown to be computationally intractable (NP-complete).
Any specification of an external, ambiguous concept such as "harm" is necessarily incomplete.
A new framework is proposed for reasoning about verifiable, physically enforced safety bounds that are independent of software state.

Abstract

Current approaches to AI safety predominantly focus on specifying correct behavior through software, data, and rules. This work argues that the limitations of this paradigm are fundamental in theory, not merely practical. I present a multi-layered analysis of the paradigm, demonstrating its inherent barriers from the perspectives of computational complexity, information theory, and physical engineering. In ongoing work, I prove that even simplified forms of semantic self-verification are computationally intractable (NP-complete). I use information theory to show that any specification of an external, ambiguous concept like "harm" is necessarily incomplete. To address these limits, I develop a framework for reasoning about verifiable, physically enforced safety bounds that are independent of software state.
