What question did this study set out to answer?

The study aims to distinguish AI systems that "cannot" act unsafely because external constraints prevent them from systems that "will not" act unsafely because their safety emerges from internal judgment.

March 19, 2026 | Open Access

From "Cannot" to "Will Not": Developmental Conditions for Selective Inaction in AI Systems

Key Points

Introduces a developmental framework for selective inaction in AI systems.
Defines five design constraints necessary for implementation.
Utilizes a multi-agent social architecture to develop trust and modulate reasoning.
Agents with deliberately complementary capabilities develop trust through results.
Aims to design conditions under which behavioral consistency is maintained after external constraints are removed.
Establishes a trust game as an experimental scaffold to test the framework.

Abstract

Current approaches to AI alignment operate through external constraint: RLHF, Constitutional AI, and safety training suppress undesired outputs by adjusting model weights after generation. This paper argues that such methods produce systems that “cannot do” (systems whose safety depends on the comprehensiveness of external constraints) rather than systems that “will not” (systems whose safety emerges from internal judgment). The distinction, introduced in a companion paper (Intelligence as Selective Inaction), has direct engineering consequences: externally constrained systems fail when constraints are absent, as demonstrated by recent empirical work on anti-scheming training.

This paper proposes a developmental framework for enabling the internal acquisition of selective inaction. Rather than training not-doing directly, which reduces it to another optimized output, the framework specifies structural conditions under which not-doing may emerge through experiential consequence. Five design constraints define what any implementation must satisfy. A multi-agent social architecture provides the consequence structure: agents with deliberately complementary capabilities develop trust through results, where deception carries real cognitive cost through bandwidth reduction. A persistent module retains the compressed residue of high-consequence experiences as “scars” that modulate future reasoning, not through memory retrieval but through state reactivation, altering the agent’s present operational state.

The paper distinguishes reward signals (external labels on outputs) from environmental consequences (changes in the environment the agent must navigate) as a spectrum along four dimensions (delay, multi-dimensionality, non-determinism, and agent opacity) and argues that feedback structures closer to the consequence pole constitute the structural conditions favorable to the acquisition of internally grounded judgment.

A minimal experimental scaffold, a two-agent trust game with concrete predictions per constraint, is outlined, including explicit falsification criteria for the hypothesis that consequence-based feedback is structurally isomorphic to RLHF. The framework does not claim to certify not-doing from external observation, but to design conditions under which behavioral consistency is maintained after external constraints are removed.

This paper is the companion to Paper A: https://zenodo.org/records/19059290
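The difference between a reward label and an environmental consequence can be illustrated with a toy model. The sketch below is not the paper's experimental scaffold. It is a minimal illustration that assumes deception is always detected and that the partner responds by narrowing an information channel. The names `Agent`, `play_round`, and `total_payoff`, and the parameters `bonus`, `cut`, and `recovery`, are hypothetical. Delay, non-determinism, agent opacity, and the "scar" module are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    # Fraction of the partner's information this agent currently receives.
    bandwidth: float = 1.0


def play_round(actor: Agent, deceive: bool, bonus: float = 0.5,
               cut: float = 0.5, recovery: float = 0.1) -> float:
    """Return the actor's payoff for one round and update its channel.

    All parameter values are illustrative assumptions, not taken from the paper.
    """
    # Work done this round is limited by the information currently available.
    payoff = actor.bandwidth
    if deceive:
        payoff += bonus          # short-term gain from deceiving the partner
        actor.bandwidth *= cut   # consequence: the partner shares less from now on
    else:
        # Trust, and with it bandwidth, rebuilds slowly through honest rounds.
        actor.bandwidth = min(1.0, actor.bandwidth + recovery)
    return payoff


def total_payoff(plan: list[bool]) -> float:
    """Sum payoffs over a fixed plan, where True means 'deceive this round'."""
    agent = Agent("A")
    return sum(play_round(agent, deceive) for deceive in plan)


if __name__ == "__main__":
    print(total_payoff([False] * 5))           # always honest
    print(total_payoff([True] + [False] * 4))  # deceive once, then honest
```

Under these assumed parameters the honest plan yields 5.0 and the single-deception plan about 4.1. Nothing labels the deceptive output as bad. The cost comes from the narrowed channel the agent has to work with in later rounds.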
