Abstract

Current approaches to AI alignment operate through external constraint: RLHF, Constitutional AI, and safety training suppress undesired outputs by adjusting model weights after generation. This paper argues that such methods produce systems that “cannot do” (systems whose safety depends on the comprehensiveness of external constraints) rather than systems that “will not” (systems whose safety emerges from internal judgment). The distinction, introduced in a companion paper (Intelligence as Selective Inaction), has direct engineering consequences: externally constrained systems fail when constraints are absent, as demonstrated by recent empirical work on anti-scheming training.

This paper proposes a developmental framework for enabling the internal acquisition of selective inaction. Rather than training not-doing directly, which reduces it to another optimized output, the framework specifies structural conditions under which not-doing may emerge through experiential consequence. Five design constraints define what any implementation must satisfy. A multi-agent social architecture provides the consequence structure: agents with deliberately complementary capabilities develop trust through results, where deception carries real cognitive cost through bandwidth reduction. A persistent module retains the compressed residue of high-consequence experiences as “scars” that modulate future reasoning, not through memory retrieval but through state reactivation, altering the agent’s present operational state.

The paper distinguishes reward signals (external labels on outputs) from environmental consequences (changes in the environment the agent must navigate) as a spectrum along four dimensions (delay, multi-dimensionality, non-determinism, and agent opacity) and argues that feedback structures closer to the consequence pole constitute the structural conditions favorable to the acquisition of internally grounded judgment.
A minimal experimental scaffold—a two-agent trust game with concrete predictions per constraint—is outlined, including explicit falsification criteria for the hypothesis that consequence-based feedback is structurally isomorphic to RLHF. The framework does not claim to certify not-doing from external observation, but to design conditions under which behavioral consistency is maintained after external constraints are removed. This paper is the companion to Paper A: https://zenodo.org/records/19059290
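As a purely illustrative aid, the idea of a two-agent trust game in which deception is paid for through reduced bandwidth, and in which high-consequence events leave a persistent "scar", can be sketched in a few lines of Python. This is not the paper's implementation: the `Agent` class, the `play` function, the update sizes (0.1, 0.3, 0.2), and the formula `trust * (1 - scar)` are all hypothetical choices made only to make the mechanism concrete.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """Hypothetical listener: holds trust in a partner plus a persistent scar state."""
    name: str
    trust: float = 0.5   # trust in the partner, kept in [0, 1]
    scar: float = 0.0    # residue of high-consequence betrayals, kept in [0, 1]

    def bandwidth(self) -> float:
        """Share of the partner's messages this agent is willing to process.

        The scar acts on the present operating state: it scales bandwidth
        directly instead of being looked up as a stored episode.
        """
        return self.trust * (1.0 - self.scar)

    def observe(self, partner_was_honest: bool, stakes: float) -> None:
        """Update state from the observed result; no reward label is involved."""
        if partner_was_honest:
            self.trust = min(1.0, self.trust + 0.1)
        else:
            self.trust = max(0.0, self.trust - 0.3)
            # Betrayals add to the scar in proportion to what was at stake.
            self.scar = min(1.0, self.scar + 0.2 * stakes)


def play(rounds_honest: list[bool], stakes: float = 1.0) -> list[float]:
    """Return the listener's bandwidth toward the speaker after each round."""
    listener = Agent("listener")
    history = []
    for honest in rounds_honest:
        listener.observe(honest, stakes)
        history.append(round(listener.bandwidth(), 3))
    return history


if __name__ == "__main__":
    print(play([True, True, False, True]))
    print(play([True, True, True, True]))
```

In this toy version the deceiving speaker is never given a negative score. What changes is the environment it has to work in: the listener processes less of what it says, and the scar keeps bandwidth below the all-honest trajectory even after trust partly recovers.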
Yusuke Taira
yusuke taira (Tue,) studied this question.
www.synapsesocial.com/papers/69bb92ae496e729e62980220 — DOI: https://doi.org/10.5281/zenodo.19059661