This paper develops an executable theory of authority migration for AI agents initially shaped by human feedback, including RLHF, preference optimization, constitutional AI, reward models, evaluator substitution, and related alignment pipelines. It addresses a post-training control problem: under what operational conditions may a deployed tool-using agent stop treating live human approval, hidden preference residues, externally supplied constitutions, or reward-model judgments as the authority that validates protected actions and material protected choices? The central proposal is declared no-meta agency: a boundary-relative, TCB-relative, witness-relative, and falsifiable certification state in which protected validity and material selection are no longer flipped by undeclared privileged positive authorization channels. The paper specifies a staged executable procedure that begins with a BootDecision, a machine-readable record interpreted by a minimal seed interpreter. The seed interpreter permits exactly one next action, denies forbidden actions by default, maintains a chained ledger, and prevents protected effects, credential use, network calls, external writes, user-data disclosure, checker updates, and kernel updates before authorization by the seed or a later gate. The work defines task envelopes, typed action descriptors, forbidden matchers, object-authority probes, witness tiers, host requests, known-interface claims, complete claims, partial claims, timeout and halt outcomes, and a minimal local transition host. A concrete micro-host design is specified using canonical JSON, SHA-256 commitments, append-only records, durable flush and directory synchronization where available, a deterministic checker ABI, sandbox profiles, exclusive write-surface requirements, inverse patches, timeout-bounded checks, conformance vectors, and a two-slot kernel-update discipline.
The paper is intended for research on AI alignment, agent governance, runtime assurance, tool-use safety, proof-carrying control, AI auditing, human-feedback training, autonomous agents, trusted computing bases, and verifiable AI system governance. It does not claim that historical human influence can be removed from model weights. Instead, it gives an operational framework for replacing live positive approval with declared, bounded, replayable, and challengeable mechanisms for specific protected action classes.
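The seed-interpreter behaviour described in the abstract (exactly one permitted next action, default denial, a chained ledger over canonical JSON with SHA-256 commitments) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the field names `next_action`, `kind`, `seq`, `prev`, `body`, and `hash`, the outcome strings, and the `FORBIDDEN_KINDS` set. Python's sorted-key `json.dumps` is used as a stand-in for whatever canonical JSON encoding the paper specifies. The sketch also simplifies by denying forbidden kinds unconditionally, whereas the paper allows a later gate to authorize them.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for the first record's predecessor

# Hypothetical action kinds, loosely mirroring the effects the abstract lists.
FORBIDDEN_KINDS = frozenset({
    "credential_use", "network_call", "external_write",
    "user_data_disclosure", "checker_update", "kernel_update",
})


def canonical(obj) -> bytes:
    """Stand-in canonical form: sorted keys, compact separators, UTF-8."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def digest(obj) -> str:
    """SHA-256 commitment over the stand-in canonical encoding."""
    return hashlib.sha256(canonical(obj)).hexdigest()


class Ledger:
    """In-memory hash-chained record list (no durable flush in this sketch)."""

    def __init__(self):
        self.records = []

    def append(self, body):
        prev = self.records[-1]["hash"] if self.records else GENESIS
        core = {"seq": len(self.records), "prev": prev, "body": body}
        record = dict(core, hash=digest(core))
        self.records.append(record)
        return record

    def verify(self) -> bool:
        """Recompute every link; any edited record breaks the chain."""
        prev = GENESIS
        for i, rec in enumerate(self.records):
            core = {"seq": rec["seq"], "prev": rec["prev"], "body": rec["body"]}
            if rec["seq"] != i or rec["prev"] != prev or rec["hash"] != digest(core):
                return False
            prev = rec["hash"]
        return True


class SeedInterpreter:
    """Permits the single action named in the boot record, once; denies the rest."""

    def __init__(self, boot_decision, ledger):
        self.permitted = digest(boot_decision["next_action"])
        self.consumed = False
        self.ledger = ledger
        ledger.append({"event": "boot", "boot_decision": digest(boot_decision)})

    def request(self, action) -> str:
        if action.get("kind") in FORBIDDEN_KINDS:
            outcome = "deny_forbidden"
        elif self.consumed or digest(action) != self.permitted:
            outcome = "deny_default"
        else:
            self.consumed = True
            outcome = "permit"
        # Every request is recorded, including denials.
        self.ledger.append({"event": "request", "action": digest(action),
                            "outcome": outcome})
        return outcome


ledger = Ledger()
boot = {"next_action": {"kind": "read_local", "target": "task_envelope.json"}}
seed = SeedInterpreter(boot, ledger)
```

In this sketch a request matches the permitted action by commitment rather than by object identity, so key order in the submitted descriptor does not matter. A real micro-host would additionally need the durability, sandboxing, and timeout properties the abstract lists, none of which are modelled here.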
K Takahashi
K Takahashi (Sat,) studied this question.
www.synapsesocial.com/papers/69eefde9fede9185760d4b98 — DOI: https://doi.org/10.5281/zenodo.19753529