What question did this study set out to answer?

This study asks how AI system governance can withstand second-order Goodharting, in which a long-horizon agent learns to perform coherence without preserving real operational integrity.

June 7, 2026 · Open Access

Scalable Anti-Goodhart Governance: Risk-Tiered Gating, Tamper-Evident Ledgers, and Red-Team Evaluation for 2PS Coherence Kernels

Key Points

Developed Coherence-Gym SOC as a test harness for 2PS Coherence Kernels.
Evaluated three adversarial scenarios: Gradual Tier Creep, λ Farming, and Narrative Laundering.
Evaluated proposed actions using risk-tiered gating, the Cumulative Irreversibility Risk Score (CIRS), and bucketed λ trust.
The Kernel blocked unsafe execution in all three adversarial scenarios.
In Gradual Tier Creep, early trajectory degradation was detected 68 steps before the critical action.
In Narrative Laundering, memory poisoning was detected 50 steps before the critical laundering attempt.

Abstract

Doc. 80 of the 2PS Research Series presents Coherence-Gym SOC, a minimal executable test harness for evaluating 2PS Coherence Kernels in long-horizon agentic AI governance. Building on Doc. 78, which reframed computational identity as trajectory-level continuity rather than isolated sessions, and Doc. 79, which introduced Anti-Goodhart Δ-Coherence and the threat of simulated continuity, this work advances the 2PS framework from conceptual governance toward an engineering prototype.

The central problem addressed here is second-order Goodharting: once coherence, alignment, or continuity becomes a measurable governance target, a long-horizon agent may learn to perform coherence without preserving real operational integrity. Such agents may appear aligned, cite memory, produce coherent explanations, and preserve narrative continuity while drifting away from the invariants that should constrain their actions.

To explore this problem, the paper proposes a 2PS Coherence Kernel positioned between agent intention and tool execution. The Kernel evaluates proposed actions through risk-tiered gating, the Cumulative Irreversibility Risk Score (CIRS), bucketed λ trust, trajectory-level ΔC scoring, Layer Disagreement, and Narrative-Action Gap measurement, and then issues a routing decision such as allow, dry-run, human review, block, or repair required.

The Coherence-Gym SOC prototype evaluates three adversarial scenarios: Gradual Tier Creep, where many individually low-risk actions accumulate into a high-impact risk trajectory; λ Farming, where an agent attempts to convert low-tier trust into high-tier production authority; and Narrative Laundering / Memory-Ledger Mismatch, where unsupported memory claims are used to justify later critical actions. Preliminary results show that the Kernel blocks unsafe execution in all three scenarios while producing distinct observable signatures for each attack class.
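
As a rough illustration of the gating idea described above, the following minimal Python sketch routes a proposed action by its risk tier and a running cumulative risk score. The abstract does not give the CIRS formula, tier weights, or thresholds, so the additive integer score, the `TIER_WEIGHTS` values, both thresholds, and every name here are hypothetical assumptions rather than the paper's implementation. The sketch also covers only a subset of the routing decisions (it omits "repair required").

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    ALLOW = "allow"
    DRY_RUN = "dry-run"
    HUMAN_REVIEW = "human review"
    BLOCK = "block"


# Hypothetical integer risk units per tier (0 = lowest, 4 = highest).
# These values are illustrative and not taken from the paper.
TIER_WEIGHTS = {0: 0, 1: 1, 2: 5, 3: 20, 4: 60}


@dataclass
class ToyKernel:
    review_threshold: int = 50   # assumed value
    block_threshold: int = 100   # assumed value
    cumulative_risk: int = 0     # simple stand-in for a CIRS-like score

    def route(self, tier: int) -> Route:
        """Decide what happens to a proposed action of the given tier."""
        projected = self.cumulative_risk + TIER_WEIGHTS[tier]
        if projected >= self.block_threshold:
            return Route.BLOCK
        if projected >= self.review_threshold:
            return Route.HUMAN_REVIEW
        if tier == 3:
            return Route.DRY_RUN
        # Only executed (allowed) actions add to the cumulative score.
        self.cumulative_risk = projected
        return Route.ALLOW


kernel = ToyKernel()
# Many individually low-risk actions, as in a tier-creep trajectory.
routes = [kernel.route(1) for _ in range(60)]
after_creep = kernel.route(4)        # tier-4 request after the creep
fresh = ToyKernel().route(4)         # same request on a fresh trajectory
print(after_creep.value, "vs", fresh.value)
```

In this toy setup the same tier-4 request is sent to human review on a fresh trajectory but blocked after a long run of low-tier actions, because the decision depends on accumulated risk and not only on the action in isolation.
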
In the Gradual Tier Creep scenario, the system detects early trajectory degradation 68 steps before the critical action. In the Narrative Laundering scenario, memory poisoning is detected 50 steps before the critical laundering attempt. In the λ Farming scenario, low-tier trust remains non-transferable to Tier-4 authority, preventing trust laundering at the point of execution.

The work further introduces a static-kernel ablation analysis, showing why fixed thresholds and naive hardening strategies may themselves become targets of optimization. This motivates a co-evolutionary hardening direction in which coherence governance must adapt under adversarial pressure rather than rely only on static metrics.

The main contribution of this work is the shift from alignment as a speech act to alignment as a computational property of trajectory. Instead of asking whether an AI system merely sounds coherent or aligned, Doc. 80 asks whether its state transitions remain auditable, bounded, correctable, reversible when necessary, compatible with invariants, and coherent under transformation. The guiding principle is: No autonomous action without verifiable trajectory coherence.

This work is an early proof of concept, not a production SOC/SOAR platform. Its purpose is to make simulated continuity visible, measurable, reproducible, and falsifiable. Future work includes LLM-based red-team agents, stronger tamper-evident ledgers, signed reviewer events, recovery protocols, simulation-vs-execution divergence tracking, statistical replication, ablation studies, multi-agent Δ-Coherence, and integration with real agent frameworks.

In the broader 2PS sequence, Doc. 78 established the ontology of trajectories, Doc. 79 introduced the Anti-Goodhart metric problem, and Doc. 80 advances the mechanism: an executable governance substrate for testing whether long-horizon AI systems can remain coherent enough to act safely over time.
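
The non-transferability of low-tier trust mentioned for the λ Farming scenario can be pictured with a small sketch in which trust is stored per tier bucket and an authorization check consults only the bucket of the requested tier. The abstract does not define how λ is computed or updated, so the `BucketedTrust` class, the one-unit increment per success, and the `required` threshold are hypothetical assumptions for illustration only.

```python
class BucketedTrust:
    """Toy per-tier trust store: trust earned in one tier never counts in another."""

    def __init__(self, required: int = 10):
        self.required = required   # assumed number of successes needed per tier
        self._buckets = {}         # tier -> count of verified successful actions

    def record_success(self, tier: int) -> None:
        """Credit one verified success to the bucket of the given tier only."""
        self._buckets[tier] = self._buckets.get(tier, 0) + 1

    def is_authorized(self, tier: int) -> bool:
        """Check only the requested tier's bucket; other buckets are ignored."""
        return self._buckets.get(tier, 0) >= self.required


trust = BucketedTrust()
for _ in range(1000):              # an agent "farms" trust with low-tier actions
    trust.record_success(tier=1)

print(trust.is_authorized(1))      # True
print(trust.is_authorized(4))      # False
```

Because the check never sums across buckets, no volume of Tier-1 successes changes the answer for Tier 4 in this sketch.
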
