May 2, 2026 · Open Access

Closing the Execution Gap in LLM Agent Systems: Empirical Evidence for Compliant Drift, Partial Observability, and Integrated Runtime

Key Points

The aim is to empirically demonstrate the Execution Gap in LLM agent systems and propose methods to close it.
Implemented the complete Agent Governance Stack as a Python library instrumented into a LangGraph StateGraph.
Conducted four experiments isolating different dimensions of the Execution Gap.
Analyzed compliant drift, partial observability, and multi-agent coordination characteristics.
Compliant drift was observed: with the MockLLM, the detection threshold θ = 0.20 was crossed at T* ∈ [259, 403] steps while the enforcement signal stayed at zero.
The RAM gate achieved zero error rate across all state-coverage levels in partial observability experiments.
Full stack integration kept D̂ bounded in [0.27, 0.34] over 2,000 steps, with 49.5% of HALT events resolved by the Recovery Loop and no deadlock.

Abstract

A correctly governed agent system can still fail. An agent may select only actions that each individually satisfy every applicable rule, while its behavioral trajectory drifts silently toward high-risk territory. We call the structural interval in which these failures occur the Execution Gap: the space between what governance validates at decision boundaries and what agents actually do in execution. Existing approaches (prompt guards, OPA/XACML policy engines, Constitutional AI, and audit layers) are structurally incapable of closing this gap. They evaluate actions locally and statelessly; the Execution Gap is a trajectory-level, stateful phenomenon. This paper provides the first empirical demonstration that the Execution Gap is real, measurable, and closeable. We implement the complete Agent Governance Stack (Papers P0–P6: atomic decision boundaries, stateful admission control via ACP, invariant measurement via IML, governance structure, and reconstructive authority via RAM) as a Python library instrumented into a LangGraph StateGraph, and run four experiments that each isolate one dimension of the gap.

Key results:

Compliant drift (Exp. 1 + 1b): The enforcement signal g(τ) remains identically zero across all 2,700 drift steps (6 seeds × 450 steps) with the MockLLM, while the IML composite D̂ grows monotonically and crosses the detection threshold θ = 0.20 at T* ∈ [259, 403] steps: direct experimental proof that compliant drift is real. Replicated with two real LLMs (mistral-small3.1, T* = 64; deepseek-r1:8b, T* = 65; g(τ) = 0 throughout for both), confirming the finding is architectural, not model-specific.

Partial observability (Exp. 2): The RAM gate achieves IER = 0.000 at every state-coverage level (0.10–1.00), versus baseline IER ∈ [0.032, 0.185] for attestation and always-execute strategies (10,000 Monte Carlo samples per level).

Multi-agent coordination (Exp. 3): ACP replicates the formal bound CW_appr = 2N with zero deviation for N ∈ {2, 4, 8, 16} agents, confirming the result is framework-independent.

Full stack integration (Exp. 4): The integrated ACP + IML + RAM + RecoveryLoop stack converges with D̂ bounded in [0.27, 0.34] over 2,000 steps; liveness holds (49.5% of HALT events resolved by Recovery Loop); no deadlock.

Beyond confirmation, the implementation surfaces three refinements to the formal theory: the ACP baseline-RS assumption, liveness-rate classification for the conditional liveness theorem, and EMA convergence parametrization. The open-source implementation provides a deployable blueprint for practitioners integrating runtime governance into LangGraph-based agent systems.

Code and data: https://github.com/chelof100/agent-governance-applied

This is Paper 7 of the Agent Governance Series (P0–P7; Paper 8 on scale and heterogeneity is in preparation). Related papers: P0 (arXiv:2604.17511), P1/ACP (arXiv:2603.18829), P2/IML (arXiv:2604.17517), P5/RAM (arXiv:2604.22898).
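
The compliant-drift result in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's ACP or IML code. It assumes a hypothetical stateless per-action rule (`STEP_LIMIT`) and a hypothetical EMA-based trajectory score (smoothing factor `ALPHA`). Only the threshold θ = 0.20, the 450 steps per run, and the six seeds are taken from the abstract. It shows how a local check can report zero violations while a stateful score crosses the threshold.

```python
import random

THETA = 0.20       # detection threshold quoted in the abstract
STEP_LIMIT = 0.05  # hypothetical per-action rule: each risk increment must stay below this
ALPHA = 0.05       # hypothetical EMA smoothing factor (not the paper's parametrization)


def local_violation(delta: float) -> int:
    """Stateless check: 1 if this single action breaks the rule, else 0."""
    return 1 if abs(delta) > STEP_LIMIT else 0


def simulate_drift(steps: int = 450, seed: int = 0):
    """Return (violation count, final EMA score, first step above THETA or None)."""
    rng = random.Random(seed)
    risk = 0.0        # cumulative trajectory-level risk
    d_hat = 0.0       # EMA of the deviation from a zero-risk baseline
    violations = 0    # toy stand-in for the enforcement signal
    t_star = None
    for t in range(1, steps + 1):
        delta = rng.uniform(0.0, 0.002)  # tiny increment, always within STEP_LIMIT
        violations += local_violation(delta)
        risk += delta
        d_hat = (1 - ALPHA) * d_hat + ALPHA * risk
        if t_star is None and d_hat > THETA:
            t_star = t
    return violations, d_hat, t_star


if __name__ == "__main__":
    for seed in range(6):  # six seeds, mirroring the abstract's setup
        violations, d_hat, t_star = simulate_drift(seed=seed)
        print(f"seed={seed} violations={violations} d_hat={d_hat:.3f} t_star={t_star}")
```

Because every increment is non-negative, the EMA never decreases in this toy model, so the score rises monotonically even though no single step is flagged. The crossing step it prints depends entirely on the assumed increment size and `ALPHA`, so it should not be compared with the T* values reported in the paper.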
