Rev 2 (March 2026): Updated Section 7.7 with training regime/geometry explanation, revised Section 8.2 model selection guidance, updated Figure 1 and Figure 4 captions, corrected conclusion.
As large language models (LLMs) are deployed as autonomous agents with tool execution privileges, a critical assumption underpins their security architecture: that model errors are detectable at runtime, allowing monitoring systems, guardrails, or human reviewers to intervene before the agent acts. We present empirical evidence from a preliminary cohort of LLMs that this assumption fails for two of three instruction-following models evaluable for conflict detection; broader replication across additional architectures and scales is needed to establish prevalence rates. We introduce the concept of governability, the degree to which a model's errors are detectable before output commitment and correctable once detected, and demonstrate that it is an empirically measurable property that varies dramatically across models and capability domains. In our evaluation of six language models across twelve reasoning capability domains, two of the three instruction-following models evaluable for conflict detection exhibited silent commitment failure: confident, fluent, incorrect output with zero predictive warning signal. The remaining model produced a detectable conflict signal 57 tokens before commitment under greedy decoding, which is sufficient for pre-action intervention. Under temperature sampling, the warning margin when detected remains consistent (~58 tokens), but the detection rate drops from 100% to 34%, indicating that governability is a property of the model together with its inference configuration. We further show that benchmark accuracy does not predict governability, that correction capacity varies independently of detection, and that identical governance scaffolds produce opposite effects across models: improving one, having no effect on another, and degrading a third.
We additionally demonstrate through controlled experimentation that the conflict detection signal, which we term the "authority band," is determined by pretraining architecture and cannot be created or removed through light fine-tuning. In a 2×2 experiment varying architecture (Phi-3 vs Mistral) and training adaptation (baseline vs LoRA), we observe a 52× difference in spike ratio between architectures but only ±0.32× variation from fine-tuning, suggesting governability is a geometric property fixed at pretraining. We propose a governability assessment framework consisting of a Detection … The reference implementation of the trajectory-tension detector is available upon request for research validation.
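To make the two reported quantities concrete, the sketch below shows one way a detection rate and a warning margin (tokens between the first alarm and output commitment) could be computed from per-token scores. It is illustrative only and is not the trajectory-tension detector: the `Run` record, the `signal` scores, the `commit_index` field, and the fixed `threshold` are all assumptions introduced here, since the abstract does not specify how the signal is computed.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Run:
    """One generation run (hypothetical record format, not from the paper)."""
    signal: Sequence[float]  # assumed per-token conflict score, precomputed elsewhere
    commit_index: int        # assumed token index at which the output is committed


def first_alarm(run: Run, threshold: float) -> Optional[int]:
    """Return the first pre-commitment token index whose score exceeds threshold."""
    for i, score in enumerate(run.signal[: run.commit_index]):
        if score > threshold:
            return i
    return None


def summarize(runs: List[Run], threshold: float) -> Tuple[float, Optional[float]]:
    """Return (detection rate, mean warning margin in tokens over detected runs)."""
    margins = []
    for run in runs:
        idx = first_alarm(run, threshold)
        if idx is not None:
            # Warning margin: tokens remaining between the alarm and commitment.
            margins.append(run.commit_index - idx)
    detection_rate = len(margins) / len(runs) if runs else 0.0
    mean_margin = sum(margins) / len(margins) if margins else None
    return detection_rate, mean_margin


# Toy data: the first run raises an alarm 2 tokens before commitment,
# the second run never crosses the threshold (a "silent" run).
runs = [Run([0.1, 0.9, 0.2, 0.1], 3), Run([0.1, 0.1, 0.1], 2)]
rate, margin = summarize(runs, threshold=0.5)
```

Reporting the margin only over detected runs mirrors the abstract's phrasing ("the warning margin when detected"), which lets the detection rate fall under sampling while the margin stays roughly constant.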
Gregory Ruddell
Gregory Ruddell (Fri,) studied this question.
www.synapsesocial.com/papers/69bf3924c7b3c90b18b435c0 — DOI: https://doi.org/10.5281/zenodo.19140759