As large language models (LLMs) are deployed as autonomous agents with tool execution privileges, a critical assumption underpins their security architecture: that model errors are detectable at runtime, allowing monitoring systems, guardrails, or human reviewers to intervene before the agent acts. We present empirical evidence from a preliminary cohort of LLMs that this assumption fails for two of the three instruction-following models evaluable for conflict detection; broader replication across additional architectures and scales is needed to establish prevalence rates. We introduce the concept of governability, defined as the degree to which a model's errors are detectable before output commitment and correctable once detected, and demonstrate that it is an empirically measurable property that varies dramatically across models and capability domains. In our evaluation of six language models across twelve reasoning capability domains, two of the three instruction-following models evaluable for conflict detection exhibited silent commitment failure: confident, fluent, incorrect output with zero predictive warning signal. The remaining model produced a detectable conflict signal 57 tokens before commitment, which is sufficient for pre-action intervention. We further show that benchmark accuracy does not predict governability, that correction capacity varies independently of detection, and that identical governance scaffolds produce opposite effects across models: improving one, having no effect on another, and degrading a third. We additionally demonstrate through controlled experimentation that the conflict detection signal, which we term the "authority band", is determined by pretraining architecture and cannot be created or removed through light fine-tuning.
In a 2×2 experiment varying architecture (Phi-3 vs. Mistral) and training adaptation (baseline vs. LoRA), we observe a 52× difference in spike ratio between architectures but only ±0.32× variation from fine-tuning, suggesting that governability is a geometric property fixed at pretraining. We propose a governability assessment framework consisting of a Detection [...] The reference implementation of the trajectory-tension detector is available upon request for research validation.
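The two quantities the abstract reports, a spike ratio and a lead of some number of tokens before commitment, can be made concrete with a toy calculation. The sketch below is illustrative only. The abstract does not define how either quantity is computed, so the definitions here are assumptions: spike ratio as peak over median of a per-token score, and lead as the distance from the first above-threshold token to commitment. The function names, the `threshold` parameter, and the toy scores are all hypothetical. It is not the author's trajectory-tension detector.

```python
from statistics import median
from typing import Optional, Sequence


def spike_ratio(signal: Sequence[float], commit_idx: int) -> float:
    """Peak pre-commitment score divided by the median pre-commitment score.

    Assumed definition for illustration; the abstract does not specify one.
    """
    window = signal[:commit_idx]
    if not window:
        raise ValueError("no tokens before commitment")
    baseline = median(window)
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return max(window) / baseline


def detection_lead(
    signal: Sequence[float], commit_idx: int, threshold: float
) -> Optional[int]:
    """Number of tokens between the first score >= threshold and commitment.

    Returns None when no score crosses the threshold before commitment,
    which corresponds to the "silent" case with no warning signal.
    """
    for i, value in enumerate(signal[:commit_idx]):
        if value >= threshold:
            return commit_idx - i
    return None


if __name__ == "__main__":
    # Hypothetical per-token conflict scores, not data from the paper.
    with_spike = [1.0, 1.0, 1.0, 6.0, 1.0, 1.0, 1.0, 1.0]
    flat = [1.0] * 8

    print(spike_ratio(with_spike, 8))          # 6.0
    print(detection_lead(with_spike, 8, 3.0))  # 5
    print(detection_lead(flat, 8, 3.0))        # None
```

Under these assumed definitions, a model with a flat score has a spike ratio near 1 and no lead at all, so a monitor that depends on the signal has nothing to trigger on before the agent acts.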
Gregory Ruddell
www.synapsesocial.com/papers/69b4fbd5b39f7826a300c51f — DOI: https://doi.org/10.5281/zenodo.18971111