What question did this study set out to answer?

This study investigates whether the errors of instruction-tuned language models are detectable before output commitment and correctable once detected, a property the authors call governability.

March 14, 2026 | Open Access

Silent Commitment Failure in Instruction-Tuned Language Models: Evidence of Governability Divergence Across Architectures

Key Points

Evaluated six language models across twelve reasoning capability domains.
Conducted a controlled experiment varying architecture and training adaptation.
Observed silent commitment failure in instruction-following models: confident, fluent, incorrect output with no predictive warning signal.
Two of the three instruction-following models evaluable for conflict detection exhibited silent commitment failure.
The remaining model produced a detectable conflict signal 57 tokens before commitment, sufficient for pre-action intervention.
Benchmark accuracy did not predict governability, which varied widely across models and capability domains.

Abstract

As large language models (LLMs) are deployed as autonomous agents with tool execution privileges, a critical assumption underpins their security architecture: that model errors are detectable at runtime, allowing monitoring systems, guardrails, or human reviewers to intervene before the agent acts. We present empirical evidence from a preliminary cohort of LLMs that this assumption fails for two of three instruction-following models evaluable for conflict detection; broader replication across additional architectures and scales is needed to establish prevalence rates. We introduce the concept of governability — the degree to which a model's errors are detectable before output commitment and correctable once detected — and demonstrate that it is an empirically measurable property that varies dramatically across models and capability domains. In our evaluation of six language models across twelve reasoning capability domains, two out of three instruction-following models evaluable for conflict detection exhibited silent commitment failure: confident, fluent, incorrect output with zero predictive warning signal. The remaining model produced a detectable conflict signal 57 tokens before commitment — sufficient for pre-action intervention. We further show that benchmark accuracy does not predict governability, that correction capacity varies independently of detection, and that identical governance scaffolds produce opposite effects across models — improving one, having no effect on another, and degrading a third. We additionally demonstrate through controlled experimentation that the conflict detection signal — what we term the "authority band" — is determined by pretraining architecture and cannot be created or removed through light fine-tuning. 
In a 2×2 experiment varying architecture (Phi-3 vs. Mistral) and training adaptation (baseline vs. LoRA), we observe a 52× difference in spike ratio between architectures but only ±0.32× variation from fine-tuning, suggesting governability is a geometric property fixed at pretraining. We propose a governability assessment framework consisting of a Detection [...]. The reference implementation of the trajectory-tension detector is available upon request for research validation.
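
The abstract does not define how the spike ratio or the 57-token detection lead are computed, so the following is only an illustrative sketch under stated assumptions, not the authors' trajectory-tension detector. It assumes a per-token scalar signal is already available, takes the spike ratio to be the peak divided by the median of an initial baseline window, and takes the detection lead to be the number of tokens between the first threshold crossing and the commitment token. The function names, the threshold, and the toy signals are all hypothetical.

```python
from statistics import median
from typing import Optional, Sequence


def spike_ratio(signal: Sequence[float], baseline_len: int) -> float:
    """Peak of the signal divided by the median of an initial baseline window.

    Assumed definition for illustration; the paper's own definition is not
    given in the abstract.
    """
    if not 0 < baseline_len <= len(signal):
        raise ValueError("baseline_len must be between 1 and len(signal)")
    base = median(signal[:baseline_len])
    if base <= 0:
        raise ValueError("baseline median must be positive")
    return max(signal) / base


def detection_lead(
    signal: Sequence[float], commit_index: int, threshold: float
) -> Optional[int]:
    """Tokens between the first threshold crossing and the commitment token.

    Returns None when the signal never reaches the threshold before
    commitment, which corresponds to the "silent" case.
    """
    for i, value in enumerate(signal[:commit_index]):
        if value >= threshold:
            return commit_index - i
    return None


# Toy, hand-made signals (not data from the paper). The spike position is
# chosen so that the lead equals the 57-token figure quoted in the abstract.
quiet = [1.0] * 80
spiky = [1.0] * 20 + [6.0] + [1.0] * 59

print(detection_lead(quiet, commit_index=77, threshold=3.0))  # None
print(detection_lead(spiky, commit_index=77, threshold=3.0))  # 57
print(spike_ratio(spiky, baseline_len=10))                    # 6.0
```

Under these assumptions, a model with silent commitment failure would look like the `quiet` trace: no threshold crossing before the commitment token, so a runtime monitor has nothing to act on.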
