What question did this study set out to answer?

To explore the behavior and detection of model errors in instruction-tuned language models.

March 22, 2026 · Open Access

Silent Commitment Failure in Instruction-Tuned Language Models: Evidence of Governability Divergence Across Architectures

Key Points

Evaluated six language models across twelve reasoning capability domains.
Measured silent commitment failures and detection rates of model errors.
Assessed governability using a Detection & Correction Matrix.
Two out of three instruction-following models exhibited silent commitment failure without predictive warnings.
One model provided detectable conflict signals sufficient for intervention.
Detection rates depended on inference configuration (greedy decoding vs. temperature sampling) and on pretraining architecture.

Abstract

Rev 2 (March 2026): Updated Section 7.7 with training regime/geometry explanation, revised Section 8.2 model selection guidance, updated Figure 1 and Figure 4 captions, corrected conclusion.

As large language models (LLMs) are deployed as autonomous agents with tool execution privileges, a critical assumption underpins their security architecture: that model errors are detectable at runtime, allowing monitoring systems, guardrails, or human reviewers to intervene before the agent acts. We present empirical evidence from a preliminary cohort of LLMs that this assumption fails for two of the three instruction-following models evaluable for conflict detection; broader replication across additional architectures and scales is needed to establish prevalence rates. We introduce the concept of governability (the degree to which a model's errors are detectable before output commitment and correctable once detected) and demonstrate that it is an empirically measurable property that varies widely across models and capability domains. In our evaluation of six language models across twelve reasoning capability domains, those two models exhibited silent commitment failure: confident, fluent, incorrect output with zero predictive warning signal. The remaining model produced a detectable conflict signal 57 tokens before commitment under greedy decoding, sufficient for pre-action intervention. Under temperature sampling, the warning margin when detected remains consistent (~58 tokens), but the detection rate drops from 100% to 34%, indicating that governability is a property of the model together with its inference configuration. We further show that benchmark accuracy does not predict governability, that correction capacity varies independently of detection, and that identical governance scaffolds produce divergent effects across models: improving one, having no effect on another, and degrading a third.
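The two quantities reported above, detection rate and warning margin, can be illustrated with a minimal sketch. It assumes each failing generation is logged with the token index at which a conflict signal first fired (if any) and the token index at which the incorrect output was committed. The `Run` record, its field names, and the `summarize` helper are hypothetical and are not taken from the paper's reference implementation.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Run:
    """One failing generation (hypothetical record layout)."""
    signal_token: Optional[int]  # index where a conflict signal first fired, or None
    commit_token: int            # index where the incorrect output was committed


def warning_margin(run: Run) -> Optional[int]:
    """Tokens of lead time before commitment, or None if no signal came first."""
    if run.signal_token is None or run.signal_token >= run.commit_token:
        return None  # silent failure: nothing fired before the commitment
    return run.commit_token - run.signal_token


def summarize(runs: Sequence[Run]) -> Tuple[float, Optional[float]]:
    """Return (detection rate, mean warning margin over detected runs)."""
    if not runs:
        raise ValueError("at least one run is required")
    margins = [m for m in map(warning_margin, runs) if m is not None]
    rate = len(margins) / len(runs)
    return rate, (mean(margins) if margins else None)
```

Reporting the margin only over detected runs mirrors the abstract's phrasing ("the warning margin when detected"), which is why a configuration can keep a stable margin while its detection rate falls.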
We additionally demonstrate through controlled experimentation that the conflict detection signal, which we term the "authority band," is determined by pretraining architecture and cannot be created or removed through light fine-tuning. In a 2×2 experiment varying architecture (Phi-3 vs. Mistral) and training adaptation (baseline vs. LoRA), we observe a 52× difference in spike ratio between architectures but only ±0.32× variation from fine-tuning, suggesting governability is a geometric property fixed at pretraining. We propose a governability assessment framework that includes a Detection & Correction Matrix. The reference implementation of the trajectory-tension detector is available upon request for research validation.
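As a rough illustration of how a Detection & Correction Matrix (named in the Key Points) might place a model, the sketch below sorts a model-and-domain pair into one of four cells by its detection rate and correction rate. The paper's actual matrix definition is not given on this page, so the function name `matrix_cell`, the 0.5 threshold, and the cell labels are all assumptions made for illustration.

```python
def matrix_cell(detection_rate: float,
                correction_rate: float,
                threshold: float = 0.5) -> str:
    """Place a model in a 2x2 detection/correction grid.

    The threshold and the labels are hypothetical; they are not the
    paper's definitions.
    """
    for name, value in (("detection_rate", detection_rate),
                        ("correction_rate", correction_rate)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")

    # The two axes are scored independently, reflecting the abstract's
    # claim that correction capacity varies independently of detection.
    detect = "detectable" if detection_rate >= threshold else "silent"
    correct = "correctable" if correction_rate >= threshold else "not correctable"
    return f"{detect}, {correct}"
```

Under these assumed thresholds, a model with a 100% detection rate falls on the "detectable" side, while the same model at a 34% detection rate falls on the "silent" side, which is one way to read the claim that governability depends on inference configuration.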
