What question did this study set out to answer?

The study investigates whether the confidence that language models report distinguishes knowledge they hold reliably from knowledge they recall only intermittently.

March 22, 2026 · Open Access

The Propagation Gap: Hidden-State Reliability Signals Do Not Reach Language Model Outputs

Key points

Developed CI-Bench, a 338-question factual benchmark that labels each question, relative to a given model, as Known, Depth-ignorant, or Coverage-ignorant.
Compared how these labels fall across six language model architectures.
Conducted behavioral screening to measure differences in confidence reporting.
Used hidden-state probes to decode reliability signals from model layers.
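Depth ignorance rates vary widely across models, from 0.9% to 28.1%, and no question is depth-ignorant for every model.
Four of the five models with sufficient depth-ignorant items reported confidence within 3 percentage points of their confidence on Known items, despite accuracy gaps exceeding 40 points.
Hidden-state probes in three open-weight models decoded a Known-vs-Depth-ignorant reliability signal at AUROC ≈ 0.76, concentrated in middle transformer layers.
Two single-pass output readouts (logit-distribution features and verbalized confidence) plateaued at AUROC ≈ 0.56–0.63, a propagation gap of 0.18–0.19 AUROC (P < 0.001).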
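The propagation gap in the points above is a difference between two AUROC values computed on the same Known-vs-Depth-ignorant items. The following is a minimal sketch of that arithmetic, not the released code: the function names `auroc` and `propagation_gap` are hypothetical, and the scores are made-up toy values chosen only to exercise the functions.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC as the probability that a positive item outscores a negative one.

    labels: 1 for Known (K) items, 0 for Depth-ignorant (D) items.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one K item and one D item")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def propagation_gap(
    probe_scores: Sequence[float],
    output_scores: Sequence[float],
    labels: Sequence[int],
) -> float:
    """Hidden-state probe AUROC minus output-readout AUROC on the same items."""
    return auroc(probe_scores, labels) - auroc(output_scores, labels)


# Toy values for illustration only; these are NOT results from the study.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_probe = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]   # hypothetical probe scores
toy_output = [0.9, 0.6, 0.5, 0.8, 0.7, 0.4]  # hypothetical verbalized confidence

print(propagation_gap(toy_probe, toy_output, toy_labels))
```

The pairwise form is quadratic in the number of items, which is acceptable at the scale of a 338-question benchmark; a rank-based implementation would be the usual choice for larger sets.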

Abstract

Language models report near-identical confidence on questions they reliably answer and questions they answer only intermittently. We introduce CI-Bench, a 338-question factual benchmark that assigns model-relative labels to each question: Known (K, ≥90% stochastic accuracy), Depth-ignorant (D, 20–80%), or Coverage-ignorant (C, <20%). The taxonomy distinguishes unreliable knowledge (D) from absent knowledge (C), a distinction that standard benchmarks conflate. Across six architectures, depth ignorance is model-relative: no question is classified D by all models, and D rates range from 0.9% to 28.1%. Behavioural screening shows that four of five models with sufficient D items report confidence within 3 percentage points of their K-item confidence, despite accuracy gaps exceeding 40 points. Hidden-state probes in three open-weight models decode a K-vs-D reliability signal at AUROC ≈ 0.76, concentrated in middle transformer layers. Surface-form controls confirm that this signal is absent from input features. Two structurally different single-pass output readouts (logit-distribution features and verbalized confidence) both plateau at AUROC ≈ 0.56–0.63, yielding a propagation gap of 0.18–0.19 AUROC (P < 0.001). The models encode reliability information that their output channels do not surface. CI-Bench, along with the labelling protocol and all code, is publicly released at github.com/raeq/propagation-gap.
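The model-relative labels can be illustrated with a short sketch that maps a question's stochastic accuracy (correct answers out of repeated samples) to K, D, or C using the thresholds stated in the abstract. This is an assumption-laden illustration rather than the released labelling protocol: the function name `ci_label` is hypothetical, the 20% and 80% bounds are assumed to be inclusive, and because the abstract does not say how items between 80% and 90% are treated, the sketch leaves them unlabelled.

```python
from typing import Optional


def ci_label(correct: int, samples: int) -> Optional[str]:
    """Map stochastic accuracy (correct / samples) to a CI-Bench-style label.

    Returns "K" (>= 90%), "D" (20-80%, bounds assumed inclusive),
    "C" (< 20%), or None for the 80-90% band, which the abstract
    does not assign to any category.
    """
    if samples <= 0 or not 0 <= correct <= samples:
        raise ValueError("need samples > 0 and 0 <= correct <= samples")
    # Compare 100 * correct against threshold * samples so that the
    # boundary cases are decided in exact integer arithmetic.
    scaled = 100 * correct
    if scaled >= 90 * samples:
        return "K"
    if scaled < 20 * samples:
        return "C"
    if scaled <= 80 * samples:
        return "D"
    return None  # assumption: 80-90% accuracy is left unlabelled


# Example: one model answers a question correctly in 5 of 10 sampled runs.
print(ci_label(5, 10))
```

Because the label depends on one model's sampled accuracy, the same question can be K for one model and D or C for another, which is what makes the taxonomy model-relative.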


Cite This Study

Richard Quinn studied this question.

synapsesocial.com/papers/69bf8978f665edcd009e9314 https://doi.org/10.5281/zenodo.19122015
