Language models report near-identical confidence on questions they reliably answer and questions they answer only intermittently. We introduce CI-Bench, a 338-question factual benchmark that assigns model-relative labels to each question: Known (K, ≥90% stochastic accuracy), Depth-ignorant (D, 20–80%), or Coverage-ignorant (C, <20%). The taxonomy distinguishes unreliable knowledge (D) from absent knowledge (C), a distinction that standard benchmarks conflate. Across six architectures, depth ignorance is model-relative: no question is classified D by all models, and D rates range from 0.9% to 28.1%. Behavioural screening shows that four of five models with sufficient D items report confidence within 3 percentage points of their K-item confidence, despite accuracy gaps exceeding 40 points. Hidden-state probes in three open-weight models decode a K-vs-D reliability signal at AUROC ≈ 0.76, concentrated in middle transformer layers. Surface-form controls confirm that this signal is absent from input features. Two structurally different single-pass output readouts (logit-distribution features and verbalized confidence) both plateau at AUROC ≈ 0.56–0.63, yielding a propagation gap of 0.18–0.19 AUROC (P < 0.001). The models encode reliability information that their output channels do not surface. CI-Bench, along with the labelling protocol and all code, is publicly released at github.com/raeq/propagation-gap.
Richard Quinn (Fri,) studied this question.