Autoregressive transformers make high-confidence errors, but activation monitoring can catch them only if the model preserves an internal signal that output confidence does not expose. We show that this preservation is determined by architecture and training recipe. Confidence controls absorb on average 58% of raw probe-loss correlation across 25 models (16 cross-family, 9 Pythia) in 7 families. A residual signal survives, and part of it is output-independent: trained predictors on the final-layer representation do not recover it. But the signal is not guaranteed. In Pythia's controlled suite, every run sharing the (24 layers, 16 heads) configuration collapses to partial correlation near 0.10 across a 3.5x parameter gap and two Pile variants, while six other configurations occupy a separated healthy band (0.21 to 0.38). The output-controlled residual collapses at the same points: the architecture effect is not a failure of linear readout but the disappearance of the output-independent component that activation monitors need. Checkpoint dynamics show this collapse is training-emergent. Both matched-width configurations form and strengthen the quality signal at the earliest measured checkpoint. Training erases it in the (24L, 16H) class while the healthy configuration recovers, at matched final perplexity. The erasure is selective: the output-independent fraction drops from 36% to 3% in the collapsed configuration while growing from 33% to 49% in the healthy one. Perplexity improves smoothly through the observability dip. The collapse reproduces observationally across families at recipe-dependent configurations: Qwen 2.5 and Llama differ by 2.9x at matched 3B scale with non-overlapping probe-seed distributions. Nonlinear probes and alternative layer choices do not recover healthy-range signal where collapse occurs. 
A WikiText-trained observer catches a non-overlapping downstream QA error class without task-specific training; at 20% flag rate, seven of nine model-task cells fall between 10.9% and 13.4%. Architecture selection is a monitoring decision.

v3.2.0 adds checkpoint dynamics (new experiment: 10 checkpoints for two matched-width Pythia configurations), a dynamics appendix table, Figure 5, three new related-work citations (Azaria & Mitchell 2023, Han et al. 2025, Goldowsky-Dill et al. 2025), GPT-2 model citation, Kossen et al. venue corrected to ICLR 2025, and precision fixes throughout. 46 references, 264 macros, 8 tables, 8 figures, 32 pages.

Code: https://github.com/tmcarmichael/nn-observability
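The abstract's central quantity is a partial correlation between a probe's score and per-token loss after controlling for output confidence. The following is a minimal sketch of that idea, not the paper's implementation: it assumes a linear (Pearson) partial correlation computed by residualizing both variables on the controls, and it assumes that "absorbed" means the relative drop from the raw to the partial correlation. The function names, the choice of controls, and the synthetic data are all hypothetical; the linked repository is the authoritative source for the actual estimator.

```python
import numpy as np


def residualize(y, controls):
    """Return y minus its least-squares fit on the controls plus an intercept."""
    X = np.column_stack([np.ones(len(y)), controls])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta


def partial_corr(probe_score, loss, controls):
    """Pearson correlation of probe score and loss after removing the controls.

    Assumption: a linear partial correlation; the paper may use a different
    estimator (for example a rank-based one).
    """
    r_probe = residualize(probe_score, controls)
    r_loss = residualize(loss, controls)
    return float(np.corrcoef(r_probe, r_loss)[0, 1])


def absorbed_fraction(probe_score, loss, controls):
    """Share of the raw correlation that disappears once controls are applied.

    Assumption: this definition of "absorbed" is illustrative only.
    """
    raw = float(np.corrcoef(probe_score, loss)[0, 1])
    return 1.0 - partial_corr(probe_score, loss, controls) / raw


def make_synthetic(n=5000, seed=0):
    """Hypothetical data: loss depends on confidence and on a hidden factor
    that the probe can see but output confidence cannot."""
    rng = np.random.default_rng(seed)
    confidence = rng.normal(size=n)
    hidden = rng.normal(size=n)
    loss = -confidence + 0.5 * hidden + 0.5 * rng.normal(size=n)
    probe_score = -confidence + 0.5 * hidden + 0.5 * rng.normal(size=n)
    return probe_score, loss, confidence
```

In this toy setup the raw correlation is high because both variables track confidence; the partial correlation is lower but stays positive because the hidden factor survives the control. A collapsed configuration in the abstract's sense would correspond to the case where that surviving component is close to zero.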
Thomas Carmichael
Thomas Carmichael studied this question.
www.synapsesocial.com/papers/69f04eb8727298f751e729a2 — DOI: https://doi.org/10.5281/zenodo.19802197