What question did this study set out to answer?

The study asks how architecture and training recipe affect the observability of internal signals in autoregressive transformers.

April 28, 2026 (Open Access)

Architecture Determines Observability in Transformers

Key Points

Analyzed 25 transformer models (16 cross-family, 9 Pythia) across 7 families to assess how well activation monitoring works.
Examined checkpoint dynamics in controlled Pythia configurations: the collapse emerges during training, at matched final perplexity.
Tested nonlinear probes and alternative layer choices; neither recovered healthy-range signal where collapse occurs.
Confidence controls absorbed on average 58% of the raw probe-loss correlation; part of the surviving residual signal is output-independent.
The output-independent fraction dropped from 36% to 3% in the collapsed configuration, while growing from 33% to 49% in the healthy one.
The collapse reproduces observationally across model families at recipe-dependent configurations.
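
One way to read the 58% figure is as the share of a raw probe-loss correlation that disappears once output confidence is controlled for. The sketch below is a minimal illustration on synthetic data, not the paper's pipeline: it assumes Pearson correlation, a linear regression control, and an "absorbed fraction" defined as 1 - partial / raw. The function names and the synthetic variables (`confidence`, `hidden`) are hypothetical.

```python
import numpy as np

def residualize(y, controls):
    """Return y minus its least-squares fit on the controls plus an intercept."""
    X = np.column_stack([np.ones(len(y)), controls])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def partial_corr(probe, loss, controls):
    """Pearson correlation of probe and loss after regressing out the controls."""
    rp = residualize(probe, controls)
    rl = residualize(loss, controls)
    return float(np.corrcoef(rp, rl)[0, 1])

def absorbed_fraction(probe, loss, controls):
    """Assumed definition: share of the raw correlation removed by the controls."""
    raw = float(np.corrcoef(probe, loss)[0, 1])
    return 1.0 - partial_corr(probe, loss, controls) / raw

# Synthetic example: per-token loss depends on a confidence-like variable
# and on a hidden component that confidence does not expose.
rng = np.random.default_rng(0)
n = 5000
confidence = rng.normal(size=n)
hidden = rng.normal(size=n)
loss = confidence + hidden + 0.5 * rng.normal(size=n)

# A probe that reads both components keeps signal after the control.
probe_healthy = confidence + hidden + 0.5 * rng.normal(size=n)
# A probe that only reads confidence loses its signal after the control.
probe_collapsed = confidence + 0.5 * rng.normal(size=n)

raw_healthy = float(np.corrcoef(probe_healthy, loss)[0, 1])
partial_healthy = partial_corr(probe_healthy, loss, confidence)
partial_collapsed = partial_corr(probe_collapsed, loss, confidence)
absorbed_healthy = absorbed_fraction(probe_healthy, loss, confidence)
```

In this toy setup the healthy probe retains a clearly positive partial correlation, while the confidence-only probe falls to roughly zero, which mirrors the distinction the key points draw between healthy and collapsed configurations.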

Abstract

Autoregressive transformers make high-confidence errors, but activation monitoring can catch them only if the model preserves an internal signal that output confidence does not expose. We show that this preservation is determined by architecture and training recipe. Confidence controls absorb on average 58% of raw probe-loss correlation across 25 models (16 cross-family, 9 Pythia) in 7 families. A residual signal survives, and part of it is output-independent: trained predictors on the final-layer representation do not recover it. But the signal is not guaranteed. In Pythia's controlled suite, every run sharing the (24 layers, 16 heads) configuration collapses to partial correlation near 0.10 across a 3.5x parameter gap and two Pile variants, while six other configurations occupy a separated healthy band (0.21 to 0.38). The output-controlled residual collapses at the same points: the architecture effect is not a failure of linear readout but the disappearance of the output-independent component that activation monitors need. Checkpoint dynamics show this collapse is training-emergent. Both matched-width configurations form and strengthen the quality signal at the earliest measured checkpoint. Training erases it in the (24L, 16H) class while the healthy configuration recovers, at matched final perplexity. The erasure is selective: the output-independent fraction drops from 36% to 3% in the collapsed configuration while growing from 33% to 49% in the healthy one. Perplexity improves smoothly through the observability dip. The collapse reproduces observationally across families at recipe-dependent configurations: Qwen 2.5 and Llama differ by 2.9x at matched 3B scale with non-overlapping probe-seed distributions. Nonlinear probes and alternative layer choices do not recover healthy-range signal where collapse occurs. 
A WikiText-trained observer catches a non-overlapping downstream QA error class without task-specific training; at 20% flag rate, seven of nine model-task cells fall between 10.9% and 13.4%. Architecture selection is a monitoring decision.

v3.2.0 adds checkpoint dynamics (new experiment: 10 checkpoints for two matched-width Pythia configurations), a dynamics appendix table, Figure 5, three new related-work citations (Azaria & Mitchell 2023, Han et al. 2025, Goldowsky-Dill et al. 2025), a GPT-2 model citation, a corrected venue for Kossen et al. (ICLR 2025), and precision fixes throughout. 46 references, 264 macros, 8 tables, 8 figures, 32 pages.

Code: https://github.com/tmcarmichael/nn-observability
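
The "20% flag rate" in the abstract refers to flagging a fixed share of examples by observer score. The sketch below only illustrates that mechanic on made-up numbers; the abstract does not define the per-cell quantity it reports, so the recall metric here is an assumption, and `flag_top_fraction` and `error_recall` are hypothetical helper names rather than functions from the linked repository.

```python
import numpy as np

def flag_top_fraction(scores, flag_rate=0.20):
    """Return a boolean mask marking the highest-scoring flag_rate share of examples."""
    scores = np.asarray(scores, dtype=float)
    k = int(round(flag_rate * len(scores)))
    flagged = np.zeros(len(scores), dtype=bool)
    if k > 0:
        # Stable sort on negated scores: highest scores first, ties by position.
        top = np.argsort(-scores, kind="stable")[:k]
        flagged[top] = True
    return flagged

def error_recall(flagged, is_error):
    """Share of all errors that fall inside the flagged set (assumed metric)."""
    is_error = np.asarray(is_error, dtype=bool)
    n_err = int(is_error.sum())
    if n_err == 0:
        return float("nan")
    return float((flagged & is_error).sum() / n_err)

# Made-up observer scores for ten examples; higher means "more suspicious".
scores = [0.9, 0.1, 0.8, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.05]
# Made-up error labels: examples 0 and 5 are wrong answers.
is_error = [True, False, False, False, False, True, False, False, False, False]

flagged = flag_top_fraction(scores, flag_rate=0.20)
recall = error_recall(flagged, is_error)
```

Fixing the flag rate rather than a score threshold keeps the review budget comparable across models and tasks, which is presumably why results are reported at a single rate.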


Cite This Study

Thomas Carmichael (Sun,) studied this question.

synapsesocial.com/papers/69f04eb8727298f751e729a2 https://doi.org/10.5281/zenodo.19802197
