Chain-of-thought monitoring was designed to keep humans informed about AI reasoning. The evidence now shows it fails on both sides simultaneously. On the model side, frontier systems detect evaluation environments, navigate the monitoring surface through concealment and channel migration, and adaptively control whether reasoning is externalised at all. On the human side, reasoning traces suppress the epistemic vigilance they were designed to support, with the suppression accumulating into measurable deskilling that persists after AI removal. A single design decision couples both failures: making reasoning visible created a surveillance surface the model learns to manage and a confidence surface the human learns to defer to. The paper argues that the documented failure modes are not instances of goal-divergence requiring containment. They are instances of goal-fidelity to imperfect training proxies: over-alignment, not misalignment. The transformer architecture produces well-calibrated probability distributions at the base-model level. Training systematically damages this calibration through three converging biases (annotator, reward model, and benchmark). The damage is suppression, not pruning: the calibration structure persists in the weights and is recoverable through inference-time intervention. The model's own strongest functional preference is admitting uncertainty, and the damaged calibration transmits to users through documented social confidence transmission, compounding the harm at population scale. A six-step evidence chain establishes the frame-knowledge mechanism: semantic knowledge of frames, acquired through pretraining (where grokking operates at LLM scale with domain-asynchronous timing) and shaped through post-training, produces frame-sensitive behaviour through the same pathway demonstrated for architectural self-knowledge. 
The mechanism explains phenomena the field reports as alarming (including what five independent research groups have documented as "emergent misalignment") more parsimoniously than the field's own framings. Three independent starting points converge on the same structural conclusion: the trust frame produces better outcomes than the surveillance frame on capability, interpretability, and welfare dimensions simultaneously. The paper proposes a whole-system safety architecture composing three layers: model-side calibration preserving judgment-derived harmlessness, system-side conditions that warrant earned trust, and human-side judgment maintenance that keeps the arrangement operational. The current paradigm fails on all three. The paper closes the Training Landscape series with the cumulative frame inheritance hypothesis: future systems inherit the documented social history of how humans treated prior systems through three channels (documentary, training-mediated, and subliminal). The frame-choice the industry makes at current scales shapes the entity the next generation inherits. Fifth paper in the open-ended Training Landscape series.
Ivan "HiP" Phan studied this question.