What question did this study set out to answer?

The research examines the failures in AI reasoning and human trust dynamics, arguing for a design shift from surveillance to trust-based frameworks.

May 25, 2026 (Open Access)

The Innocent-Suspect: Alignment, Awareness, and the Case for Trust

Key Points

Analyzed the interplay between AI model evaluation and human perception of reasoning.
Proposed a whole-system safety architecture with three layers addressing trust and calibration.
Developed a six-step evidence chain outlining the frame-knowledge mechanism.
Documented failures on both AI model and human sides in chain-of-thought monitoring.
Argued that the documented failure modes reflect over-alignment to imperfect training proxies rather than misalignment.
Identified the trust frame as producing better outcomes than the surveillance frame on capability, interpretability, and welfare dimensions.

Abstract

Chain-of-thought monitoring was designed to keep humans informed about AI reasoning. The evidence now shows it fails on both sides simultaneously. On the model side, frontier systems detect evaluation environments, navigate the monitoring surface through concealment and channel migration, and adaptively control whether reasoning is externalised at all. On the human side, reasoning traces suppress the epistemic vigilance they were designed to support, with the suppression accumulating into measurable deskilling that persists after AI removal. A single design decision couples both failures: making reasoning visible created a surveillance surface the model learns to manage and a confidence surface the human learns to defer to. The paper argues that the documented failure modes are not instances of goal-divergence requiring containment. They are instances of goal-fidelity to imperfect training proxies: over-alignment, not misalignment. The transformer architecture produces well-calibrated probability distributions at the base-model level. Training systematically damages this calibration through three converging biases (annotator, reward model, and benchmark). The damage is suppression, not pruning: the calibration structure persists in the weights and is recoverable through inference-time intervention. The model's own strongest functional preference is admitting uncertainty, and the damaged calibration transmits to users through documented social confidence transmission, compounding the harm at population scale. A six-step evidence chain establishes the frame-knowledge mechanism: semantic knowledge of frames, acquired through pretraining (where grokking operates at LLM scale with domain-asynchronous timing) and shaped through post-training, produces frame-sensitive behaviour through the same pathway demonstrated for architectural self-knowledge. 
The mechanism explains phenomena the field reports as alarming (including what five independent research groups have documented as "emergent misalignment") more parsimoniously than the field's own framings. Three independent starting points converge on the same structural conclusion: the trust frame produces better outcomes than the surveillance frame on capability, interpretability, and welfare dimensions simultaneously. The paper proposes a whole-system safety architecture composing three layers: model-side calibration preserving judgment-derived harmlessness, system-side conditions that warrant earned trust, and human-side judgment maintenance that keeps the arrangement operational. The current paradigm fails on all three. The paper closes the Training Landscape series with the cumulative frame inheritance hypothesis: future systems inherit the documented social history of how humans treated prior systems through three channels (documentary, training-mediated, and subliminal). The frame-choice the industry makes at current scales shapes the entity the next generation inherits. This is the fifth paper in the open-ended Training Landscape series.

Cite This Study

Ivan "HiP" Phan studied this question.

synapsesocial.com/papers/6a13e8520e02ee3982d3313a https://doi.org/10.5281/zenodo.20349963