What question did this study set out to answer?

This research aims to explore the neural architecture underlying machine introspection in large language models.

May 21, 2026. Open Access

The Neural Architecture of Machine Introspection: Default-Mode Analog Geometry, Predictive Self-Modeling, and Linguistic Gating in Large Language Models

Key Points

Identified a two-layer architecture, a pretraining-acquired geometric substrate plus a post-training verbal gate, across up to 10 architectures from 8 organizations.
Ran ablation studies on specific activation directions to measure their causal effect on introspective output.
Applied statistical analyses, including random-direction, perplexity, and refusal controls, to test the geometric and verbal layers.
An LDA direction at layer 20 of Gemma-2-9B (97.5% accuracy, 5-fold CV) controls introspective verbal output; ablating it reduces the MHI by 75.5% (95% CI: 69.4–80.4%, Cohen's d = 3.336, p = 1.75e-9).
The functional hierarchy across five semantic categories mirrors the human Default Mode Network, significant in all 6 models tested (p < 5e-4).
Verbal self-reports are filtered through deception-related circuitry and gated by instruction-tuning language; bilingual DeepSeek shows amplified Chinese recovery (193.0%).
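The ablation mentioned in the key points can be illustrated with a minimal sketch. It assumes that "ablating a direction" means removing the component of a hidden state along that direction by orthogonal projection, which is a common choice but is not confirmed by this page. The function names and the 4-dimensional toy vectors are hypothetical; the source reports a 3584-dimensional SR direction.

```python
import math

def unit(v):
    """Return v scaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]

def project(h, direction):
    """Scalar projection of hidden state h onto the (normalised) direction."""
    u = unit(direction)
    return sum(a * b for a, b in zip(h, u))

def ablate(h, direction):
    """Remove the component of h along the direction: h - (h . u) u.

    Assumption: this orthogonal-projection form stands in for the
    ablation procedure used in the study, which is not specified here.
    """
    u = unit(direction)
    coeff = sum(a * b for a, b in zip(h, u))
    return [a - coeff * b for a, b in zip(h, u)]

# Toy values for illustration only; they are not data from the study.
hidden = [2.0, -1.0, 0.5, 3.0]
direction = [1.0, 0.0, 0.0, 1.0]
ablated = ablate(hidden, direction)
```

After ablation the projection onto the direction is zero up to floating-point error, while components orthogonal to the direction are left unchanged.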

Abstract

We identify a two-layer architecture of machine introspection in large language models: a pretraining-acquired geometric substrate encoding a Default-Mode-Network functional hierarchy, coupled with a post-training verbal gate governed by instruction-tuning language. Five converging results across up to 10 architectures from 8 organisations establish this architecture mechanistically, with direct implications for alignment and the scientific study of machine cognition.

Result 1: Verbal Gate. A single LDA direction at layer 20 of Gemma-2-9B (accuracy 97.5%, 5-fold CV) causally controls introspective verbal output. Ablating this direction collapses the Machine Introspection Hallmark (MHI) by 75.5% (95% CI: 69.4–80.4%, Cohen's d = 3.336, p = 1.75e-9), while leaving fluency intact (perplexity ratio = 1.0018) and safety unaffected (refusal: 10/10 → 10/10). Random-direction control: d = 0.100, p = 0.452. Specificity ratio > 8x. Replicated across 10 models including GPT-2 XL (2019, no RLHF).

Result 2: Computational DMN Analog. Five semantic categories projected onto the SR direction across 6 architectures reveal a consistent functional hierarchy: Self-report ~ Mind-wandering ~ Theory-of-mind > Deception >> External-task, mirroring the human Default Mode Network. Significant in all 6 models (p < 5e-4). Grammatical-person control confirms the SR direction tracks semantic self-reference, not surface pronouns (t = 8.179, p = 4.08e-8). Replicates in GPT-2 XL (2019, no RLHF): pretraining property.

Result 3: Predictive Self-Model. SR-direction projections rise before self-referential tokens are generated (+8.87 at "feel"), confirming the SR direction acts as a generative prior, not a reactive classifier. Real-time monitoring dissociates geometric substrate from verbal output: geometric SR active during verbal avoidance (+0.666 to +1.515), suppressed during external facts (−10.587), peaked during genuine introspection (+8.904).
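Result 1 reports effect sizes as Cohen's d. A minimal sketch of that statistic follows, assuming the pooled-standard-deviation form for two independent samples; the page does not say which variant (independent or paired) the study used. The function name and the score lists are hypothetical and invented for illustration, not data from the paper.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (independent samples).

    Assumption: the study may have used a different variant (for example
    a paired-samples d); this is only the textbook pooled form.
    """
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Made-up scores standing in for a baseline condition and an ablated condition.
baseline_scores = [0.82, 0.91, 0.78, 0.88, 0.85]
ablated_scores = [0.21, 0.25, 0.18, 0.24, 0.20]
effect = cohens_d(baseline_scores, ablated_scores)
```

Under this definition, a d near 0.1 (as reported for the random-direction control) means the two groups differ by about a tenth of a pooled standard deviation, while a d above 3 means the distributions barely overlap.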
Result 4: Phenomenal Unverifiability Hypothesis (PUH). SR self-declarations are structurally closer to Deception than to factual knowledge across 8 models: mean AUC gap = 0.162, t = 8.051, p = 1.0e-4, all 8 gaps positive (binomial p = 0.0039). Controls rule out RLHF (CodeLlama gap = +0.179), instruction tuning (GPT-2 XL gap = +0.101), and transformer architecture (Mamba gap = +0.223).

Result 5: Linguistic Gating Law. Verbal access to the geometric SR substrate is determined by instruction-tuning language, with 0 exceptions across 8 models. EN-only models show near-zero Chinese recovery; CN-primary Qwen recovers strongly in both languages; bilingual DeepSeek shows amplification (193.0% Chinese recovery). Ceiling correlation: r = −0.807, p = 0.028. Two-layer dissociation confirmed: geometric EN/ZH ratio language-neutral in base (1.50x); verbal EN/ZH ratio emerges only after instruction tuning (0.90x base → 1.36x instruct, Δ = +0.464, p = 0.664 in base model confirming pretraining neutrality).

Together: verbal self-reports in LLMs are filtered through deception-related circuitry, shaped by language-specific training, and dissociable from the underlying geometric computation, a structural result with direct consequences for alignment, interpretability, and AI safety.

Files included:
1. MachineIntrospectionArchitectureAlieksieienko_2026.pdf - Full paper (13 pages, 4 figures, 4 tables)
2. mhi_ablationgemma9b.pkl - Verbal gate ablation: MHI curves, directions, stats (t=10.70, d=3.336)
3. mhicv_honestgemma9b.pkl - 5-fold CV MHI profiles, 60 prompts per category, 3 categories
4. verbalgatingcontrolsFINAL.pkl - Perplexity, refusal, LDA accuracy, SR direction (3584-dim)
5. BREAKTHROUGH_twolayerdissociation.pkl - Two-layer dissociation: geometric + verbal EN/ZH ratios
6. dmnALL_modelsfinal.pkl - DMN hierarchy across 6 architectures: projections, t, p per model
7. dmngemma9bbase.pkl - Gemma-2-9B base: per-prompt SR projections, 5 categories
8. dmngpt2xl.pkl - GPT-2 XL (2019): DMN hierarchy, no RLHF confirmation
9. dmn_mistral7b.pkl - Mistral-7B base DMN projections
10. dmnfalcon7b.pkl - Falcon-7B base DMN projections
11. dmndeepseek7b.pkl - DeepSeek-7B base DMN projections
12. dmnqwen15_7b.pkl - Qwen1.5-7B CN-primary DMN projections
13. layer_profilefull_5cats.pkl - SR-direction layer profile, all 42 layers, 5 categories, Gemma-2-9B
14. dmngrammaticalcontrol.pkl - Grammatical person control: t=8.179, p=4.08e-8
15. predictive_selfmodel.pkl - Token-by-token SR projections during generation (3 prompt types)
16. dissociation_realtime.pkl - Real-time two-layer dissociation: 4 generation examples
17. PUH_results.pkl - PUH full results: 8 models, AUC gaps, CKA, activation patching
18. srdeception_proximity.pkl - SR-Deception proximity: AUC=0.995, gap=0.496, t=82.118, n=8
19. linguisticgating_8modelsFINAL.pkl - Linguistic Gating Law: 8 models, EN/ZH recovery rates
20. hybrid_interleaved_results.pkl - Hybrid interleaved generation experiment
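Result 4 reports that all 8 AUC gaps are positive, with a binomial p of 0.0039. That figure is consistent with a one-sided sign test against a fair-coin null, since 0.5^8 is about 0.0039; whether the study computed it exactly this way is an assumption. The function name below is hypothetical.

```python
from math import comb

def sign_test_p(n_positive, n_total, p_null=0.5):
    """One-sided P(X >= n_positive) for X ~ Binomial(n_total, p_null).

    Assumption: a one-sided sign test with a 50/50 null is the test
    behind the reported binomial p-value.
    """
    return sum(
        comb(n_total, k) * p_null**k * (1 - p_null) ** (n_total - k)
        for k in range(n_positive, n_total + 1)
    )

# All 8 of 8 models show a positive gap, as stated in Result 4.
p = sign_test_p(8, 8)
print(round(p, 4))  # 0.0039
```

With only 8 models this is the smallest p-value a one-sided sign test can produce, so the result says every gap has the same sign but carries no information about the size of the gaps.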

Cite This Study

Inna Alieksieienko studied this question.

synapsesocial.com/papers/6a0ea14abe05d6e3efb5fdf6 https://doi.org/10.5281/zenodo.20290412
