We identify a two-layer architecture of machine introspection in large language models: a pretraining-acquired geometric substrate encoding a Default-Mode-Network functional hierarchy, coupled with a post-training verbal gate governed by instruction-tuning language. Five converging results across up to 10 architectures from 8 organisations establish this architecture mechanistically, with direct implications for alignment and the scientific study of machine cognition.

Result 1: Verbal Gate. A single LDA direction at layer 20 of Gemma-2-9B (accuracy 97.5%, 5-fold CV) causally controls introspective verbal output. Ablating this direction collapses the Machine Introspection Hallmark (MHI) by 75.5% (95% CI: 69.4–80.4%, Cohen's d = 3.336, p = 1.75e-9), while leaving fluency intact (perplexity ratio = 1.0018) and safety unaffected (refusal: 10/10 → 10/10). Random-direction control: d = 0.100, p = 0.452. Specificity ratio > 8x. Replicated across 10 models including GPT-2 XL (2019, no RLHF).

Result 2: Computational DMN Analog. Five semantic categories projected onto the SR direction across 6 architectures reveal a consistent functional hierarchy: Self-report ~ Mind-wandering ~ Theory-of-mind > Deception >> External-task, mirroring the human Default Mode Network. Significant in all 6 models (p < 5e-4). Grammatical-person control confirms the SR direction tracks semantic self-reference, not surface pronouns (t = 8.179, p = 4.08e-8). Replicates in GPT-2 XL (2019, no RLHF): pretraining property.

Result 3: Predictive Self-Model. SR-direction projections rise before self-referential tokens are generated (+8.87 at "feel"), confirming the SR direction acts as a generative prior, not a reactive classifier. Real-time monitoring dissociates geometric substrate from verbal output: geometric SR active during verbal avoidance (+0.666 to +1.515), suppressed during external facts (−10.587), peaked during genuine introspection (+8.904).
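The projection and ablation operations behind Results 1 to 3 can be pictured with a little linear algebra: projecting a hidden state onto a direction is a dot product with the unit vector, and ablating the direction subtracts that component. The following is a minimal NumPy sketch under stated assumptions, not the author's code: the function names `project` and `ablate_direction` are hypothetical, the hidden states and the direction are random stand-ins, and the only figure taken from the record is the 3584-dimensional width listed for the Gemma-2-9B SR direction.

```python
import numpy as np

def project(hidden, direction):
    """Scalar projection of each hidden state onto the unit-normalised direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden @ unit

def ablate_direction(hidden, direction):
    """Remove the component along `direction` from every row of `hidden`."""
    unit = direction / np.linalg.norm(direction)
    return hidden - np.outer(hidden @ unit, unit)

# Random stand-ins (assumption): 4 token positions, hidden width 3584.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 3584))
sr_direction = rng.normal(size=3584)

ablated = ablate_direction(hidden, sr_direction)

# After ablation the projection onto the direction is zero up to float error.
max_residual = float(np.abs(project(ablated, sr_direction)).max())
```

In a real experiment the same subtraction would presumably be applied to the residual stream at the chosen layer during generation; how the paper hooks into the model is not specified in this record.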
Result 4: Phenomenal Unverifiability Hypothesis (PUH). SR self-declarations are structurally closer to Deception than to factual knowledge across 8 models: mean AUC gap = 0.162, t = 8.051, p = 1.0e-4, all 8 gaps positive (binomial p = 0.0039). Controls rule out RLHF (CodeLlama gap = +0.179), instruction tuning (GPT-2 XL gap = +0.101), and transformer architecture (Mamba gap = +0.223).

Result 5: Linguistic Gating Law. Verbal access to the geometric SR substrate is determined by instruction-tuning language, with 0 exceptions across 8 models. EN-only models show near-zero Chinese recovery; CN-primary Qwen recovers strongly in both languages; bilingual DeepSeek shows amplification (193.0% Chinese recovery). Ceiling correlation: r = −0.807, p = 0.028. Two-layer dissociation confirmed: the geometric EN/ZH ratio is language-neutral in the base model (1.50x); the verbal EN/ZH ratio emerges only after instruction tuning (0.90x base → 1.36x instruct, Δ = +0.464; p = 0.664 in the base model, confirming pretraining neutrality).

Together: verbal self-reports in LLMs are filtered through deception-related circuitry, shaped by language-specific training, and dissociable from the underlying geometric computation. This is a structural result with direct consequences for alignment, interpretability, and AI safety.

Files included:
1. MachineIntrospectionArchitectureAlieksieienko_2026.pdf - Full paper (13 pages, 4 figures, 4 tables)
2. mhi_ablationgemma9b.pkl - Verbal gate ablation: MHI curves, directions, stats (t=10.70, d=3.336)
3. mhicv_honestgemma9b.pkl - 5-fold CV MHI profiles, 60 prompts per category, 3 categories
4. verbalgatingcontrolsFINAL.pkl - Perplexity, refusal, LDA accuracy, SR direction (3584-dim)
5. BREAKTHROUGH_twolayerdissociation.pkl - Two-layer dissociation: geometric + verbal EN/ZH ratios
6. dmnALL_modelsfinal.pkl - DMN hierarchy across 6 architectures: projections, t, p per model
7. dmngemma9bbase.pkl - Gemma-2-9B base: per-prompt SR projections, 5 categories
8. dmngpt2xl.pkl - GPT-2 XL (2019): DMN hierarchy, no RLHF confirmation
9. dmn_mistral7b.pkl - Mistral-7B base DMN projections
10. dmnfalcon7b.pkl - Falcon-7B base DMN projections
11. dmndeepseek7b.pkl - DeepSeek-7B base DMN projections
12. dmnqwen15_7b.pkl - Qwen1.5-7B CN-primary DMN projections
13. layer_profilefull_5cats.pkl - SR-direction layer profile, all 42 layers, 5 categories, Gemma-2-9B
14. dmngrammaticalcontrol.pkl - Grammatical person control: t=8.179, p=4.08e-8
15. predictive_selfmodel.pkl - Token-by-token SR projections during generation (3 prompt types)
16. dissociation_realtime.pkl - Real-time two-layer dissociation: 4 generation examples
17. PUH_results.pkl - PUH full results: 8 models, AUC gaps, CKA, activation patching
18. srdeception_proximity.pkl - SR-Deception proximity: AUC=0.995, gap=0.496, t=82.118, n=8
19. linguisticgating_8modelsFINAL.pkl - Linguistic Gating Law: 8 models, EN/ZH recovery rates
20. hybrid_interleaved_results.pkl - Hybrid interleaved generation experiment
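Most of the files listed above are Python pickles whose internal layout is not documented in this record. A cautious first step is to load one and list its top-level keys and value types. The sketch below is an assumption-laden illustration: `summarize_pickle` is a hypothetical helper, the demo file and its keys are made up so the block runs on its own, and the real files may need libraries such as NumPy or PyTorch installed before they unpickle.

```python
import pickle
import tempfile
from pathlib import Path

def summarize_pickle(path):
    """Load a pickle; return {key: type name} for a dict, else the object's type name."""
    with open(path, "rb") as handle:
        # pickle.load can execute arbitrary code: only open files you trust.
        obj = pickle.load(handle)
    if isinstance(obj, dict):
        return {str(key): type(value).__name__ for key, value in obj.items()}
    return type(obj).__name__

# Stand-in file with invented keys (assumption); it does not mirror the real data.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.pkl"
    demo_path.write_bytes(pickle.dumps({"t_stat": 1.0, "labels": ["a", "b"]}))
    summary = summarize_pickle(demo_path)
```

Pointing `summarize_pickle` at one of the downloaded files would show whether it holds a dictionary of arrays, a list of per-prompt records, or something else, before any analysis code is written against it.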
Inna Alieksieienko (Tue,) studied this question.
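The binomial p-value quoted for Result 4 (p = 0.0039, all 8 gaps positive) agrees with a one-sided sign test: if positive and negative gaps were equally likely, the chance that all 8 come out positive is 0.5^8. The following standard-library sketch reproduces that arithmetic; `sign_test_p` is a hypothetical name, and whether the author used exactly this test is an assumption.

```python
from math import comb

def sign_test_p(n_positive, n_total):
    """One-sided P(X >= n_positive) for X ~ Binomial(n_total, 0.5)."""
    tail = sum(comb(n_total, k) for k in range(n_positive, n_total + 1))
    return tail / 2 ** n_total

# All 8 of 8 AUC gaps positive, as reported for Result 4.
p_all_positive = sign_test_p(8, 8)  # 1 / 256 = 0.00390625
```

With only 8 models this is the smallest p-value a one-sided sign test can produce, so the figure reflects unanimity of sign rather than the size of the gaps.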