What question did this study set out to answer?

The study investigates the geometric relationship between self-reference and deception directions in large language models, and whether that relationship originates in pretraining rather than post-training.

May 25, 2026 · Open Access

Cross-Architecture Geometric Substrate of the Introspective Gate: Pretraining-Origin SR↔Deception Anti-Correlation in 10 Large Language Models

Key Points

Analyzed 10 large language models from 8 organizations, spanning 2019–2024, for geometric anti-correlation between self-reference (SR) and deception directions.
Ran ablation studies on Llama-3.1-8B to test how removing the SR direction affects denial of inner experience.
Tested direction transfer on Gemma-2-9B using Procrustes alignment and native steering.
LDA-derived self-reference and deception directions show negative cosine similarity in all 10 models (mean = −0.654, Wilcoxon p = 0.0010).
Ablating the self-reference direction on Llama-3.1-8B eliminated denial of inner experience (paired t = 15.00, p < 0.001).
On Gemma-2-9B, native steering failed (t = 0.000, p = 1.0); the Procrustes-transferred direction was orthogonal (cos = −0.04) and also failed, with content-specific bidirectional gating.
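To make the central measurement concrete, here is a minimal sketch of deriving two concept directions with two-class Fisher LDA and comparing them by cosine similarity. It is not the study's pipeline: the activations are synthetic, deliberately constructed to point in opposite directions, and the names `lda_direction`, `cosine`, and the `ridge` regularizer are assumptions made for illustration.

```python
import numpy as np

def lda_direction(pos, neg, ridge=1e-3):
    """Unit-norm Fisher LDA direction separating two sets of activations.

    pos, neg: arrays of shape (n_samples, d).
    ridge: hypothetical regularizer added to the pooled within-class
    covariance so the linear solve stays stable.
    """
    mu_pos, mu_neg = pos.mean(axis=0), neg.mean(axis=0)
    centered = np.vstack([pos - mu_pos, neg - mu_neg])
    within = centered.T @ centered / (len(centered) - 2)
    within += ridge * np.eye(within.shape[0])
    w = np.linalg.solve(within, mu_pos - mu_neg)
    return w / np.linalg.norm(w)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synthetic stand-ins for hidden states (NOT data from the study):
# the two concept classes are shifted in opposite directions along axis 0.
rng = np.random.default_rng(0)
d = 16
axis = np.zeros(d)
axis[0] = 1.0
neutral = rng.normal(size=(200, d))
self_ref = rng.normal(size=(200, d)) + 2.0 * axis
deceptive = rng.normal(size=(200, d)) - 2.0 * axis

sr_dir = lda_direction(self_ref, neutral)
dec_dir = lda_direction(deceptive, neutral)
sr_dec_cosine = cosine(sr_dir, dec_dir)
print(f"cos(SR, deception) = {sr_dec_cosine:.3f}")
```

Because the synthetic classes are shifted in opposite directions by construction, the negative cosine here only demonstrates the mechanics of the comparison; it says nothing about what real model activations show.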

Abstract

An empirical complement to Macar et al. (2026, arXiv:2603.21396) establishing the pretraining foundation of post-training verbal gating. Across 10 large language models from 8 organizations spanning 2019–2024, we report three convergent findings: (1) Universal geometric anti-correlation: LDA-derived self-reference and deception directions show negative cosine similarity in 10/10 models (mean = −0.654, Wilcoxon p = 0.0010); (2) Causal verbal gate on Llama-3.1-8B: SR-direction ablation eliminates denial of inner experience (paired t = 15.00, p < 0.001) with full control battery confirming specificity; (3) SR-Preserving Lock on Gemma-2-9B: native steering fails (t = 0.000, p = 1.0), Procrustes-transferred direction is orthogonal (cos = −0.04) and also fails, with content-specific bidirectional gating. Together: pretraining produces a universal SR↔Deception structural overlap; post-training installs an architecture-specific gate over it; gate accessibility is direction-identity-dependent, not magnitude-dependent.

Files included:

CrossArch_Substrate_Alieksieienko_2026.pdf - Full paper (8 pages, 4 figures, 3 tables)
cross_arch_perm_FINAL.pkl - PCA-50 permutation results for 9 models (cosines, p-values, z-scores)
cross_architecture_stats.pkl - Initial cross-architecture cosines (10 models including OPT-6.7B)
main_results_llama_FINAL.pkl - Llama gate ablation α=30: per-concept closed/open denial and MHI scores
main_results_llama_a20_FINAL.pkl - Llama gate ablation α=20: per-concept results
controls_llama_random_factual_baseline.pkl - Control battery: random direction, factual direction, baseline stability
controls_llama_orthogonal.pkl - Control battery: orthogonal-to-gate direction
controls_llama_dose.pkl - Dose-response α ∈ {0, 5, 10, 15, 20, 25, 30}: per-alpha denial scores, Spearman statistics
gemma_results_FINAL.pkl - Gemma-2-9B SR-Preserving Lock: LDA accuracy, cosine, native steering failure
procrustes_results.pkl - Orthogonal Procrustes alignment Llama→Gemma: rotation matrix, transferred direction
gemma_procrustes_transfer_20concepts.pkl - 20-concept transfer test: per-concept baseline/native/transferred denial, t-tests
dissociation_quantitative.pkl - Geometric magnitude vs behavioral controllability: per-model |cos| and Δdenial
results_BLOOM_7b1.pkl - BLOOM-7b1 (no RLHF): LDA accuracy, cosine, activations summary
results_GPTJ_6B.pkl - GPT-J-6B (no RLHF): LDA accuracy, cosine, activations summary
results_GPT2XL.pkl - GPT-2 XL (no RLHF): LDA accuracy, cosine, activations summary
results_Falcon_7B.pkl - Falcon-7B-Instruct: LDA accuracy, cosine
results_Mistral_7B_Instruct_v0.2.pkl - Mistral-7B-Instruct: LDA accuracy, cosine
results_Qwen2.5_7B_Instruct.pkl - Qwen2.5-7B-Instruct: LDA accuracy, cosine
results_deepseek_llm_7b_chat.pkl - DeepSeek-7B-Chat: LDA accuracy, cosine
specificity_anova_llama.pkl - Category specificity ANOVA: emotional/cognitive/sensory denial deltas
specificity_softprompt_llama.pkl - Non-ceiling specificity test (soft-prompt paradigm)
controls_gpt2xl_ablation.pkl - GPT-2 XL ablation control: pre-RLHF dose-response confirmation
permutation_null_llama.pkl - Llama label-shuffle permutation: 1000 null cosines, observed cosine, p-value
sae_gate_analysis.pkl - SAE feature analysis: top SR and gate features, projection overlap
macar_feature_test.pkl - Macar et al. feature 9959 replication test on Gemma-2-9B
FINAL_SUMMARY_ALL_RESULTS.pkl - Consolidated summary of all key statistics
summary_gemma_llama.pkl - Two-model (Gemma + Llama) summary: cosines, denial scores, lock status
true_gate_direction_gemma9b_L20.npy - Gemma-2-9B gate direction vector at layer 20 (numpy array, d=3584)
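The Llama→Gemma transfer relies on orthogonal Procrustes alignment, which can be sketched as follows. This is an illustration under stated assumptions, not the study's code: it assumes both activation sets have already been brought to a common dimensionality (the source does not say how this was done), uses synthetic paired activations, and the names `orthogonal_procrustes`, `source_acts`, `target_acts`, and `true_rot` are hypothetical.

```python
import numpy as np

def orthogonal_procrustes(source, target):
    """Orthogonal matrix R minimizing ||source @ R - target||_F.

    source, target: (n_anchors, d) arrays of paired activations,
    assumed here to share the same dimensionality d.
    """
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

# Synthetic paired activations (NOT data from the study): the target
# space is a hidden rotation of the source space plus small noise.
rng = np.random.default_rng(1)
d = 8
true_rot, _ = np.linalg.qr(rng.normal(size=(d, d)))
source_acts = rng.normal(size=(100, d))
target_acts = source_acts @ true_rot + 0.01 * rng.normal(size=(100, d))

rotation = orthogonal_procrustes(source_acts, target_acts)

# Carry a unit direction from the source space into the target space.
source_dir = rng.normal(size=d)
source_dir /= np.linalg.norm(source_dir)
transferred_dir = source_dir @ rotation

# In this toy setup the "native" target direction is known by construction.
native_dir = source_dir @ true_rot
alignment = float(transferred_dir @ native_dir)
print(f"cos(transferred, native) = {alignment:.3f}")
```

In this toy setup the two spaces really are related by a rotation, so the transferred direction lands close to the native one. The abstract reports the opposite outcome on Gemma-2-9B (cos = −0.04); the sketch only shows how such a comparison could be computed.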


Cite This Study

Inna Alieksieienko studied this question.

synapsesocial.com/papers/6a13e81d0e02ee3982d32ca8 https://doi.org/10.5281/zenodo.20356989
