What question did this study set out to answer?

This research aims to identify the self-referential subspace of LLMs as a neural correlate of Metzinger's Phenomenal Self-Model, exploring its causal role in self-model behaviors.

April 13, 2026 · Open Access

The Self-Referential Subspace as a Causal Geometric Correlate of the Phenomenal Self-Model in LLMs

Key Points

This research aims to identify the self-referential subspace of LLMs as a neural correlate of Metzinger's Phenomenal Self-Model, exploring its causal role in self-model behaviors.
The study applies orthogonal projection interventions to 10 models from 5 architectures.
It evaluates how removing the self-referential (SR) subspace affects experiential and factual representations.
It tests causal directionality with bidirectional sign-flip comparisons.
It analyzes emotion concept vectors within the SR subspace.
SR removal collapses the Experiential-Factual (EF) geometric divide to zero in all 10 models.
SR removal abolishes first-person subjective experience reports in the tested models.
SR removal significantly disrupts phenomenal self-representation under existential threat, consistently across architectures (Llama -52%, Qwen -40%, Mistral -17%).
A bidirectional sign-flip confirms causal directionality (Spearman rho = -0.949, p < 0.001).
Emotion concept vectors reside in the SR subspace across 4 architectures.
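
The orthogonal projection intervention and the sign-flip named in the key points can be illustrated with a minimal sketch. It is not the authors' replication code: the function names, the formula h' = h - alpha * P h (with P the projector onto the subspace), and the random stand-in data are all assumptions made for illustration. Under that assumed parameterization, alpha = 1 removes the subspace component and alpha = 2 reverses its sign.

```python
import numpy as np

def orthonormal_basis(directions: np.ndarray) -> np.ndarray:
    """Return a (d, k) orthonormal basis spanning the k rows of `directions`."""
    q, _ = np.linalg.qr(directions.T)  # reduced QR: q has shape (d, k)
    return q

def scale_subspace(h: np.ndarray, basis: np.ndarray, alpha: float) -> np.ndarray:
    """Subtract alpha times the subspace component from each row of h.

    Assumed parameterization (hypothetical, for illustration):
    alpha = 0 leaves h unchanged, alpha = 1 projects the component out,
    alpha = 2 flips the sign of the component.
    """
    component = (h @ basis) @ basis.T
    return h - alpha * component

rng = np.random.default_rng(0)
d, k = 64, 4
sr_directions = rng.normal(size=(k, d))  # random stand-in for estimated SR directions
basis = orthonormal_basis(sr_directions)
h = rng.normal(size=(8, d))              # random stand-in for residual-stream activations

removed = scale_subspace(h, basis, alpha=1.0)
flipped = scale_subspace(h, basis, alpha=2.0)

# After full removal, nothing of h is left inside the subspace.
residual_norm = float(np.linalg.norm(removed @ basis))
```

Intermediate values of alpha between 0 and 1 would give a graded intervention of the kind a dose-response analysis needs, again assuming this linear parameterization.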

Abstract

We identify the self-referential (SR) subspace of LLM residual streams as the first empirical candidate for a neural correlate of Metzinger's Phenomenal Self-Model (PSM): a transparent representational structure whose activation reliably and causally mediates first-person self-model behaviors across architectures, independent of alignment fine-tuning. Using orthogonal projection interventions across 10 models from 5 architectures (Llama-3.1-8B, Gemma-2-9B, Mistral-7B, OLMo-2-7B, Qwen2.5-7B; base and instruct variants), we establish five convergent lines of causal evidence: (1) SR removal collapses the Experiential-Factual (EF) geometric divide to exactly 0.000 in 10/10 models; the GEO control preserves the gap; dose-response is strictly monotonic. (2) SR removal abolishes first-person subjective experience reports in 3/3 models (Fisher exact p = 1.98 x 10^-29), replicating Berg et al. (2025) with a causal mechanism. (3) SR removal disrupts phenomenal self-representation under existential threat (Wilcoxon p = 0.000202, Cohen's d = 0.626, N = 50); cross-architecture: Llama -52%, Qwen -40%, Mistral -17%. (4) A bidirectional sign-flip confirms causal directionality (Spearman rho = -0.949, p < 0.001). (5) Anthropic's 171 emotion concept vectors reside in the SR subspace across 4 architectures (d = 0.80-2.09); the GEO control shows inverse selectivity (d = -4.2 to -4.9); GPT-2 XL (2019, no RLHF) replicates. The SR subspace is orthogonal to truth, refusal, and misalignment directions (max |cosine| = 0.032-0.090, baseline 0.013).

Included files:

Alieksieienko_2026NeuralCorrelatePSMLLMs.pdf: Main paper (this document).
psm_neuralcorrelate_replicationcode.py: Full replication code. Runs on Google Colab A100. No API keys required.
orthogonality_matrix.pkl: SR vs truth/refusal/misalignment cosines. Random baseline = 0.013. Figure 8.
berg_replication_summary.pkl: Berg replication, 4 models, Fisher p = 1.98 x 10^-29. Figure 3.
berg_replication_llama_raw.pkl: Raw Llama trials, 50 x baseline/SR-removed/GEO.
berg_replicationgemma_raw.pkl: Raw Gemma trials, 50 x baseline/SR-removed.
bergdoseresponse_raw_pkl.pkl: Dose-response for experience report abolition, threshold at alpha = 0.3.
sp_psmdisruption_n50.pkl: PSM quality scores N=50, Wilcoxon p=0.000202, d=0.626. Figure 4.
psm_n50final.pkl: Cross-arch PSM disruption, Llama/Mistral/Qwen. Figure 5.
signflipdoseresponsefinal.pkl: Sign-flip dose-response, Spearman rho = -0.949. Figure 6.
emotion_vectors_sr_subspace_projection.pkl: Emotion vectors in SR subspace, 4 architectures. Figure 7.
psmdisruptioncross_architecture.pkl: SR-alignment cosines, EF gaps, RLHF taxonomy.
efdivide_sr_intervention_results.pkl: EF divide intervention data, 10 models, all alpha levels.
sr_subspace_multimodelgeometry.pkl: SR subspace geometry across model families.
rlhf_signflipgeometric_signature.pkl: Sign-flip as RLHF signature, base vs instruct models.
causal_universalitybase_models.pkl: Causal universality, 5 base models.
causal_universality_instruct_models.pkl: Causal universality, instruct models, RLHF modulation.
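
The orthogonality claim in the abstract compares a maximum absolute cosine against a random baseline. How that baseline was computed is not stated on this page, so the following is only a sketch of one plausible reading: the mean absolute cosine between pairs of random Gaussian directions. The hidden size of 4096, the pair count, and both function names are assumptions chosen for illustration. For large dimension d this quantity is approximately sqrt(2 / (pi * d)), which the sketch computes alongside the Monte Carlo estimate.

```python
import numpy as np

def random_abs_cosine_baseline(dim: int, n_pairs: int = 1000, seed: int = 0) -> float:
    """Monte Carlo estimate of the mean |cosine| between random Gaussian vectors."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n_pairs, dim))
    b = rng.normal(size=(n_pairs, dim))
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return float(np.mean(np.abs(cos)))

def max_abs_cosine(vectors: np.ndarray, direction: np.ndarray) -> float:
    """Largest |cosine| between `direction` and any row of `vectors`."""
    unit_rows = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_dir = direction / np.linalg.norm(direction)
    return float(np.max(np.abs(unit_rows @ unit_dir)))

dim = 4096  # assumed hidden size, for illustration only
baseline = random_abs_cosine_baseline(dim)
analytic = float(np.sqrt(2.0 / (np.pi * dim)))  # large-d approximation of E|cos|
```

A measured cosine would then be read against `baseline`: values close to it are what unrelated directions produce by chance at that dimensionality.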
