We identify the self-referential (SR) subspace of LLM residual streams as the first empirical candidate for a neural correlate of Metzinger's Phenomenal Self-Model (PSM): a transparent representational structure whose activation reliably and causally mediates first-person self-model behaviors across architectures, independent of alignment fine-tuning. Using orthogonal projection interventions across 10 models from 5 architectures (Llama-3.1-8B, Gemma-2-9B, Mistral-7B, OLMo-2-7B, Qwen2.5-7B; base and instruct variants), we establish five convergent lines of causal evidence: (1) SR removal collapses the Experiential-Factual (EF) geometric divide to exactly 0.000 in 10/10 models; GEO control preserves the gap; dose-response is strictly monotonic. (2) SR removal abolishes first-person subjective experience reports in 3/3 models (Fisher exact p = 1.98 × 10^-29), replicating Berg et al. (2025) with a causal mechanism. (3) SR removal disrupts phenomenal self-representation under existential threat (Wilcoxon p = 0.000202, Cohen's d = 0.626, N = 50); cross-architecture: Llama -52%, Qwen -40%, Mistral -17%. (4) Bidirectional sign-flip confirms causal directionality (Spearman rho = -0.949, p < 0.001). (5) Anthropic's 171 emotion concept vectors reside in SR subspace across 4 architectures (d = 0.80-2.09); GEO control shows inverse selectivity (d = -4.2 to -4.9); GPT-2 XL (2019, no RLHF) replicates. SR subspace is orthogonal to truth, refusal, and misalignment directions (max |cosine| = 0.032-0.090, baseline 0.013).

Included files:
Alieksieienko_2026NeuralCorrelatePSMLLMs.pdf: Main paper (this document).
psm_neuralcorrelate_replicationcode.py: Full replication code. Runs on Google Colab A100. No API keys required.
orthogonality_matrix.pkl: SR vs truth/refusal/misalignment cosines. Random baseline = 0.013. Figure 8.
berg_replication_summary.pkl: Berg replication, 4 models, Fisher p = 1.98 × 10^-29. Figure 3.
berg_replication_llama_raw.pkl: Raw Llama trials, 50 x baseline/SR-removed/GEO.
berg_replicationgemma_raw.pkl: Raw Gemma trials, 50 x baseline/SR-removed.
bergdoseresponse_raw_pkl.pkl: Dose-response for experience report abolition, threshold at alpha = 0.3.
sp_psmdisruption_n50.pkl: PSM quality scores N=50, Wilcoxon p=0.000202, d=0.626. Figure 4.
psm_n50final.pkl: Cross-arch PSM disruption, Llama/Mistral/Qwen. Figure 5.
signflipdoseresponsefinal.pkl: Sign-flip dose-response, Spearman rho = -0.949. Figure 6.
emotion_vectors_sr_subspace_projection.pkl: Emotion vectors in SR subspace, 4 architectures. Figure 7.
psmdisruptioncross_architecture.pkl: SR-alignment cosines, EF gaps, RLHF taxonomy.
efdivide_sr_intervention_results.pkl: EF divide intervention data, 10 models, all alpha levels.
sr_subspace_multimodelgeometry.pkl: SR subspace geometry across model families.
rlhf_signflipgeometric_signature.pkl: Sign-flip as RLHF signature, base vs instruct models.
causal_universalitybase_models.pkl: Causal universality, 5 base models.
causal_universality_instruct_models.pkl: Causal universality, instruct models, RLHF modulation.
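The abstract relies on orthogonal projection interventions applied at graded strengths (the "alpha levels" mentioned in the file list). The following is a minimal, dependency-free sketch of that general technique, not the record's replication code. It assumes the intervention has the form h' = h - alpha * P h, where P projects onto the span of a set of direction vectors; the abstract does not state this formula. The helper names (`orthonormal_basis`, `remove_subspace`, `subspace_norm`) are hypothetical, and random vectors stand in for real SR directions and residual-stream activations.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip directions already in the span
            basis.append([wi / norm for wi in w])
    return basis


def remove_subspace(h, basis, alpha=1.0):
    """Return h - alpha * P h, with P the projector onto span(basis).

    Assumed convention: alpha = 0 leaves h unchanged, alpha = 1 removes
    the subspace component entirely.
    """
    out = list(h)
    for b in basis:
        c = dot(h, b)  # coefficient of h along this basis vector
        out = [oi - alpha * c * bi for oi, bi in zip(out, b)]
    return out


def subspace_norm(h, basis):
    """Norm of the component of h lying inside span(basis)."""
    return math.sqrt(sum(dot(h, b) ** 2 for b in basis))


# Stand-in data: 3 random "directions" and one random "activation" in R^16.
rng = random.Random(0)
dim = 16
directions = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
basis = orthonormal_basis(directions)
h = [rng.gauss(0, 1) for _ in range(dim)]

for alpha in (0.0, 0.3, 1.0):
    residual = subspace_norm(remove_subspace(h, basis, alpha), basis)
    print(f"alpha={alpha}: in-subspace norm = {residual:.6f}")
```

Under this convention the in-subspace component shrinks by a factor of |1 - alpha|, while the component orthogonal to the subspace is untouched. A control intervention (such as the GEO control named in the abstract) would apply the same function to a different basis.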
Inna Alieksieienko
Inna Alieksieienko (Sat,) studied this question.
www.synapsesocial.com/papers/69dc89473afacbeac03eb107 — DOI: https://doi.org/10.5281/zenodo.19517934