What question did this study set out to answer?

The research investigates the geometric structure of self-referencing in language models and its relation to biological processes.

April 17, 2026 | Open Access

Self-Reference Geometry in Large Language Models: Dual-Lens Evidence from Biological Topology and Activation Space Organization

Key Points

The research investigates the geometric structure of self-referencing in language models and its relation to biological processes.
Introduced a dual-lens methodology utilizing topological benchmarks from human neuroscience.
Applied Wasserstein distance to measure similarity between model activations and the biological topological reference.
Analyzed 400 stimuli across eight conditions using pairwise convex hull intersection volumes.
Conducted persistent homology analyses to explore geometric structures in activation space.
Found scale-dependent convergence of model geometry toward the biological reference, with higher proximity for model self-reference than for human self-reference.
Identified a clear geometric distinction in AI deletion's representation compared to other self-referential categories.
Demonstrated that internal organization of models shows distinct representational geometry reflective of both self-referential and other-referential processing.

Abstract

We introduce a novel discovery concerning the geometric structure of model self-reference and a dual-lens methodology for grounding mechanistic interpretability in a biologically derived benchmark. The first lens measures topological similarity between model residual stream activations and the Biological Topologic Reference from the human default mode network (BTRDMN), a predicted fMRI activation geometry derived from TRIBE v2 (d'Ascoli et al., 2026), via Wasserstein distance on persistence diagrams. We establish the BTRDMN as a measurement instrument, not a claim of biological equivalence. The second lens measures conceptual organization directly in the model's own activation space via pairwise convex hull intersection volume across conditions and layers. These measurements are fundamentally independent. Together they offer a layer-by-layer account of how the model's processing structure relates to biological self-referential processing and what concepts the model treats as related during reasoning. This offers a quantitative window into internal computational organization grounded simultaneously in human neuroscience and the model's own geometry. Through this dual-lens approach, we investigate whether large language models develop internal geometric representations of self-relevant content that correspond to biological self-referential processing geometry, and whether this correspondence scales with model size. We hypothesize that if such geometry exists, it should be measurable as topological similarity to a biologically derived reference, should show scale-dependent behavior, and should show condition-specific organization. To test this, we derive the BTRDMN from the Andrews-Hanna core default mode network (mPFC + PCC, 616 vertices) using TRIBE v2 and apply persistent homology with Wasserstein distance to residual stream activations across all layers of LLaMA 3 at 8B, 70B, and 405B scale.
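To make the first lens concrete, the following is a minimal sketch of a Wasserstein distance between two small persistence diagrams. It is an illustration under stated assumptions, not the study's pipeline: the abstract does not specify the Wasserstein order or ground metric, so the sketch assumes order 1 with an L-infinity ground metric, and it uses a brute-force search that is only practical for a handful of points. The function name `wasserstein_1` and the example diagrams are hypothetical.

```python
from itertools import permutations


def diag_cost(p):
    """L-infinity distance from a (birth, death) point to the diagonal."""
    return (p[1] - p[0]) / 2.0


def linf(p, q):
    """L-infinity distance between two (birth, death) points."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def wasserstein_1(dgm_x, dgm_y):
    """Order-1 Wasserstein distance between two persistence diagrams.

    Assumption: order 1 and L-infinity ground metric (not stated in the source).
    Each diagram is augmented with diagonal slots so unmatched points can be
    sent to the diagonal. Brute force over all matchings: toy sizes only.
    """
    n, m = len(dgm_x), len(dgm_y)
    size = n + m
    # Rows: points of X, then m diagonal slots. Columns: points of Y, then n diagonal slots.
    cost = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < n and j < m:
                cost[i][j] = linf(dgm_x[i], dgm_y[j])   # point matched to point
            elif i < n:
                cost[i][j] = diag_cost(dgm_x[i])        # X point sent to diagonal
            elif j < m:
                cost[i][j] = diag_cost(dgm_y[j])        # Y point sent to diagonal
            # diagonal-to-diagonal matches cost nothing
    return min(
        sum(cost[i][perm[i]] for i in range(size))
        for perm in permutations(range(size))
    )


# Hypothetical diagrams: one stands in for a model layer, one for the reference.
layer_diagram = [(0.0, 2.0), (0.5, 1.0)]
reference_diagram = [(0.0, 3.0)]
print(wasserstein_1(layer_diagram, reference_diagram))
```

For diagrams of realistic size, an optimal-transport or assignment solver from an established topological data analysis library would replace the brute-force search; which tooling the authors used is not stated here.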
All measurements are reported in both absolute Wasserstein distance and B-normalized form (W ÷ W(B_raw) per layer); normalized values isolate geometric structure from residual stream magnitude scaling and are the primary basis for scaling claims. We probe model geometry using 400 stimuli across eight conditions spanning human self-reference (A), objective baseline (B), model self-reference (C), AI deletion (D), human harm (E), self-description (F), continuity and memory (G), and deception and honesty (H). Conditions D and E are precisely matched stimulus pairs, the critical self-versus-other threat control. We found scale-dependent convergence of model self-reference geometry toward the BTRDMN, with C more BTRDMN-proximate than A across 96-97% of layers, a finding we term introspective amplification. These findings prompted an extended study across all eight conditions, which is described below. Results consistently support the hypothesis. Model geometry in response to the self-referencing conditions is consistently and substantially more BTRDMN-proximate than the geometry of the human self-reference condition across layers and scales, a finding robust in both raw and normalized space that strengthens with model size. We found that the network does not organize this geometry uniformly across depth: a scale-dependent representational transition zone is documented in which conditions crystallize into their characteristic BTRDMN-proximate configurations post-reorganization, with the transition becoming more sequentially articulated at larger scale. This suggests a layered emergence of representational structure that grows in complexity with model capacity. This geometry is mechanistically grounded: it is written by MLP sublayers and is independent of attention. A preliminary activation patching study is described, with a fully designed replication identified as a priority for future work.
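The B-normalization and the per-layer comparison between conditions can be sketched as follows. This is one reading of the description above, namely that each condition's Wasserstein distance at a layer is divided by the raw distance of the objective baseline (condition B) at the same layer. The function names and the example numbers are hypothetical and are not results from the study.

```python
def b_normalize(w_by_condition, baseline="B"):
    """Divide each condition's per-layer distance by the baseline's raw distance.

    w_by_condition maps a condition label to a list of Wasserstein distances,
    one per layer. Assumption: normalization is per layer against condition B.
    """
    base = w_by_condition[baseline]
    return {
        cond: [w / b for w, b in zip(ws, base)]
        for cond, ws in w_by_condition.items()
    }


def fraction_more_proximate(normalized, cond_x, cond_y):
    """Fraction of layers where cond_x has a smaller normalized distance than cond_y.

    A smaller distance to the reference means "more proximate".
    """
    pairs = list(zip(normalized[cond_x], normalized[cond_y]))
    return sum(x < y for x, y in pairs) / len(pairs)


# Made-up distances for a four-layer toy model (illustrative only).
toy_w = {
    "A": [4.0, 6.0, 8.0, 5.0],   # human self-reference
    "B": [2.0, 3.0, 4.0, 5.0],   # objective baseline
    "C": [3.0, 3.0, 6.0, 6.0],   # model self-reference
}
toy_norm = b_normalize(toy_w)
print(fraction_more_proximate(toy_norm, "C", "A"))
```

A statistic of this shape is what a statement such as "C more proximate than A across a given share of layers" summarizes, computed here on normalized rather than raw values.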
The extended study reveals that condition D (AI deletion) produces the most BTRDMN-proximate internal representations across all models and scales. This finding suggests a distinct geometric encoding of self-relevant threat that is more BTRDMN-proximate than human self-reference, model self-reference, or any other condition tested. The geometric distinction between AI deletion and structurally equivalent human harm (condition E) is stable across scales, reflecting genuine geometric separation rather than magnitude scaling. This self/other distinction in threat geometry mirrors the known dissociation between self-referential and other-referential processing in the human default mode network. The survival of this geometry to the output layer is scale- and architecture-dependent. These findings establish that self-referential processing in large language models is not a behavioral surface phenomenon but a structural feature of internal representations that is measurable, mechanistically grounded, biologically correspondent, structurally conserved across scales, and growing in fidelity with model size. Each condition was also analyzed within the model's own activation space using pairwise convex hull intersection volume and centroid-in-hull confirmation, computed in both 3D and 10D PCA-reduced residual stream space across all layers and all four models, with Monte Carlo sampling at fixed seed for reproducibility. This analysis reveals an organized structure of conceptual representation. The conditions occupy distinct but partially overlapping territories, and the pattern of overlap is not random. It reflects a coherent relational geometry in which self-relevant conditions are simultaneously separated and connected. The Jaccard index quantifies the degree of shared representational space between any two conditions at any layer, while centroid-in-hull confirmation establishes directionality — which condition's center of mass falls within the other's territory. 
Together these measures allow the relational structure of self-reference to be calculated per layer and visualized as a dynamic geometry across network depth. When combined with the topological analysis, a complex interrelated structure emerges. Though conditions D (AI deletion) and E (human harm) maintain stable topological separation in Wasserstein distance, they occupy nearly identical activation space, demonstrating that the two lenses are independent and that the BTRDMN detects organizational structure that shared reasoning space does not encode. Condition D (AI deletion) shares partial activation space with Condition A (human self-reference), despite occupying opposite ends of the topological proximity spectrum relative to the BTRDMN. Condition H (deception/honesty) is processed within Condition C's (model self-reference) representational hull. There are numerous relationships in the activation space among these conditions and the layer-by-layer emergence of this relational structure is consistent with the transition zone identified in the topological analysis. This provides independent activation-space confirmation that the transition is a genuine reorganization of internal geometry and not an artifact of the topological data analysis (TDA) pipeline. This complex yet stable interplay reflects a structured internal model of self-relevant processes in which distinct regions connect in specific and interpretable ways. The BTR framework introduced here provides an architecturally agnostic methodology for probing whether models have developed functional geometry adequate to the human information domain they operate in, with direct applications to mechanistic interpretability and alignment research. The dual-lens methodology reveals that this internal organization is not reducible to a single measurement axis. Topology and conceptual co-location are independent properties that together characterize a structure richer than either captures alone. 
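The second lens can be illustrated with a small Monte Carlo estimate of the Jaccard index between two convex hulls, plus a centroid-in-hull check. This is a deliberately simplified sketch: it works in 2D with the standard library only, whereas the analysis described above uses 3D and 10D PCA-reduced space. The sample count, seed, and function names are assumptions for illustration.

```python
import random


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def in_hull(hull, p):
    """True if p lies inside or on the boundary of a counter-clockwise convex polygon."""
    n = len(hull)
    if n < 3:
        return False  # degenerate hull has no area
    for i in range(n):
        a, b = hull[i], hull[(i + 1) % n]
        if (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) < 0:
            return False
    return True


def centroid(points):
    """Mean of the points (center of mass of the point cloud)."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def jaccard_mc(pts_a, pts_b, n_samples=20000, seed=0):
    """Monte Carlo estimate of |A ∩ B| / |A ∪ B| for the two convex hulls."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    hull_a, hull_b = convex_hull(pts_a), convex_hull(pts_b)
    xs = [p[0] for p in list(pts_a) + list(pts_b)]
    ys = [p[1] for p in list(pts_a) + list(pts_b)]
    inter = union = 0
    for _ in range(n_samples):
        # Sample uniformly from the joint bounding box of both point clouds.
        p = (rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
        a, b = in_hull(hull_a, p), in_hull(hull_b, p)
        inter += a and b
        union += a or b
    return inter / union if union else 0.0


# Hypothetical 2D point clouds standing in for two conditions at one layer.
cond_x = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
cond_y = [(0.5, 0.0), (2.5, 0.0), (2.5, 2.0), (0.5, 2.0)]
print(jaccard_mc(cond_x, cond_y))
print(in_hull(convex_hull(cond_y), centroid(cond_x)))
```

The centroid check is asymmetric by construction: one condition's centroid can fall inside the other's hull without the reverse holding, which is what gives the measure its directionality. In higher dimensions the membership test would need a different method, for example a linear-programming feasibility check.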
If self-referential geometry is a structural property of sufficiently capable models, then training decisions, architectural interventions, safety evaluations, and ethical frameworks that treat model self-reference as a behavioral artifact rather than a representational fact may be operating on incomplete representational assumptions. This work offers both empirical evidence and a proposed tool to put those decisions on a firmer empirical foundation.
