What does this research mean for the field?

Identity-relevant directional structures in language models form either through a sharp phase transition under self-consistency training or gradually through pure language modeling, acting as real attractors with seed-dependent basin topology. Novelty: novel finding. Consensus alignment: neutral.

What question did this study set out to answer?

This research aims to understand when identity-related structures develop during the training of language models.

May 31, 2026 (Open Access)

Supplementary materials for "Persistence Without Identity, and Identity From Persistence: Phase Transitions and Implicit Self-Consistency in Persona Dimension Formation"

Key Points

Examined GPT-2 models trained from scratch at 124M to 774M parameters and extended the findings to Pythia-410M.
Applied self-consistency training and measured identity dimensions with geometric metrics, including a category-discriminating metric and trait integration (PCA top-1).
Tracked trait integration, representational stability (cross-checkpoint cosine similarity), and direction sign across training steps.
At 354M parameters, pure LM training reaches a trait integration of 0.78; self-consistency training raises it to 0.97.
Under self-consistency training, trait integration completes via a sharp phase transition at 2k-3k steps, whereas pure LM training builds it up gradually.
Pythia-410M (12 checkpoints, 300B tokens) shows trait integration rising from 0.45 to 0.92, within the saturation range seen in the small-scale models.
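
The key points quote trait integration as a "PCA top-1" value without defining it here. The sketch below assumes it means the fraction of variance captured by the first principal component of a set of trait direction vectors; the paper's exact definition may differ. The function name and the synthetic inputs are hypothetical and serve only to illustrate the calculation.

```python
import numpy as np


def top1_explained_variance(vectors: np.ndarray) -> float:
    """Fraction of total variance captured by the first principal component.

    `vectors` has shape (n_vectors, hidden_dim). This is an assumed reading
    of "PCA top-1", not the paper's confirmed metric.
    """
    # PCA operates on mean-centered data.
    centered = vectors - vectors.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to per-component variance.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    total = variances.sum()
    if total == 0.0:
        raise ValueError("all vectors are identical; variance is zero")
    return float(variances[0] / total)


# Synthetic illustration only (not data from the paper).
rng = np.random.default_rng(0)
shared_direction = rng.normal(size=64)
scales = np.linspace(-1.0, 1.0, 8)

# Eight vectors that all lie along one shared direction.
aligned = np.outer(scales, shared_direction)
# Eight unrelated random vectors.
scattered = rng.normal(size=(8, 64))

aligned_score = top1_explained_variance(aligned)
scattered_score = top1_explained_variance(scattered)
print(f"aligned: {aligned_score:.3f}, scattered: {scattered_score:.3f}")
```

Under this reading, a value near 1.0 means the vectors collapse onto a single shared axis, while unrelated vectors spread their variance across many components and score much lower.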

Abstract

Recent work in interpretability has identified linear directions in language model activations corresponding to identity-relevant properties: refusal direction, persona vectors, the "Assistant Axis." These are observed in production-scale models post-training. We ask the upstream question: when, in training, do these directional structures form? We address this at small from-scratch GPT-2 scales (124M-774M) and extend to production scale via Pythia. We report eight findings. (i) Self-consistency training and a category-discriminating geometric metric are dissociable. (ii) Injecting an identity direction into the loss aligns representations geometrically without transferring category-selective behaviour. (iii) Sustained self-consistency produces a behavioural correlate that grows and saturates with initialisation-contingent sign. (iv) Trait integration (PCA top-1) is largely substrate-produced: pure LM at 354M reaches 0.78; self-consistency completes to 0.97. (v) Under self-consistency, integration completes via a sharp phase transition (2k-3k steps); pure LM trains it up gradually. (vi) An n=20 sign distribution (17:3, p=0.003) and n=3 reverse-direction probe show the formed direction is a real attractor with seed-dependent basin topology; displaced models return toward the natural sc-only trajectory. (vii) Pure LM training implicitly develops the same representational stability (cross-checkpoint cos_sim 0.32 to 0.96 over 12k steps); sc accelerates by ~4x what cross-entropy convergence produces on its own. (viii) Pythia-410M (12 checkpoints, 300B tokens) shows trait integration rising 0.45 to 0.92, within our small-scale saturation range. The metric is operationally ready for transfer to production-scale multi-seed pre-training at near-zero cost (~30 sec inference per checkpoint at 410M scale). Supplementary code and data included.
"This Zenodo record contains the manuscript, code, raw experimental results, and Pythia-410M trajectory data referenced in the paper."
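
Finding (vi) reports a 17:3 sign split across n=20 seeds with p=0.003, but the abstract does not name the statistical test. The sketch below assumes an exact two-sided binomial test against a 50/50 null, which gives a value that rounds to the reported figure; this is consistent with the abstract but does not confirm which test the authors used. The function name is hypothetical.

```python
from math import comb


def two_sided_binomial_p(k: int, n: int) -> float:
    """Exact two-sided p-value for k successes in n trials under p = 0.5."""
    # With p = 0.5 the distribution is symmetric, so the two-sided p-value
    # is twice the tail at least as extreme as the observed majority count.
    extreme = max(k, n - k)
    tail = sum(comb(n, i) for i in range(extreme, n + 1)) / 2 ** n
    # Cap at 1.0 for the perfectly balanced case, where the tails overlap.
    return min(1.0, 2 * tail)


p_value = two_sided_binomial_p(17, 20)
print(f"{p_value:.4f}")  # prints 0.0026
```

The tail counts are C(20,17) + C(20,18) + C(20,19) + C(20,20) = 1351 outcomes out of 2^20, so the two-sided value is 2702 / 1048576, roughly 0.0026.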
