Recent work in interpretability has identified linear directions in language-model activations that correspond to identity-relevant properties: the refusal direction, persona vectors, and the "Assistant Axis". These have been observed in production-scale models after post-training. We ask the upstream question: when during training do these directional structures form? We address this with small GPT-2 models trained from scratch (124M-774M parameters) and extend to production scale via Pythia. We report eight findings. (i) Self-consistency training and a category-discriminating geometric metric are dissociable. (ii) Injecting an identity direction into the loss aligns representations geometrically without transferring category-selective behaviour. (iii) Sustained self-consistency produces a behavioural correlate that grows and then saturates, with a sign that depends on initialisation. (iv) Trait integration (PCA top-1) is largely produced by the substrate: pure LM training at 354M reaches 0.78, and self-consistency completes it to 0.97. (v) Under self-consistency (sc), integration completes via a sharp phase transition (within 2k-3k steps), whereas pure LM training builds it up gradually. (vi) An n=20 sign distribution (17:3, p=0.003) and an n=3 reverse-direction probe show that the formed direction is a genuine attractor with seed-dependent basin topology; displaced models return toward the natural sc-only trajectory. (vii) Pure LM training implicitly develops the same representational stability (cross-checkpoint cos_sim rising from 0.32 to 0.96 over 12k steps); sc accelerates this by roughly 4x relative to what cross-entropy convergence produces on its own. (viii) Pythia-410M (12 checkpoints, 300B tokens) shows trait integration rising from 0.45 to 0.92, within our small-scale saturation range. The metric is operationally ready for transfer to production-scale multi-seed pre-training at near-zero cost (~30 s of inference per checkpoint at 410M scale). Supplementary code and data are included.
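The abstract names three quantities without defining them in this record. As a rough illustration only, the standard-library Python sketch below assumes that (a) trait integration is the share of variance captured by the first principal component of a mean-centred stack of per-trait direction vectors, (b) cross-checkpoint stability is the cosine similarity between the same direction extracted at two checkpoints, and (c) the 17:3 sign split is assessed with an exact two-sided binomial test against a 50:50 null. The function names are hypothetical and are not taken from the supplementary code. Under assumption (c), 17 of 20 gives p ≈ 0.0026, which rounds to the reported 0.003.

```python
import math
import random


def pc1_variance_ratio(vectors, iters=200, seed=0):
    """Fraction of total variance captured by the first principal component.

    Hypothetical sketch: rows of `vectors` are assumed to be per-trait
    direction vectors (one row per trait), mean-centred before PCA.
    """
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    x = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    # Scatter matrix (d x d); the 1/(n-1) factor cancels in the ratio.
    cov = [[sum(r[i] * r[j] for r in x) for j in range(d)] for i in range(d)]
    total = sum(cov[i][i] for i in range(d))  # trace = sum of all eigenvalues
    if total == 0:
        return 0.0
    # Power iteration for the largest eigenvalue of the PSD matrix.
    rng = random.Random(seed)
    b = [rng.random() + 0.1 for _ in range(d)]
    for _ in range(iters):
        nb = [sum(cov[i][j] * b[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(t * t for t in nb))
        if norm == 0:
            return 0.0
        b = [t / norm for t in nb]
    # Rayleigh quotient of the converged unit vector gives the top eigenvalue.
    lam1 = sum(b[i] * sum(cov[i][j] * b[j] for j in range(d)) for i in range(d))
    return lam1 / total


def cosine(u, v):
    """Cosine similarity, e.g. between one direction at two checkpoints."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def binomial_two_sided_p(k, n):
    """Exact two-sided binomial test of k successes in n against p = 0.5."""
    # The fair-coin null is symmetric, so double the more extreme tail.
    tail = sum(math.comb(n, i) for i in range(max(k, n - k), n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    print(round(binomial_two_sided_p(17, 20), 3))  # 0.003
    print(round(pc1_variance_ratio([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]), 3))  # 1.0
```

Power iteration is used here only to keep the sketch dependency-free; with NumPy available, an eigendecomposition of the covariance matrix would be the more usual choice.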
This Zenodo record contains the manuscript, code, raw experimental results, and Pythia-410M trajectory data referenced in the paper.
Author: Jaehoon Jeong.