This paper studies the emergence of interpretable categorical features within large language models (LLMs), analyzing their behavior across training checkpoints (time), transformer layers (space), and varying model sizes (scale). Using sparse autoencoders for mechanistic interpretability, we identify when and where specific semantic concepts emerge within neural activations. Results indicate clear temporal and scale-specific thresholds for feature emergence across multiple domains. Notably, spatial analysis reveals unexpected semantic reactivation, with early-layer features re-emerging at later layers, challenging standard assumptions about representational dynamics in transformer models.
Shashata Sawmya
Micah Adler
Nir Shavit
Sawmya et al. (Sun,) studied this question.
www.synapsesocial.com/papers/68da5a3ec1728099cfd11966 — DOI: https://doi.org/10.48550/arxiv.2505.19440