What question did this study set out to answer?

This study investigates how different prompting styles, in particular novice-directed versus expert-directed framing, change which semantic features activate inside a large language model.

February 5, 2026 | Open Access

Sparse Feature Analysis of Deep Layer Expansion: A Mechanistic Interpretation via SAE

Key Points

Applied a Sparse Autoencoder (SAE) to decompose the activations underlying Deep Layer Expansion.
Compared novice and expert prompts using feature activation metrics.
Used UMAP visualization to assess how the prompt conditions cluster.
Novice prompts activate 17% more features than expert prompts (132.4 vs. 113.1 on average).
369 features are activated exclusively by novice prompts, compared to 208 for expert prompts.
The six prompt conditions form distinct clusters in activation space, showing clear semantic differences between prompt styles.

Abstract

Zhao (2026) demonstrated that expert-level prompts induce "Deep Layer Expansion," a 60-100% increase in Effective Intrinsic Dimension (EID) at deep layers. However, EID is a global metric and does not reveal which semantic features are activated. In this paper, we apply Sparse Autoencoder (SAE) analysis to decompose the activation differences between prompt styles. Using Goodfire's Llama-3.3-70B SAE (Layer 50, 65,536 features), we find that: (1) "explain to a novice" activates 17% more features than "explain to an expert" (132.4 vs. 113.1 on average); (2) 369 features are activated exclusively by novice prompts vs. 208 by expert prompts; (3) 10 features show perfect separation between the novice and expert conditions; (4) AutoInterp analysis (6 conditions × 50 topics = 300 samples) shows that these features exhibit semantic subdivision, encoding distinct dimensions such as "expert identity," "serious attitude," "depth requirement," and "technical analysis"; (5) UMAP visualization confirms that the 6 prompt conditions form distinct clusters in both raw activation space and SAE feature space, with the SAE acting as a semantic denoiser that merges the noise-only conditions (standard/padding/spaces) while preserving the semantic distinctions (novice/expert/guru). These findings suggest that prompt effects are compositional, with different prompt elements triggering different feature subsets.
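
The feature-count comparisons in findings (1) to (3) can be illustrated with a minimal sketch. This is not the authors' code: the toy data, the function names, and the definitions used here are assumptions made for illustration. In particular, a feature is treated as "active" if its index appears in a sample's set (that is, it passed some unspecified threshold), "exclusive" is taken to mean active in at least one sample of one condition and in none of the other, and "perfect separation" is taken to mean active in every sample of one condition and in none of the other. The paper may define these terms differently.

```python
from statistics import mean

# Hypothetical toy data: each sample is the set of SAE feature indices that
# were active for one prompt. Real data would come from SAE activations.
novice = [{1, 2, 3, 7}, {1, 2, 4, 7}, {1, 3, 5, 7}]
expert = [{2, 3, 8}, {2, 4, 8}, {3, 6, 8}]


def mean_active(samples):
    """Average number of active features per sample."""
    return mean(len(s) for s in samples)


def relative_increase(a, b):
    """Fractional increase of a over b, e.g. 0.17 for '17% more'."""
    return (a - b) / b


def exclusive(a, b):
    """Features active in at least one sample of a and in no sample of b."""
    return set().union(*a) - set().union(*b)


def perfectly_separating(a, b):
    """Features active in every sample of a and in no sample of b
    (an assumed reading of 'perfect separation')."""
    return set.intersection(*a) - set().union(*b)


if __name__ == "__main__":
    print("mean active (novice, expert):", mean_active(novice), mean_active(expert))
    print("novice-exclusive features:", sorted(exclusive(novice, expert)))
    print("expert-exclusive features:", sorted(exclusive(expert, novice)))
    print("perfect separators (novice):", sorted(perfectly_separating(novice, expert)))
    # Sanity check on the averages reported in the abstract.
    print("reported relative increase:", round(relative_increase(132.4, 113.1), 3))
```

Applied to the averages reported above, `relative_increase(132.4, 113.1)` gives roughly 0.17, which matches the stated 17% figure.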
