What question did this study set out to answer?

To explore the geometric nature of grokking and its implications for understanding generalization in neural networks.

February 2, 2026 (Open Access)

Grokking as Manifold Discovery: A Geometric Reinterpretation of Delayed Generalization

Key Points

Review existing theories on grokking and their limitations.
Propose the Manifold Discovery Hypothesis as a new framework.
Conduct experiments on modular addition and multiplication to test the hypothesis.
Analyze representation dimensionality using PCA and examine topological summaries.
Effective dimensionality of the representations (PCA, 95% threshold) dropped sharply in both tasks (from 78 to 8, and from 89 to 11).
Topological summaries showed order-of-magnitude changes, indicating shifts in representation.
Cluster structures emerged in visualizations of the learned representations.
In modular multiplication, the model learned a quotient group structure (k mod 12 cosets) with 99.4% purity.
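
The effective-dimensionality figures above are reported under a 95% PCA threshold. As a rough illustration of how such a number can be computed, the sketch below counts the principal components needed to reach a given share of total variance. The function name `effective_dim`, the synthetic arrays, and the choice of NumPy are assumptions made for this example; the paper's own pipeline (which layer is probed, how activations are collected) is not described on this page.

```python
import numpy as np


def effective_dim(reps, threshold=0.95):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `threshold`.

    reps: array of shape (n_samples, n_features), one representation per row.
    """
    reps = np.asarray(reps, dtype=float)
    centered = reps - reps.mean(axis=0, keepdims=True)
    # Squared singular values of the centered data are proportional
    # to the PCA variances, already sorted in descending order.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    cumulative = np.cumsum(variances) / variances.sum()
    # Index of the first component at which the threshold is reached.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(variances))


# Synthetic stand-ins (hypothetical, not the paper's data):
rng = np.random.default_rng(0)
latent = rng.normal(size=(2000, 3))          # 3 underlying degrees of freedom
mixing = rng.normal(size=(3, 64))            # embedded linearly in 64 dimensions
low_rank_reps = latent @ mixing              # "manifold-like" representations
isotropic_reps = rng.normal(size=(2000, 64)) # variance spread over all 64 axes

low_dim = effective_dim(low_rank_reps)
high_dim = effective_dim(isotropic_reps)
```

In this toy setup the low-rank data needs far fewer components than the isotropic data, which is the kind of contrast the reported drops describe. PCA only captures linear structure, so a curved manifold can have a PCA effective dimension larger than its intrinsic dimension.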

Abstract

Grokking, the phenomenon in which neural networks suddenly generalize after prolonged overfitting, has attracted multiple theoretical explanations since its discovery in 2022, including the Goldilocks Zone, Softmax Collapse, and the Lazy-Rich transition. This paper reviews these theories and identifies their common blind spot: most focus on external measurements and lack a direct characterization of the geometry of representation space. Among them, the Goldilocks Zone theory touches on the "physical laws" of high-dimensional space and carries substantial theoretical value. We propose a unified framework, the Manifold Discovery Hypothesis: memorization is a jagged high-dimensional curve passing through all training points, generalization is the discovery of the low-dimensional manifold on which the data is distributed, and grokking is the transition from the former to the latter (possibly accompanied by critical-state oscillations). We provide supporting evidence from two groups of experiments, modular addition and modular multiplication: a significant drop in the effective dimensionality of representations (78→8 / 89→11 under a 95% PCA threshold), order-of-magnitude changes in topological summaries, and the emergence of cluster structures in dimensionality-reduced visualizations. Notably, the modular multiplication experiment revealed that the model learned a quotient group structure (k mod 12 cosets, purity 99.4%), which prompted us to revise the hypothesis into a two-stage model: local manifold discovery → global gluing. In one sentence: high-dimensional curve → low-dimensional surface.
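
The abstract reports a coset purity of 99.4% without defining the metric on this page. One common reading is cluster purity: the fraction of points whose label matches the majority label of the cluster they fall into. The sketch below computes that quantity under this assumption. The function name `cluster_purity`, the toy index range, and the labelling by `k % 12` are illustrative choices, not the paper's stated procedure.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, labels):
    """Fraction of points whose label equals the majority label
    of the cluster they were assigned to."""
    if len(cluster_ids) != len(labels):
        raise ValueError("cluster_ids and labels must have the same length")
    if not labels:
        raise ValueError("need at least one point")
    # Count how often each label occurs inside each cluster.
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        by_cluster[cluster][label] += 1
    # Each cluster contributes the size of its largest label group.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(labels)


# Toy data (hypothetical): indices k = 0..95, labelled by k mod 12.
ks = list(range(96))
coset_labels = [k % 12 for k in ks]

perfect_clusters = list(coset_labels)  # clusters coincide with the cosets
noisy_clusters = list(coset_labels)
noisy_clusters[0] = 1                  # one point placed in the wrong cluster

perfect_purity = cluster_purity(perfect_clusters, coset_labels)
noisy_purity = cluster_purity(noisy_clusters, coset_labels)
```

Purity rewards over-segmentation (one cluster per point scores 1.0), so it is usually read alongside the number of clusters found.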

Cite This Study

Jin et al. studied this question.

synapsesocial.com/papers/6980fbbec1c9540dea80d811
https://doi.org/10.5281/zenodo.18416965