What question did this study set out to answer?

The research aims to provide a mathematical understanding of grokking in neural networks and its connection to optimization theory.

June 23, 2026 · Open Access

Grokking via Implicit Path Norm Minimization: A Theoretical Mechanism

Key Points

Developed a mathematical model to analyze learning dynamics with exponential-tailed losses
Characterized grokking through gradient flow and path norm decay
Connected theoretical findings to mechanistic interpretability and outlined an empirical framework
Proved that gradient flow transitions from memorization to margin maximization phases
Established a continuous reduction in path norm that tightens the Rademacher generalization bound
Explained the time-dependent nature of the delayed drop in test error associated with grokking

Abstract

The phenomenon of "grokking"—where overparameterized neural networks exhibit delayed generalization long after achieving perfect training accuracy—remains a fundamental mystery in deep learning theory. While empirical studies have characterized grokking as a transition from memorization to structured representation, a unified mathematical proof linking this to optimization theory has remained elusive. In this paper, we characterize grokking as a sharp crossover driven by the implicit regularization of gradient flow. By modeling learning dynamics in the feature-learning regime with exponential-tailed losses, we prove that gradient flow naturally segregates into two phases: a rapid margin-attainment phase where the network memorizes the training data, and a continuous implicit margin maximization phase where the effective path norm of the network strictly decays. This continuous reduction in normalized complexity provably tightens the Rademacher generalization bound over time, theoretically explaining the delayed drop in test error known as grokking. We explicitly connect our path norm bounds to mechanistic interpretability, outline a reproducible empirical framework, and discuss extensions to modern architectures including Transformers.
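As a rough illustration of the quantity the abstract refers to, the sketch below computes an l1 path norm and a path-norm-normalized margin for a toy bias-free two-layer ReLU network. This is one common definition of the path norm, assumed here for illustration; the paper's "effective path norm" may be defined differently. The network shape, the helper names (`forward`, `path_norm`, `normalized_margin`), and the toy data are all hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: a bias-free two-layer ReLU network
# f(x) = sum_j v[j] * relu(dot(W[j], x)). Not the paper's construction.

def relu(z):
    return z if z > 0.0 else 0.0

def forward(W, v, x):
    """Network output for a single input vector x."""
    return sum(vj * relu(sum(w * xi for w, xi in zip(Wj, x)))
               for Wj, vj in zip(W, v))

def path_norm(W, v):
    """l1 path norm: sum over input->hidden->output paths of |weight product|."""
    return sum(abs(vj) * sum(abs(w) for w in Wj) for Wj, vj in zip(W, v))

def normalized_margin(W, v, X, y):
    """Smallest signed output y_i * f(x_i), divided by the path norm."""
    margin = min(yi * forward(W, v, xi) for xi, yi in zip(X, y))
    return margin / path_norm(W, v)

# Toy data with labels in {-1, +1} (hypothetical values).
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, -1.0]
W = [[1.0, -1.0], [-1.0, 1.0]]  # hidden-layer weights, one row per unit
v = [1.0, -1.0]                 # output-layer weights

base = normalized_margin(W, v, X, y)

# Scaling every weight by c multiplies both the raw margin and the path
# norm by c**2 for this network, so the normalized margin is unchanged.
c = 3.0
W_scaled = [[c * w for w in Wj] for Wj in W]
v_scaled = [c * vj for vj in v]
scaled = normalized_margin(W_scaled, v_scaled, X, y)
```

Because rescaling the weights leaves the normalized margin unchanged in this toy network, the raw margin alone says little about complexity. Under these assumptions, a path norm that shrinks while the training margin is held fixed raises the normalized margin, which is the kind of quantity margin-based Rademacher bounds depend on.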

Cite This Study

Sittiphol Phanvilai (Sun,) studied this question.

synapsesocial.com/papers/6a3a223b111626ef22ab6e08 https://doi.org/10.5281/zenodo.20782666
