The phenomenon of "grokking," in which overparameterized neural networks generalize only long after reaching perfect training accuracy, remains a fundamental mystery in deep learning theory. While empirical studies have characterized grokking as a transition from memorization to structured representation, a unified mathematical account linking this transition to optimization theory has remained elusive. In this paper, we characterize grokking as a sharp crossover driven by the implicit regularization of gradient flow. By modeling learning dynamics in the feature-learning regime with exponential-tailed losses, we prove that gradient flow separates into two phases: a rapid margin-attainment phase in which the network memorizes the training data, and a continuous implicit margin-maximization phase in which the effective path norm of the network strictly decays. This continuous reduction in normalized complexity provably tightens the Rademacher generalization bound over time, which explains the delayed drop in test error known as grokking. We explicitly connect our path norm bounds to mechanistic interpretability, outline a reproducible empirical framework, and discuss extensions to modern architectures, including Transformers.
Sittiphol Phanvilai (Sun,) studied this question.