Grokking is the phenomenon in which a neural network memorises its training data and then, after a prolonged delay, suddenly generalises. Two problems remain open: grokking can occur even at weight-decay strength β = 0, and a uniform penalty β‖θ‖² somehow produces direction-selective compression. We prove a main theorem that directly handles discrete SGD and non-quadratic loss surfaces. By analysing constrained minimisation of F(θ) = L_data(θ) + β‖θ‖² on the zero-loss manifold M₀ in the Hessian eigenbasis, we show that the effective decay rate γₖ in eigendirection vₖ satisfies

−log(1 − η(hₖ + 2β))/η − C'ₖ·ε ≤ γₖ ≤ −log(1 − η(hₖ + 2β))/η + C'ₖ·ε.

Three corollaries are derived as special cases: discrete linear (ε→0), continuous nonlinear (η→0), and continuous linear (η, ε→0). Together with the main theorem, they unify four theoretical levels. This single theorem resolves both open problems and shows that SGD discreteness accelerates grokking. Numerical verification on two-layer MLPs for modular addition (mod 7 + mod 5) confirms the main theorem in 374/374 conditions (100%).

Changes from v1: the main theorem is extended from the continuous × linear case (γₖ = hₖ + 2β) to the discrete × nonlinear case, and verification is upgraded from a quadratic surrogate to actual neural networks.
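The central term of the bound can be illustrated with a minimal sketch of the discrete linear case (ε→0). The sketch rests on assumptions not stated in the abstract: η is read as the learning rate, hₖ as the Hessian eigenvalue along vₖ, and the loss along that direction is taken to be the one-dimensional quadratic ½·h·x² + β·x². The function names are hypothetical and this is not the author's verification code. It compares the closed-form rate −log(1 − η(h + 2β))/η with the rate measured from plain gradient descent and with the continuous-limit rate h + 2β.

```python
import math


def discrete_rate(h, beta, eta):
    """Closed-form rate -log(1 - eta*(h + 2*beta)) / eta (central term of the bound)."""
    u = eta * (h + 2.0 * beta)
    if not 0.0 <= u < 1.0:
        raise ValueError("requires 0 <= eta*(h + 2*beta) < 1")
    return -math.log1p(-u) / eta


def continuous_rate(h, beta):
    """Continuous-time limit (eta -> 0): gamma = h + 2*beta."""
    return h + 2.0 * beta


def simulated_rate(h, beta, eta, steps=200, x0=1.0):
    """Measure the decay rate of gradient descent on the assumed quadratic
    F(x) = 0.5*h*x**2 + beta*x**2, in units of training time eta*steps."""
    x = x0
    for _ in range(steps):
        grad = h * x + 2.0 * beta * x  # dF/dx for the assumed quadratic
        x -= eta * grad
    return -math.log(abs(x) / abs(x0)) / (eta * steps)


if __name__ == "__main__":
    h, beta, eta = 0.5, 0.1, 0.1  # illustrative values only
    print("continuous:", continuous_rate(h, beta))
    print("discrete  :", discrete_rate(h, beta, eta))
    print("simulated :", simulated_rate(h, beta, eta))
```

Because −log(1 − u) ≥ u for 0 ≤ u < 1, the discrete rate in this toy setting is never smaller than h + 2β, which is consistent with the abstract's claim that discreteness accelerates decay. The ε-dependent correction C'ₖ·ε is not modelled here.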
Yuhi Koike
Tokyo University of Science
www.synapsesocial.com/papers/69aa701a531e4c4a9ff5981e — DOI: https://doi.org/10.5281/zenodo.18860534