What does this research mean for the field?

The study bounds the effective decay rate along each Hessian eigendirection under discrete stochastic gradient descent (SGD). According to the authors, this single result resolves two open problems in the understanding of grokking dynamics and shows that the discreteness of SGD accelerates grokking in neural networks. Novelty: novel finding. Consensus alignment: neutral.

What question did this study set out to answer?

This research aims to understand the dynamics of grokking in neural networks and provides a unified theory regarding effective regularisation.

March 6, 2026 · Open Access

Effective Regularisation from Loss-Landscape Geometry: A Unified Derivation of Direction-Dependent Grokking Dynamics

Key Points

This research aims to understand the dynamics of grokking in neural networks and provides a unified theory regarding effective regularisation.
Proves a main theorem that applies directly to discrete stochastic gradient descent (SGD) and non-quadratic loss surfaces.
Analyses constrained minimisation on the zero-loss manifold in the Hessian eigenbasis.
Derives three corollaries for different limiting cases, unifying the theoretical results.
Verifies the theorem numerically on two-layer multi-layer perceptrons (MLPs) trained on modular addition.
Shows that effective decay rates are direction-dependent.
Shows that the discreteness of SGD accelerates grokking.
Confirms the main theorem in all 374 tested conditions (100%).
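
The direction dependence and the effect of discreteness listed above can be illustrated on a toy problem. The sketch below is not the paper's experiment. It assumes a purely quadratic loss that is diagonal in the Hessian eigenbasis (the ε→0 case) and noise-free gradient descent instead of SGD. The function names `effective_decay_rate` and `simulate_decay_rate`, the eigenvalues, and the step size are hypothetical and chosen only for illustration.

```python
import numpy as np


def effective_decay_rate(h, beta, eta):
    """Closed-form rate -log(1 - eta*(h + 2*beta)) / eta for each eigenvalue in h.

    Assumption: purely quadratic loss (the epsilon -> 0 case), so the
    C'_k * epsilon correction from the theorem is ignored here.
    """
    x = eta * (np.asarray(h, dtype=float) + 2.0 * beta)
    if np.any(x >= 1.0):
        raise ValueError("need eta * (h + 2*beta) < 1 for the logarithm to be defined")
    return -np.log1p(-x) / eta


def simulate_decay_rate(h, beta, eta, steps=200):
    """Measure the per-direction decay rate of noise-free gradient descent.

    Toy objective, diagonal in the Hessian eigenbasis:
        F(theta) = 0.5 * sum(h * theta**2) + beta * sum(theta**2)
    """
    h = np.asarray(h, dtype=float)
    theta0 = np.ones_like(h)
    theta = theta0.copy()
    for _ in range(steps):
        grad = h * theta + 2.0 * beta * theta  # gradient of F
        theta = theta - eta * grad
    # Amplitude shrinks like exp(-gamma * eta * steps); solve for gamma.
    return -np.log(np.abs(theta / theta0)) / (eta * steps)


if __name__ == "__main__":
    h = np.array([0.0, 0.5, 2.0])  # illustrative Hessian eigenvalues
    for beta in (0.0, 0.01):
        print("beta =", beta)
        print("  closed form:", effective_decay_rate(h, beta, eta=0.1))
        print("  simulated:  ", simulate_decay_rate(h, beta, eta=0.1))
        print("  continuous: ", h + 2.0 * beta)
```

In this toy setting each update multiplies the component along direction k by 1 − η(hₖ + 2β), so directions with larger hₖ shrink faster even though the penalty β‖θ‖² is uniform, and directions with hₖ > 0 still shrink when β = 0. The closed-form discrete rate is never below the continuous value hₖ + 2β.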

Abstract

Grokking is the phenomenon in which a neural network memorises its training data and then, after a prolonged delay, suddenly generalises. Two problems remain open: grokking can occur even at weight-decay strength β = 0, and a uniform penalty β‖θ‖² somehow produces direction-selective compression. We prove a main theorem that directly handles discrete SGD and non-quadratic loss surfaces. By analysing constrained minimisation of F(θ) = L_data(θ) + β‖θ‖² on the zero-loss manifold M₀ in the Hessian eigenbasis, we show that the effective decay rate γₖ in eigendirection vₖ satisfies:

−log(1 − η(hₖ + 2β))/η − C'ₖ·ε ≤ γₖ ≤ −log(1 − η(hₖ + 2β))/η + C'ₖ·ε

Three corollaries are derived as special cases: discrete linear (ε→0), continuous nonlinear (η→0), and continuous linear (η, ε→0). Together these unify four theoretical levels. This single theorem resolves both open problems and shows that SGD discreteness accelerates grokking. Numerical verification on two-layer MLPs for modular addition (mod 7 + mod 5) confirms the main theorem in 374/374 conditions (100%).

Changes from v1: Main theorem extended from continuous×linear (γₖ = hₖ + 2β) to discrete×nonlinear. Verification upgraded from quadratic surrogate to actual neural networks.
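
The statement that discreteness accelerates the dynamics can be read off the central term of the bound by a series expansion. The following is a sketch and not the paper's derivation. It assumes 0 < η(hₖ + 2β) < 1 so that the logarithm is defined and the series converges, and the shorthand γₖ⁽⁰⁾ for the central term is notation introduced here.

```latex
\[
\gamma_k^{(0)} := -\frac{1}{\eta}\log\bigl(1 - \eta(h_k + 2\beta)\bigr)
  = (h_k + 2\beta) + \frac{\eta}{2}(h_k + 2\beta)^2
    + \frac{\eta^2}{3}(h_k + 2\beta)^3 + \cdots
\]
\[
\bigl|\gamma_k - \gamma_k^{(0)}\bigr| \le C'_k\,\varepsilon ,
\qquad
\lim_{\eta \to 0} \gamma_k^{(0)} = h_k + 2\beta .
\]
```

Every correction term is positive under the stated assumption, so γₖ⁽⁰⁾ exceeds the continuous linear value hₖ + 2β for any finite step size η, and the two agree as η→0.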
