Grokking, delayed generalization long after memorization, has been observed almost exclusively in classical networks trained with weight decay. We show that 9-layer parameterized quantum circuits (PQCs, 463 parameters) grok modular addition, $(a+b) \bmod 23$, without weight decay, achieving a 25\% grok rate across 20 seeds. A classical transformer (14,304 parameters) requires weight decay ($\lambda = 1.0$) to reach 100\% grokking and fails without it. Weight decay has opposite effects on the two architectures: removing it improves quantum test accuracy (68.7\% vs.\ 42.3\%; Mann--Whitney $p = 0.002$) while eliminating classical grokking entirely. Tracking the bipartite entanglement entropy (EE) of the circuit state, we find that successful grokking occurs in a sweet spot ($\mathrm{EE} \in [3.1, 4.1]$), while EE overshoot above $4.2$ triggers a ``grok-then-ungrok'' phenomenon: test accuracy rises to 93.7\% and then collapses to 11.3\% as entanglement saturates. Ablations confirm that entanglement is necessary (product-state circuits reach 0.2\%) and that PQCs are more parameter-efficient than size-matched classical models. These results connect quantum unitarity to implicit regularization and entanglement dynamics to generalization transitions.
liang wang (Sun,) studied this question.
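For readers unfamiliar with the quantity tracked in the abstract, the following is a minimal sketch of how a bipartite entanglement entropy can be computed from a pure statevector using a Schmidt decomposition. It is an illustration under stated assumptions, not the authors' implementation: the function name, the choice of cutting the register after the first `n_left` qubits, and the logarithm base (bits by default) are all hypothetical, since the abstract does not specify the bipartition or the units of the reported EE values.

```python
import numpy as np


def bipartite_entanglement_entropy(state, n_qubits, n_left, base=2.0):
    """Von Neumann entropy of the first n_left qubits of a pure state.

    Assumptions (not taken from the paper): the state is a length-2**n_qubits
    amplitude vector, the cut separates the first n_left qubits from the rest,
    and the entropy is reported in units of log(base).
    """
    # Reshape the amplitudes into a (left subsystem) x (right subsystem) matrix.
    psi = np.asarray(state, dtype=complex).reshape(
        2 ** n_left, 2 ** (n_qubits - n_left)
    )
    # The singular values of this matrix are the Schmidt coefficients.
    schmidt = np.linalg.svd(psi, compute_uv=False)
    probs = schmidt ** 2
    probs = probs / probs.sum()      # guard against an unnormalized input
    probs = probs[probs > 1e-12]     # drop zeros so log() stays finite
    return float(-(probs * np.log(probs)).sum() / np.log(base))


# Two-qubit examples: a product state and a maximally entangled Bell state.
product_state = np.array([1.0, 0.0, 0.0, 0.0])
bell_state = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2.0)

ee_product = bipartite_entanglement_entropy(product_state, n_qubits=2, n_left=1)
ee_bell = bipartite_entanglement_entropy(bell_state, n_qubits=2, n_left=1)
```

Under these assumptions a product state has zero entropy and a Bell state has one bit, which matches the abstract's observation that product-state circuits sit at the unentangled extreme. Applied to the circuit state after each training step, the same function would give an EE trajectory of the kind the abstract describes.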