March 29, 2026 · Open Access

Grokking Beyond Addition: Circuit-Level Analysis of Algebraic Learning in Transformers

Key Points

This work explores grokking in transformers across diverse algebraic structures beyond modular addition.
The study analyzes eight algebraic operations, spanning abelian groups, a composite ring, and non-abelian groups, using 1-layer transformers.
It compares accuracy and learned representations across these operations.
It applies discrete-log re-indexing to test whether modular multiplication is represented through discrete logarithms.
All abelian operations reach 100% test accuracy, while non-abelian groups fail to generalize despite perfect training accuracy.
Discrete-log re-indexing improves Fourier concentration for modular multiplication by 2.14×.
Non-abelian models show partial circuit formation, identified via Peter–Weyl decomposition, even without grokking.
High embedding similarity across operations (CKA ≥ 0.80 for all pairs) suggests a shared representational substrate.
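
The idea behind discrete-log re-indexing can be illustrated with a small sketch. For a prime p, every nonzero residue is a power of a primitive root g, so relabeling each residue by its exponent turns multiplication mod p into addition mod p − 1. The modulus `p = 7` and the helper names `primitive_root` and `dlog_table` below are illustrative assumptions, not taken from the paper or its repository.

```python
def primitive_root(p):
    """Return the smallest primitive root modulo a prime p (brute force)."""
    for g in range(2, p):
        # g is a primitive root iff its powers visit all p - 1 nonzero residues.
        seen = set()
        x = 1
        for _ in range(p - 1):
            x = (x * g) % p
            seen.add(x)
        if len(seen) == p - 1:
            return g
    raise ValueError("no primitive root found; expected an odd prime p")


def dlog_table(p):
    """Map each nonzero residue a to the exponent k with g**k == a (mod p)."""
    g = primitive_root(p)
    table = {}
    x = 1
    for k in range(p - 1):
        table[x] = k
        x = (x * g) % p
    return g, table


p = 7  # hypothetical small modulus chosen for readability
g, dlog = dlog_table(p)

# After re-indexing, multiplication mod p is addition mod (p - 1) on exponents.
for a in range(1, p):
    for b in range(1, p):
        assert dlog[(a * b) % p] == (dlog[a] + dlog[b]) % (p - 1)
```

Under this relabeling, a model that has learned a Fourier-style circuit for addition could in principle reuse it for multiplication, which is the intuition the 2.14× concentration result is meant to support.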

Abstract

This paper investigates the phenomenon of grokking in transformers across a broader class of algebraic structures beyond modular addition. Prior mechanistic interpretability work has shown that transformers trained on modular addition learn Fourier-based clock circuits and exhibit delayed generalisation (grokking). We extend this analysis to eight algebraic operations spanning abelian groups, a composite ring, and non-abelian groups (S3, D5, A4, S4), using 1-layer transformers at d_model = 64. Our key findings are:

1. A clear abelian vs non-abelian grokking boundary: all abelian operations achieve 100% test accuracy, while non-abelian groups fail to generalise despite perfect training accuracy.
2. Discrete-log re-indexing improves Fourier concentration for modular multiplication (2.14×), supporting the discrete logarithm representation hypothesis.
3. Non-abelian models exhibit partial circuit formation via Peter–Weyl decomposition even without grokking.
4. Cross-operation embedding similarity (CKA ≥ 0.80 across all pairs) suggests a shared representational substrate.
5. A capacity-dependent interpretation: abelian tasks rely on 1D irreducible representations, while non-abelian tasks require higher-dimensional irreps exceeding model capacity at d_model = 64.

All experiments are reproducible via provided code and checkpoint-resume pipelines, runnable on a free Colab T4 GPU (~3 hours). This work contributes new empirical evidence toward understanding the role of algebraic structure and representation theory in neural network generalisation.

Code repository: https://github.com/justbytecode/grokking-beyond-addition
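
For readers unfamiliar with the CKA similarity score in finding 4, the following is a minimal sketch of linear CKA between two embedding matrices whose rows correspond to the same tokens. The abstract does not say which CKA variant the paper uses, so the linear kernel is an assumption here, and the toy matrices `X`, `Y`, and `Z` are made up for illustration.

```python
import math


def center_columns(X):
    """Subtract each column's mean (rows are examples, columns are features)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]


def cross_gram_sq_norm(X, Y):
    """Squared Frobenius norm of X^T Y, computed column pair by column pair."""
    total = 0.0
    for xc in zip(*X):
        for yc in zip(*Y):
            dot = sum(a * b for a, b in zip(xc, yc))
            total += dot * dot
    return total


def linear_cka(X, Y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    Xc, Yc = center_columns(X), center_columns(Y)
    num = cross_gram_sq_norm(Xc, Yc)
    den = math.sqrt(cross_gram_sq_norm(Xc, Xc) * cross_gram_sq_norm(Yc, Yc))
    return num / den  # undefined if either matrix has no variance


# Toy embeddings: 4 tokens, 2 dimensions (illustrative values only).
X = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [-2.0, 1.0]]

# Y is X rotated by 0.7 rad and scaled by 3; CKA should ignore both changes.
c, s = math.cos(0.7), math.sin(0.7)
Y = [[3 * (c * x - s * y), 3 * (s * x + c * y)] for x, y in X]

# Z keeps only the first coordinate of X, so some structure is lost.
Z = [[row[0]] for row in X]

cka_rot = linear_cka(X, Y)
cka_sub = linear_cka(X, Z)
```

Because the score is invariant to rotation and uniform scaling, a high value across operations indicates similar geometry rather than identical coordinates.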

Authors

Mani Pal