What question did this study set out to answer?

To investigate the limitations of traditional transformers in generalizing beyond training lengths on arithmetic tasks.

May 27, 2026 | Open Access

Dual-Head Attention Enables Length Generalization in Transformer Multiplication

Key Points

Investigated why standard Transformers fail to generalize beyond their training lengths on arithmetic tasks.
Introduced Dual-Head Attention with Gram-Schmidt-orthogonalized sine heads alongside standard cosine heads.
Trained an 883K-parameter model on N×N integer multiplication with 1-6 digit operands.
Evaluated performance on unseen 7-10 digit operands.
Achieved 80.6% exact-match accuracy on 7-10 digit operands.
A standard Transformer with identical capacity scored near zero on the same operands.
The model did not utilize a scratchpad or task-specific positional encoding.
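
The evaluation protocol in the points above (train on 1-6 digit operands, test on unseen 7-10 digit operands, score by exact match) can be sketched roughly as follows. This is a minimal illustration, not the authors' evaluation code: `sample_operand`, `exact_match_accuracy`, and the `oracle` and `off_by_one` predictors are hypothetical names standing in for a trained model, and the assumption that both operands share the same digit count is read off the "N×N" description.

```python
import random


def sample_operand(num_digits, rng):
    """Draw an integer with exactly num_digits decimal digits."""
    low = 10 ** (num_digits - 1)
    high = 10 ** num_digits - 1
    return rng.randint(low, high)


def exact_match_accuracy(predict, digit_range, n_samples, seed=0):
    """Fraction of problems whose predicted string equals the true product exactly."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_samples):
        # Assumption: "N x N" means both operands have the same number of digits.
        n = rng.choice(digit_range)
        a = sample_operand(n, rng)
        b = sample_operand(n, rng)
        # Exact match: every digit of the output must be right.
        if predict(a, b) == str(a * b):
            correct += 1
    return correct / n_samples


# Hypothetical stand-ins for a trained model's decoding function.
def oracle(a, b):
    return str(a * b)


def off_by_one(a, b):
    return str(a * b + 1)


held_out_lengths = range(7, 11)  # 7-10 digit operands, unseen during training
print(exact_match_accuracy(oracle, held_out_lengths, n_samples=100))
print(exact_match_accuracy(off_by_one, held_out_lengths, n_samples=100))
```

Exact match is a strict metric: a single wrong digit in a product of up to 20 digits counts as a failure, which is why a near-zero baseline and an 80.6% score are far apart in practice.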

Abstract

Transformers fail to generalize beyond training lengths on arithmetic tasks. We argue the root cause is geometric: dot-product attention projects onto the subspace spanned by training data, and cannot capture structural patterns that are orthogonal to content similarity. We introduce Dual-Head Attention, which adds Gram-Schmidt-orthogonalized sine heads alongside standard cosine heads. On N×N integer multiplication, an 883K-parameter model trained on 1-6 digit operands achieves 80.6% exact-match accuracy on 7-10 digit unseen operands, where a standard Transformer with identical capacity scores near zero. The model uses no scratchpad and no task-specific positional encoding. Code: https://github.com/yzb3001313-star/Dual-Head-Attention-Enables-Length-Generalization
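
One way to read the geometric argument is through a Gram-Schmidt split of a query vector against a key vector. The sketch below is an illustration under stated assumptions, not the paper's implementation: it treats the "cosine" signal as normalized dot-product similarity and the "sine" signal as the relative size of the orthogonalized residual. The function names are hypothetical, and the actual head construction should be taken from the linked repository.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def gram_schmidt_split(q, k):
    """Split q into a part parallel to k and a residual orthogonal to k."""
    coeff = dot(q, k) / dot(k, k)
    parallel = [coeff * x for x in k]
    residual = [x - p for x, p in zip(q, parallel)]
    return parallel, residual


def cosine_and_sine(q, k):
    """Assumed reading: cosine = content similarity, sine = orthogonal remainder."""
    _, residual = gram_schmidt_split(q, k)
    cos = dot(q, k) / (norm(q) * norm(k))
    sin = norm(residual) / norm(q)
    return cos, sin


q = [3.0, 4.0]
k = [1.0, 0.0]
parallel, residual = gram_schmidt_split(q, k)

# The dot product with k only sees the parallel part; the residual contributes nothing.
print(dot(parallel, k), dot(residual, k))
print(cosine_and_sine(q, k))
```

Under this reading, a dot-product score discards whatever lies in the residual, and a second family of heads built from that residual recovers it. The two quantities satisfy cos² + sin² = 1, so together they account for the whole query direction relative to the key.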
