What does this research mean for the field?

The study draws three conclusions: apparent structural bottlenecks in linear attention models are often artifacts of insufficient training budgets; intermediate convolution kernels cause systematic performance collapse; and single-seed evaluations are fundamentally unreliable. These findings contradict earlier claims and challenge the current consensus.

What question did this study set out to answer?

This study examines the training dynamics, convolution-kernel behavior, and seed sensitivity of CubeAttn B+, a linear-complexity attention operator.

June 6, 2026 (Open Access)

CubeAttn: Training Dynamics, Kernel Failure Modes, and Seed Sensitivity in Linear Attention

Key Points

The study is a systematic empirical comparison across two training budgets (3000 and 6000 steps) and 9 architectural variants.
A kernel ablation with 5-seed validation identifies failure modes tied to convolution kernel size.
The analysis tracks metrics such as Induction Head accuracy and Long-Range Recall across kernel sizes.
Extending training from 3000 to 6000 steps raised baseline Induction Head accuracy from 31.6% to 100%, but lowered Reversed Copy accuracy from 97% to 47%.
Kernel failure follows a U-shaped curve: intermediate kernels (k=5–11) drop Induction Head accuracy to as low as 20%, while k=3 and k=15 both stay above 96%.
5-seed validation showed that single-seed metrics can mislead: Long-Range Recall was overestimated by up to 50% (16.8% single-seed vs 11.6% mean), and Induction Head accuracy swung by more than 80 percentage points (11.7%–96.9%) under identical conditions.
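
The seed-sensitivity figures above can be checked with simple arithmetic. The sketch below assumes the overestimate is measured relative to the 5-seed mean, which this summary does not state explicitly. The function names are illustrative and do not come from the paper.

```python
def relative_overestimate(single_seed: float, multi_seed_mean: float) -> float:
    """Fraction by which a single-seed score exceeds the multi-seed mean."""
    return (single_seed - multi_seed_mean) / multi_seed_mean


def swing_pp(lowest: float, highest: float) -> float:
    """Spread between worst and best seed, in percentage points."""
    return highest - lowest


# Figures quoted in the abstract (accuracies in percent).
lrr_overestimate = relative_overestimate(16.8, 11.6)  # Long-Range Recall
induction_swing = swing_pp(11.7, 96.9)                # Induction Head

print(f"LRR relative overestimate: {lrr_overestimate:.1%}")
print(f"Induction Head swing: {induction_swing:.1f} pp")
```

Under that assumption, the two quoted Long-Range Recall figures give a relative overestimate of roughly 45%, consistent with the "up to 50%" bound, and the Induction Head range spans about 85 percentage points.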

Abstract

We present a systematic empirical study of a linear-complexity attention operator (CubeAttn B+) that replaces the cross-token attention matrix with a lightweight global aggregation mechanism. Our investigation spans three experimental phases with increasing rigor: an initial four-operator comparison at 3000 training steps, a follow-up at 6000 steps with 9 architectural variants, and a kernel ablation with 5-seed validation. We report three findings with methodological implications beyond our specific operator. First, extending training from 3000 to 6000 steps increases baseline Induction Head accuracy from 31.6% to 100%, revealing that the apparent "structural bottleneck" observed in prior work was primarily a training budget artifact — yet the same extension causes Reversed Copy accuracy to drop from 97% to 47%, exposing a training–structure interaction. Second, we discover a U-shaped kernel failure curve: intermediate convolution kernels (k=5–11) cause systematic collapse on Induction Head (as low as 20%), while both smaller (k=3) and larger (k=15) kernels perform well (>96%), partitioning the kernel space into two effective regimes with no viable intermediate option. Third, 5-seed validation reveals that single-seed results are fundamentally unreliable: Long-Range Recall is overestimated by up to 50% (16.8% vs 11.6% mean), while Induction Head exhibits 80+ percentage-point swings under identical conditions (11.7%–96.9%) — yet Reversed Copy remains perfectly stable at 100% across all seeds. This pattern decoupling isolates absolute position encoding as the remaining structural bottleneck. k=3 and k=15 are statistically indistinguishable on LRR (difference < 1 std dev), with k=3 offering lower compute and zero Copy degradation. We recommend that all linear attention evaluations report at least 5-seed statistics and consider training budget as a first-class experimental variable.
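
The recommendation to report at least 5-seed statistics could be followed along the lines below. This is a minimal sketch, not the authors' evaluation code. `train_and_eval` is a hypothetical callable standing in for a real training run, `dummy_eval` returns made-up numbers purely so the example executes, and the "within 1 std dev" check uses the larger of the two sample standard deviations, which is an assumption because the abstract does not say which deviation is used.

```python
import statistics
from typing import Callable, Dict, Sequence


def summarize(scores: Sequence[float]) -> Dict[str, float]:
    """Mean, sample standard deviation, and range over seeds."""
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),  # needs at least two seeds
        "min": min(scores),
        "max": max(scores),
    }


def run_seeds(
    train_and_eval: Callable[[int, int], float],
    kernel_size: int,
    seeds: Sequence[int],
) -> Dict[str, float]:
    """Evaluate one kernel size under every seed and aggregate the scores."""
    return summarize([train_and_eval(kernel_size, seed) for seed in seeds])


def indistinguishable(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """True if the means differ by less than one std dev (larger of the two)."""
    return abs(a["mean"] - b["mean"]) < max(a["std"], b["std"])


def dummy_eval(kernel_size: int, seed: int) -> float:
    """Placeholder metric with invented values; NOT data from the paper."""
    return 10.0 + seed + 0.1 * kernel_size


seeds = [0, 1, 2, 3, 4]
k3 = run_seeds(dummy_eval, kernel_size=3, seeds=seeds)
k15 = run_seeds(dummy_eval, kernel_size=15, seeds=seeds)
print(k3, k15, indistinguishable(k3, k15))
```

Reporting the minimum and maximum alongside the mean and standard deviation makes large per-seed swings, like the one reported for Induction Head, visible at a glance.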

Authors

Yahua Ruan

International Game Technology (Germany)
