We present a systematic empirical study of a linear-complexity attention operator (CubeAttn B+) that replaces the cross-token attention matrix with a lightweight global aggregation mechanism. Our investigation spans three experimental phases of increasing rigor: an initial four-operator comparison at 3000 training steps, a follow-up at 6000 steps with 9 architectural variants, and a kernel ablation with 5-seed validation. We report three findings with methodological implications beyond our specific operator. First, extending training from 3000 to 6000 steps increases baseline Induction Head accuracy from 31.6% to 100%, revealing that the apparent "structural bottleneck" observed in prior work was primarily a training budget artifact; yet the same extension causes Reversed Copy accuracy to drop from 97% to 47%, exposing a training–structure interaction. Second, we discover a U-shaped kernel failure curve: intermediate convolution kernels (k=5–11) cause systematic collapse on Induction Head (as low as 20%), while both smaller (k=3) and larger (k=15) kernels perform well (>96%), partitioning the kernel space into two effective regimes with no viable intermediate option. Third, 5-seed validation reveals that single-seed results are fundamentally unreliable: Long-Range Recall (LRR) is overestimated by up to 50% (16.8% vs 11.6% mean), while Induction Head exhibits 80+ percentage-point swings under identical conditions (11.7%–96.9%); yet Reversed Copy remains perfectly stable at 100% across all seeds. This decoupling across tasks isolates absolute position encoding as the remaining structural bottleneck. The k=3 and k=15 kernels are statistically indistinguishable on LRR (difference < 1 std dev), with k=3 offering lower compute and zero Copy degradation. We recommend that all linear attention evaluations report at least 5-seed statistics and treat training budget as a first-class experimental variable.
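The multi-seed reporting recommended above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code. The helper names (`summarize`, `within_one_std`, `relative_overestimate`) are hypothetical, and the per-seed lists are placeholder values rather than the paper's measurements. The abstract does not say how "difference < 1 std dev" is computed, so the sketch assumes it means the gap between means is smaller than the larger of the two sample standard deviations.

```python
from statistics import mean, stdev


def summarize(accs):
    """Return (mean, sample std dev, min, max) of per-seed accuracies."""
    return mean(accs), stdev(accs), min(accs), max(accs)


def within_one_std(a, b):
    """Assumed criterion: the gap between the two means is smaller than
    the larger of the two sample standard deviations."""
    return abs(mean(a) - mean(b)) < max(stdev(a), stdev(b))


def relative_overestimate(single_seed, multi_seed_mean):
    """Fraction by which a single-seed result exceeds the multi-seed mean."""
    return (single_seed - multi_seed_mean) / multi_seed_mean


# Placeholder per-seed accuracies in percent (NOT the paper's data).
k3_acc = [10.0, 12.0, 14.0, 11.0, 13.0]
k15_acc = [11.0, 13.0, 12.0, 10.0, 15.0]

m, s, lo, hi = summarize(k3_acc)
print(f"k=3: mean={m:.1f} std={s:.2f} range=[{lo:.1f}, {hi:.1f}]")
print("indistinguishable under the assumed criterion:",
      within_one_std(k3_acc, k15_acc))

# The abstract's reported LRR figures: 16.8% single seed vs 11.6% mean.
print(f"relative overestimate: {relative_overestimate(16.8, 11.6):.3f}")
```

Reporting the minimum and maximum next to the mean and standard deviation exposes swings of the kind described for Induction Head, which a mean alone would hide.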
Yahua Ruan
International Game Technology (Germany)
Yahua Ruan (Thu,) studied this question.
synapsesocial.com/papers/6a23bbeb71a5da9775e775b0 — DOI: https://doi.org/10.5281/zenodo.20540451