Abstract

This paper introduces spectral attention, which filters the attention score matrix directly in the frequency domain via FFT/IFFT with learnable, per-head masks. This complements the time-domain view by enabling explicit control over low-, mid-, and high-frequency components of attention patterns. We study nine variants, including an adaptive mechanism that modulates masks from input content. On WikiText-2, Penn Treebank, and WikiText-103, the adaptive spectral variant consistently improves over standard attention, reducing perplexity by 10.7% on WikiText-2 and 15.3% on WikiText-103 in our setup. Analysis shows low-frequency components carry the most useful signal and that learned frequency preferences outperform fixed low/high/band-pass filters. These results indicate that frequency-domain processing is an effective complement for autoregressive transformer language modeling in our evaluated settings.
Huang et al. (Tue,) studied this question.
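
To make the core idea of the abstract concrete, the following is a minimal NumPy sketch of filtering attention scores in the frequency domain with a per-head mask. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not say along which axis the FFT is taken, whether filtering happens before or after the softmax, or how the masks are parameterized. Here we assume a real FFT along the key axis of the pre-softmax scores, a sigmoid-gated mask per head, and zeroing of future positions before the transform so that no future information leaks into earlier rows. The names `spectral_attention` and `mask_logits` are hypothetical.

```python
import numpy as np

def spectral_attention(q, k, v, mask_logits):
    """Causal attention with a per-head frequency mask on the scores.

    q, k, v:      arrays of shape (heads, T, d)
    mask_logits:  array of shape (heads, T // 2 + 1); one gate per rFFT bin
                  (assumed parameterization, not taken from the paper)
    Returns (output, weights) with shapes (heads, T, d) and (heads, T, T).
    """
    heads, T, d = q.shape
    causal = np.tril(np.ones((T, T), dtype=bool))

    # Scaled dot-product scores; future keys are zeroed before the FFT
    # so each filtered row depends only on keys at or before the query.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    scores = np.where(causal, scores, 0.0)

    # Frequency-domain filtering along the key axis.
    spectrum = np.fft.rfft(scores, axis=-1)            # (heads, T, T//2+1)
    gate = 1.0 / (1.0 + np.exp(-mask_logits))          # sigmoid, in (0, 1)
    filtered = np.fft.irfft(spectrum * gate[:, None, :], n=T, axis=-1)

    # Re-apply the causal mask, then a numerically stable softmax.
    filtered = np.where(causal, filtered, -np.inf)
    filtered = filtered - filtered.max(axis=-1, keepdims=True)
    weights = np.exp(filtered)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Example call with random inputs (shapes only; values are arbitrary).
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(2, 8, 4)) for _ in range(3))
out, weights = spectral_attention(q, k, v, rng.normal(size=(2, 8 // 2 + 1)))
```

In this sketch, large positive `mask_logits` drive every gate toward 1, in which case the function reduces to standard causal attention; gates that are small at high-frequency bins act as a low-pass filter on each row of scores. An adaptive variant in the spirit of the abstract would compute `mask_logits` from the input rather than treating them as free parameters, but how the paper does so is not specified here.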