The self-attention mechanism has revolutionized sequence modeling, but its computational cost grows quadratically with sequence length, which limits its applicability to long sequences. We propose Sparse Projection Attention (SPA), a novel attention variant that uses learnable sparse projections to reduce the effective dimensionality of queries and keys while maintaining expressive power. Our method is grounded in the Johnson–Lindenstrauss lemma and provides theoretical guarantees on distance preservation for its fixed random-projection variants. We develop a mathematical framework that includes error bounds, convergence analysis, and gradient dynamics. Experiments on language modeling, machine translation, and long-range sequence classification show that SPA achieves up to an 8× speedup in attention score computation and an approximately 2× end-to-end speedup, while remaining competitive with standard attention and other efficient variants. SPA thus offers an effective trade-off between computational efficiency and model expressivity for long-sequence tasks, making transformers more accessible in resource-constrained environments and real-time applications.
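To make the core idea concrete, the following is a minimal NumPy sketch of attention scores computed from projected queries and keys. It illustrates only a fixed random-projection variant, not the learnable projections of SPA itself. The function names `sparse_projection` and `projected_attention`, the `density` parameter, and the Achlioptas-style construction with entries in {-1, 0, +1} are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def sparse_projection(d, k, density=1/3, seed=0):
    """Hypothetical helper: a fixed sparse random projection of shape (d, k).

    Each entry is nonzero with probability `density` and carries a random sign.
    The scaling gives every entry variance 1/k, so squared norms and inner
    products are preserved in expectation (the Johnson-Lindenstrauss setting).
    """
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=(d, k))
    mask = rng.random((d, k)) < density
    return signs * mask / np.sqrt(k * density)

def projected_attention(Q, K, V, P):
    """Softmax attention with scores computed in the projected k-dim space."""
    Qp, Kp = Q @ P, K @ P                      # (n, k) each, with k < d
    # Qp @ Kp.T approximates Q @ K.T, so keep the usual 1/sqrt(d) scaling.
    scores = Qp @ Kp.T / np.sqrt(Q.shape[-1])  # (n, n)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Example usage with arbitrary sizes (assumed, not the paper's settings).
rng = np.random.default_rng(1)
n, d, k = 16, 64, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
P = sparse_projection(d, k)
out = projected_attention(Q, K, V, P)
```

In this sketch the score matrix costs on the order of n²k multiplications rather than n²d, which is where a saving in score computation would come from. The cost remains quadratic in the sequence length n.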
Alaoui et al. (Sat,) studied this question.