Key points are not available for this paper at this time.
Large Language Models (LLMs) have profoundly changed the world. Their self-attention mechanism is the key to the success of transformers in LLMs. However, the quadratic computational cost O (n²) to the length n input sequence is the notorious obstacle for further improvement and scalability in the longer context. In this work, we leverage the convolution-like structure of attention matrices to develop an efficient approximation method for attention computation using convolution matrices. We propose a conv basis system, "similar" to the rank basis, and show that any lower triangular (attention) matrix can always be decomposed as a sum of k structured convolution matrices in this basis system. We then design an algorithm to quickly decompose the attention matrix into k convolution matrices. Thanks to Fast Fourier Transforms (FFT), the attention inference can be computed in O (knd n) time, where d is the hidden dimension. In practice, we have d n, i. e. , d=3, 072 and n=1, 000, 000 for Gemma. Thus, when kd = n^o (1), our algorithm achieve almost linear time, i. e. , n^1+o (1). Furthermore, the attention training forward and backward gradient can be computed in n^1+o (1) as well. Our approach can avoid explicitly computing the n n attention matrix, which may largely alleviate the quadratic computational complexity. Furthermore, our algorithm works on any input matrices. This work provides a new paradigm for accelerating attention computation in transformers to enable their application to longer contexts.
Building similarity graph...
Analyzing shared references across papers
Loading...
Jiuxiang Gu
Yingyu Liang
Heshan Liu
Building similarity graph...
Analyzing shared references across papers
Loading...
Gu et al. (Wed,) studied this question.
www.synapsesocial.com/papers/68e6b14fb6db6435876333c2 — DOI: https://doi.org/10.48550/arxiv.2405.05219