Key points are not available for this paper at this time.
The self-attention mechanism is the performance bottleneck of Transformer-based language models, particularly for long sequences. Researchers have proposed using sparse attention to speed up the Transformer. However, sparse attention introduces significant random access overhead, limiting computational efficiency. To mitigate this issue, researchers attempt to improve data reuse by utilizing row/column locality. Unfortunately, we find that sparse attention does not naturally exhibit strong row/column locality, but instead has excellent diagonal locality. Thus, it is worthwhile to use diagonal compression (DIA) format. However, existing sparse matrix computation paradigms struggle to efficiently support DIA format in attention computation. To address this problem, we propose ASADI, a novel software-hardware co-designed sparse attention accelerator. In the soft-ware side, we propose a new sparse matrix computation paradigm that directly supports the DIA format in self-attention computation. In the hardware side, we present a novel sparse attention accelerator that efficiently implements our computation paradigm using highly parallel in-situ computing. We thoroughly evaluate ASADI across various models and datasets. Our experimental results demonstrate an average performance improvement of 18.6 × and energy savings of 2.9× compared to a PIM-based baseline.
Li et al. (Sat,) studied this question.
Synapse has enriched 3 closely related papers on similar clinical questions. Consider them for comparative context: