Transformer networks have outperformed recurrent neural networks and convolutional neural networks in various sequential tasks. However, scaling transformer networks to long sequences has been challenging because of memory and compute bottlenecks. Transformer networks are impeded by memory bandwidth limitations because of their low operations-per-byte ratio, which results in low utilization of a GPU's computing resources. In-memory processing can mitigate memory bottlenecks by eliminating the transfer time between memory and compute units. Furthermore, transformer networks use neural attention mechanisms to characterize the relationships between sequence elements. Hardware solutions have been proposed to implement attention mechanisms efficiently, including ternary content addressable memories (TCAM), crossbar arrays (XBars), and processing in-memory (PIM). However, these solutions do not implement a multi-head self-attention mechanism. We propose using a combination of XBars and CAMs to accelerate transformer networks. We improve the speed of transformer networks by (1) computing in-memory, thus minimizing the memory transfer overhead, (2) caching reusable parameters to reduce the number of operations, (3) exploiting the available parallelism in the attention mechanism, and (4) using locality-sensitive hashing to filter sequence elements by their importance. Our approach achieves a 200x speedup and 41x energy improvement for a sequence length of 4098.
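As an illustration of point (4), the following is a minimal software sketch of how locality-sensitive hashing might filter keys before attention is computed. It is not the authors' hardware design. The abstract does not name a hash family, so the random-hyperplane (sign) hash, the Hamming-distance threshold `max_distance`, and all function names here are assumptions chosen for illustration. Comparing short binary signatures by Hamming distance is the kind of match a content addressable memory could plausibly perform in parallel, which is why this variant was picked.

```python
import math
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (assumed hash family: sign of a random projection)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def signature(vec, planes):
    """One bit per hyperplane: 1 if vec lies on its non-negative side, else 0."""
    return tuple(1 if dot(p, vec) >= 0 else 0 for p in planes)


def hamming(a, b):
    """Number of positions where two signatures differ."""
    return sum(x != y for x, y in zip(a, b))


def lsh_filtered_attention(query, keys, values, planes, max_distance):
    """Scaled dot-product attention for one query, restricted to keys whose
    signature is within max_distance bits of the query's signature.

    Returns (output_vector, kept_indices).
    """
    q_sig = signature(query, planes)
    kept = [i for i, k in enumerate(keys)
            if hamming(q_sig, signature(k, planes)) <= max_distance]
    if not kept:
        # Hypothetical fallback: attend to every key if the filter rejects all of them.
        kept = list(range(len(keys)))

    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * dot(query, keys[i]) for i in kept]

    # Numerically stable softmax over the surviving keys only.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)

    out = [sum(w * values[i][d] for w, i in zip(weights, kept)) / total
           for d in range(len(values[0]))]
    return out, kept
```

In this sketch, a smaller `max_distance` discards more keys, so fewer dot products and softmax terms are evaluated, at the cost of possibly dropping a relevant element. Setting `max_distance` to the number of signature bits recovers ordinary attention over all keys.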
Ann Franchesca Laguna
De La Salle University
Arman Kazemi
University of Notre Dame
Michael Niemier
University of Notre Dame
Laguna et al. studied this question.
synapsesocial.com/papers/6a11e4ccdb195b84738e1144 — DOI: https://doi.org/10.23919/date51398.2021.9474146