Large language models (LLMs) have been the driving force behind significant advances in artificial intelligence. However, their self-attention mechanism makes inference difficult to accelerate. Softmax, with its complex nonlinear operations and low parallelism, significantly limits the efficiency of LLMs on long sequences. This work proposes a novel high-parallelism hardware/software co-designed Softmax solution. By incorporating a sum estimation algorithm, we eliminate the need to compute a costly exponential for every element. We introduce a high-speed, low-energy hardware architecture built from a high-parallelism statistical module and a simplified division module. Experimental results demonstrate that our approach achieves a 9.90% to 44.75% reduction in latency and up to a 17.36% reduction in energy consumption compared to state-of-the-art Softmax hardware, with minimal impact on model inference accuracy.
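To make the cost structure concrete, the sketch below contrasts a reference Softmax (one exponential per element) with one possible way of estimating the denominator from simple statistics. The histogram scheme, the function names, and the parameters `bin_width` and `num_bins` are illustrative assumptions; the abstract does not describe the paper's actual sum estimation algorithm, so this should not be read as the authors' method.

```python
import math


def softmax_exact(xs):
    """Reference softmax: one exponential per element, then a sum and a division."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def estimate_exp_sum(xs, bin_width=0.25, num_bins=64):
    """Estimate sum(exp(x - max(xs))) from a histogram.

    Hypothetical scheme (not taken from the paper): count how many inputs
    fall into each fixed-width bin below the maximum, then evaluate one
    exponential per non-empty bin instead of one per element.
    """
    m = max(xs)
    counts = [0] * num_bins
    for x in xs:
        idx = int((m - x) / bin_width)  # distance below the maximum, in bins
        if idx < num_bins:              # inputs further away are dropped as negligible
            counts[idx] += 1
    # Bin i covers distances [i * w, (i + 1) * w); its midpoint represents the bin.
    return sum(
        c * math.exp(-(i + 0.5) * bin_width)
        for i, c in enumerate(counts)
        if c
    )
```

In this sketch the counting loop is the only per-element work, and each count is independent of the others, which is the kind of operation that parallelizes well. The price is a bounded approximation error: with bin width `w`, each term is off by at most a factor of `exp(w / 2)`.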
Chenyi Wen
Haonan Du
Xuyang He
ACM Transactions on Design Automation of Electronic Systems
Zhejiang University
Wen et al. studied this question.
www.synapsesocial.com/papers/69b4add218185d8a39801be6 — DOI: https://doi.org/10.1145/3801553