Large language models (LLMs) have been the driving force behind significant advancements in artificial intelligence. However, their self-attention mechanism makes inference difficult to accelerate. Softmax, with its complex nonlinear operations and low parallelism, significantly limits the efficiency of LLMs on long sequences. This work proposes a novel high-parallelism hardware/software co-design solution for Softmax. By incorporating a sum estimation algorithm, we eliminate the need to compute a complex exponential for every element. We introduce a high-speed, low-energy hardware architecture built on a high-parallelism statistical module and a simplified division module. Experimental results demonstrate that our approach achieves a 9.90% to 44.75% reduction in latency and up to a 17.36% reduction in energy consumption compared to state-of-the-art Softmax hardware, with minimal impact on model inference accuracy.
Wen et al. (Thu,) studied this question.
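To make the idea of sum estimation concrete, the following is a minimal software sketch, not the algorithm or hardware proposed in this work. It assumes one plausible scheme: build a histogram of each input's distance below the maximum, then estimate the Softmax denominator with one exponential per histogram bin instead of one per element. All names (`estimate_exp_sum`, `softmax_binned`) and the parameters `num_bins` and `span` are hypothetical choices made for illustration.

```python
import math


def softmax_exact(x):
    """Reference softmax: one exponential per input element."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def _bin_counts(x, num_bins, span):
    """Histogram of distances below the maximum (assumed 'statistics' step)."""
    m = max(x)
    width = span / num_bins
    bins = []
    counts = [0] * num_bins
    for v in x:
        k = int((m - v) / width)
        if k < num_bins:
            counts[k] += 1
            bins.append(k)
        else:
            # More than `span` below the maximum: contribution is negligible.
            bins.append(None)
    return bins, counts, width


def estimate_exp_sum(x, num_bins=64, span=16.0):
    """Estimate sum(exp(x_i - max(x))) using num_bins exponentials."""
    _, counts, width = _bin_counts(x, num_bins, span)
    # Each bin is represented by the exponential at its centre.
    return sum(c * math.exp(-(k + 0.5) * width) for k, c in enumerate(counts))


def softmax_binned(x, num_bins=64, span=16.0):
    """Approximate softmax that reuses the per-bin exponentials."""
    bins, counts, width = _bin_counts(x, num_bins, span)
    table = [math.exp(-(k + 0.5) * width) for k in range(num_bins)]
    total = sum(c * t for c, t in zip(counts, table))
    return [table[k] / total if k is not None else 0.0 for k in bins]


if __name__ == "__main__":
    x = [3.0 * math.sin(i) for i in range(200)]
    exact_sum = sum(math.exp(v - max(x)) for v in x)
    print("exact sum:    ", exact_sum)
    print("estimated sum:", estimate_exp_sum(x))
```

In this sketch the number of exponential evaluations depends on `num_bins` rather than on the sequence length, and the histogram counts can be accumulated independently per element, which is the kind of property a parallel statistical module could exploit. The accuracy and cost trade-offs of the actual design are those reported in the experiments, not those of this illustration. The input is assumed to be non-empty.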