What question did this study set out to answer?

This research aims to improve the efficiency of large language model (LLM) inference through a new Softmax design.

March 14, 2026

A High-Parallelism Softmax Hardware-Software Co-Design for Fast and Efficient LLM Inference

Key Points

Developed a high-parallelism hardware/software co-design for Softmax.
Used a sum estimation algorithm to avoid computing the exponential for every element.
Created a high-speed, low-energy hardware architecture built on a high-parallelism statistical module and a simplified division module.
Achieved a 9.90%-44.75% reduction in latency compared to existing Softmax hardware.
Reduced energy consumption by up to 17.36% with minimal impact on accuracy.

Abstract

Large language models (LLMs) have been the driving force behind significant advancements in artificial intelligence. However, their unique self-attention mechanism makes inference difficult to accelerate. Softmax, with its complex nonlinear operations and low parallelism, significantly limits the efficiency of LLMs for long sequences. This work proposes a novel high-parallelism hardware/software co-design Softmax solution. By incorporating a sum estimation algorithm, we eliminate the need for complex exponential operations on all elements. A high-speed, low-energy hardware architecture is introduced by applying a high-parallelism statistical module and a simplified division module. Experimental results demonstrate that our approach achieves a 9.90%-44.75% reduction in latency and up to a 17.36% reduction in energy consumption compared to state-of-the-art Softmax hardware, with minimal impact on model inference accuracy.
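
To make the idea of estimating the Softmax denominator concrete, here is a minimal Python sketch. It is not the algorithm from the paper, whose details are not given in the abstract. The function names `softmax_exact` and `softmax_sampled_sum` and the random-subset estimator are assumptions chosen purely for illustration. The sketch only shows where the full sum creates a dependency across all elements, and how an estimated sum could remove it.

```python
import math
import random


def softmax_exact(scores):
    """Reference softmax: one exp per element plus a full reduction for the sum."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)  # every output waits on this full reduction
    return [e / total for e in exps]


def softmax_sampled_sum(scores, sample_size, seed=0):
    """Hypothetical variant (NOT the paper's method): estimate the denominator.

    The sum is estimated from a random subset of elements and scaled up to
    the full length, so the denominator needs only `sample_size` exponentials.
    The numerators are still computed per element, but independently.
    """
    n = len(scores)
    m = max(scores)
    rng = random.Random(seed)
    idx = rng.sample(range(n), min(sample_size, n))
    partial = sum(math.exp(scores[i] - m) for i in idx)
    est_total = (n / len(idx)) * partial  # scale the partial sum to n elements
    return [math.exp(s - m) / est_total for s in scores]


if __name__ == "__main__":
    scores = [0.5, 2.0, -1.0, 3.5, 0.0, 1.25]
    print(softmax_exact(scores))
    print(softmax_sampled_sum(scores, sample_size=3))
```

Because every output is divided by the same estimated constant, the ranking of the scores is preserved even when the estimate is off. The outputs no longer sum exactly to 1, which is one way an approximation of this kind could trade a small amount of accuracy for less work.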

Cite This Study

Wen et al. studied this question.

synapsesocial.com/papers/69b4add218185d8a39801be6 https://doi.org/10.1145/3801553