March 22, 2026 | Open Access

Attention Distribution-Aware Softmax for NPU-Accelerated On-Device Inference of LLMs: An Edge-Oriented Approximation Design

Key Points

The central aim is to improve the efficiency of Transformer softmax in low-power neural processing units (NPUs) for on-device inference of large language models (LLMs).
Proposed an attention distribution-aware softmax that uses particle swarm optimization (PSO) to define non-uniform segments and variable polynomial degrees.
Snapped segment boundaries to a 128-bin lookup table (LUT) for O(1), branch-free retrieval of segment parameters.
Prioritized finer granularity and lower arithmetic complexity in attention-dense regions.
Reduced exp-kernel cycles per call (CPC) by 18.5% compared to an equidistant uniform Degree-4 baseline.
Reduced exp-kernel CPC by 13.1% compared to a uniform Degree-3 setup.
Largely preserved attention ranking fidelity while achieving these reductions.

Abstract

Low-power NPUs enable on-device LLM inference through efficient integer and fixed-point arithmetic, yet their lack of native exponential support makes Transformer softmax a critical performance bottleneck. Existing NPU kernels approximate the exponential with uniform piecewise polynomials to enable O(1) SIMD indexing, but this wastes computation by applying high-degree arithmetic indiscriminately in every segment. Conversely, fully adaptive approaches maximize statistical fidelity but introduce pipeline stalls due to comparator-based boundary search. To bridge this gap, we propose an attention distribution-aware softmax that uses Particle Swarm Optimization (PSO) to define non-uniform segments and variable polynomial degrees, prioritizing finer granularity and lower arithmetic complexity in attention-dense regions. To keep lookup efficient, we snap the segment boundaries onto a 128-bin LUT, enabling O(1) retrieval of segment parameters without branching. Inference measurements show that this design favors low-degree execution, minimizing exp-kernel overhead. Using TinyLlama-1.1B-Chat as a testbed, the proposed weighted design reduces exp-kernel cycles per call (CPC) by 18.5% versus an equidistant uniform Degree-4 baseline and by 13.1% versus uniform Degree-3. These results show that grid-snapped, variable-degree approximation can improve softmax efficiency while largely preserving attention ranking fidelity, enabling accurate edge LLM inference.
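The grid-snapping idea described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's kernel. The input range `[-8, 0]`, the raw boundaries, and the per-segment degrees are hand-picked placeholders standing in for PSO output. Coefficients come from a Taylor expansion around each segment center rather than an optimized fit, and floating point replaces the fixed-point arithmetic an NPU would use. All names (`RAW_BOUNDS`, `DEGREES`, `exp_approx`, `softmax_approx`) are hypothetical.

```python
import math

# Assumed input range for exp after max-subtraction (illustrative only).
X_MIN, X_MAX = -8.0, 0.0
NUM_BINS = 128
BIN_WIDTH = (X_MAX - X_MIN) / NUM_BINS      # 0.0625
INV_BIN_WIDTH = 1.0 / BIN_WIDTH

# Placeholder non-uniform boundaries and degrees (NOT PSO results).
RAW_BOUNDS = [-5.03, -2.98, -1.52, -0.71]
DEGREES = [3, 2, 3, 2, 3]

def snap(boundary):
    """Snap a raw boundary to the nearest bin edge of the 128-bin grid."""
    return round((boundary - X_MIN) / BIN_WIDTH)

# Segment edges expressed as bin indices.
EDGES = [0] + [snap(b) for b in RAW_BOUNDS] + [NUM_BINS]

LUT = []      # bin index -> segment id
PARAMS = []   # segment id -> (center, coefficients in powers of x - center)
for seg, (lo, hi) in enumerate(zip(EDGES, EDGES[1:])):
    LUT.extend([seg] * (hi - lo))
    center = X_MIN + 0.5 * (lo + hi) * BIN_WIDTH
    # Taylor coefficients of exp around the segment center, up to DEGREES[seg].
    coeffs = [math.exp(center) / math.factorial(k)
              for k in range(DEGREES[seg] + 1)]
    PARAMS.append((center, coeffs))

def exp_approx(x):
    """Approximate exp(x); inputs outside [X_MIN, X_MAX] are clamped."""
    x = min(max(x, X_MIN), X_MAX)
    # O(1) lookup: one multiply and one table read, no boundary search.
    b = min(int((x - X_MIN) * INV_BIN_WIDTH), NUM_BINS - 1)
    center, coeffs = PARAMS[LUT[b]]
    t = x - center
    acc = 0.0
    for c in reversed(coeffs):   # Horner evaluation, length varies by segment
        acc = acc * t + c
    return acc

def softmax_approx(scores):
    """Softmax using the piecewise approximation in place of exp."""
    m = max(scores)
    e = [exp_approx(s - m) for s in scores]
    total = sum(e)
    return [v / total for v in e]
```

Because every boundary lies on a bin edge, the segment is found by computing a bin index and reading the table, so no comparator chain is needed. The Horner loop runs only as many steps as the selected segment's degree, which is where a variable-degree design would save cycles relative to a uniform high-degree one.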
