What question did this study set out to answer?

This work aims to optimize the attention mechanism for large language models on many-core processors, specifically the MT-3000 architecture.

April 15, 2026 | Open Access

Optimizing Attention for Large Language Model Inference on the MT-3000 Many-Core Processor

Key Points

Identifies the challenges of implementing attention efficiently on many-core processors
Develops DeferAttention, built on a deferred-reduction strategy
Implements a memory-centric operator design, including data tiling and software pipelining
Integrates a kernel selection strategy guided by an analytical cost model
DeferAttention reaches up to 98% of theoretical peak at the micro-kernel level
DeferAttention reaches 85% of theoretical peak at the operator level
Outperforms baseline attention implementations
Significantly accelerates end-to-end inference for large language models

Abstract

Transformer-based large language models (LLMs) are increasingly deployed in high-performance computing environments, where the attention mechanism often becomes a key bottleneck during inference. Although state-of-the-art attention algorithms (e.g., FlashAttention) achieve high efficiency on GPUs, they are ill-suited to emerging heterogeneous many-core processors. In this work, we focus on MT-3000, a representative architecture deployed in the new-generation Tianhe supercomputer, and identify three principal challenges in realizing high-performance attention: a complex multi-tier memory hierarchy that requires manual data movement, excessive reduction overhead caused by sub-tile softmax operations, and static execution pipelines that fail to adapt to inference phases and sequence lengths. To overcome these challenges, we propose DeferAttention, a high-performance attention implementation designed for the MT-3000 many-core processor. DeferAttention introduces a novel deferred-reduction attention strategy that decouples reduction from the fused compute pipeline, enabling more efficient aggregation over large tiles. Moreover, DeferAttention adopts a memory-centric operator design, including data tiling, multi-level software pipelining, and modular micro-kernels, to maximize data reuse and execution throughput. Finally, to support runtime-adaptive execution, DeferAttention integrates a lightweight kernel selection strategy guided by an analytical cost model. Experimental results show that DeferAttention achieves up to 98% of the theoretical peak at the micro-kernel level and 85% at the operator level, outperforming baseline implementations and significantly accelerating end-to-end inference.
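
The reduction overhead described above can be illustrated with a minimal sketch. This is not the DeferAttention algorithm, whose details the abstract does not give. It is a generic tiled softmax for a single query row with scalar values, where the hypothetical function `tiled_attention` performs one max/sum reduction and one rescale per tile. Under those assumptions, a larger tile gives the same result with fewer reduction steps, which is the general effect that aggregating over large tiles is meant to exploit.

```python
import math

def reference_attention(scores, values):
    """Untiled softmax-weighted sum for one query row (scalar values)."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def tiled_attention(scores, values, tile):
    """Hypothetical sketch: softmax attention computed tile by tile.

    Each tile triggers one max/sum reduction and one rescale of the
    running accumulators. Returns (output, number_of_reduction_steps).
    """
    running_max = float("-inf")
    running_sum = 0.0
    acc = 0.0
    reductions = 0
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        # Rescale earlier partial results to the new maximum;
        # on the first tile exp(-inf) evaluates to 0.0.
        scale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in s_tile]
        running_sum = running_sum * scale + sum(weights)
        acc = acc * scale + sum(w * v for w, v in zip(weights, v_tile))
        running_max = new_max
        reductions += 1
    return acc / running_sum, reductions

# Illustrative inputs only; they are not taken from the paper.
scores = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 3.0, 0.25]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

expected = reference_attention(scores, values)
fine_out, fine_steps = tiled_attention(scores, values, tile=2)      # 4 reduction steps
coarse_out, coarse_steps = tiled_attention(scores, values, tile=8)  # 1 reduction step
```

Both tile sizes reproduce the untiled result up to floating-point rounding. Only the number of reduction and rescale steps differs, and that is the cost the sub-tile softmax incurs on hardware where reductions are expensive.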
