Transformer-based large language models (LLMs) are increasingly deployed in high-performance computing environments, where the attention mechanism often becomes a key bottleneck during inference. Although state-of-the-art attention algorithms (e.g., FlashAttention) achieve high efficiency on GPUs, they are ill-suited to emerging heterogeneous many-core processors. In this work, we focus on MT-3000, a representative architecture deployed in the new-generation Tianhe supercomputer, and identify three principal challenges in realizing high-performance attention: a complex multi-tier memory hierarchy that requires manual data movement, excessive reduction overhead caused by sub-tile softmax operations, and static execution pipelines that fail to adapt to inference phases and sequence lengths. To overcome these challenges, we propose DeferAttention, a high-performance attention implementation designed for the MT-3000 many-core processor. DeferAttention introduces a novel deferred-reduction attention strategy that decouples reduction from the fused compute pipeline, enabling more efficient aggregation over large tiles. Moreover, DeferAttention adopts a memory-centric operator design, including data tiling, multi-level software pipelining, and modular micro-kernels, to maximize data reuse and execution throughput. Finally, to support runtime-adaptive execution, DeferAttention integrates a lightweight kernel selection strategy guided by an analytical cost model. Experimental results show that DeferAttention achieves up to 98% of the theoretical peak at the micro-kernel level and 85% at the operator level, outperforming baseline implementations and significantly accelerating end-to-end inference.
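The abstract does not spell out the deferred-reduction algorithm, so the following is only a minimal sketch of one plausible reading of it, not the authors' MT-3000 kernel. It computes attention for a single query row in pure Python. With `group=1` the softmax max/sum reduction runs after every sub-tile, as in FlashAttention-style online softmax; with `group>1` the scores of several sub-tiles are buffered and reduced once over the larger tile. The names `attention_row`, `sub_tile`, and `group` are hypothetical and chosen for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_row(q, keys, values, sub_tile, group=1):
    """Softmax attention for one query row.

    Scores are produced sub-tile by sub-tile, but the max/sum reduction
    runs only once per `group` sub-tiles (hypothetical parameter).
    Returns (output_row, number_of_reduction_steps).
    """
    m = -math.inf                  # running maximum of the scores
    l = 0.0                        # running sum of exp(score - m)
    acc = [0.0] * len(values[0])   # unnormalised output accumulator
    reductions = 0
    large = sub_tile * group       # keys covered by one reduction step

    for start in range(0, len(keys), large):
        stop = min(start + large, len(keys))

        # Stage 1: score computation only, with no softmax bookkeeping.
        scores = []
        for s in range(start, stop, sub_tile):
            scores.extend(dot(q, k) for k in keys[s:min(s + sub_tile, stop)])

        # Stage 2: a single reduction over the whole buffered tile.
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)    # exp(-inf) is 0.0 on the first tile
        p = [math.exp(s - m_new) for s in scores]
        l = l * scale + sum(p)
        tile_values = values[start:stop]
        acc = [
            a * scale + sum(pi * v[d] for pi, v in zip(p, tile_values))
            for d, a in enumerate(acc)
        ]
        m = m_new
        reductions += 1

    return [a / l for a in acc], reductions


def reference_row(q, keys, values):
    """Plain softmax attention over all keys at once, for checking."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    p = [math.exp(s - top) for s in scores]
    total = sum(p)
    return [
        sum(pi * v[d] for pi, v in zip(p, values)) / total
        for d in range(len(values[0]))
    ]


# Small deterministic inputs (illustrative only).
q = [0.5, -1.0, 0.25, 2.0]
keys = [[math.sin(i + j) for j in range(4)] for i in range(16)]
values = [[math.cos(i * j + 1) for j in range(4)] for i in range(16)]

per_subtile, n_fine = attention_row(q, keys, values, sub_tile=2, group=1)
deferred, n_deferred = attention_row(q, keys, values, sub_tile=2, group=4)
```

On these 16 keys, both calls produce the same output row up to floating-point rounding, while the per-sub-tile variant performs 8 reduction steps and the deferred variant performs 2. The sketch only illustrates the trade-off in how often the reduction runs; it says nothing about the memory tiers, pipelining, or micro-kernels of the actual implementation.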
Xinxin Qi
Jianbin Fang
Peng Zhang
ACM Transactions on Architecture and Code Optimization
National University of Defense Technology
Qi et al. studied this question.
www.synapsesocial.com/papers/69df2abce4eeef8a2a6afbfa — DOI: https://doi.org/10.1145/3807449