What question did this study set out to answer?

The aim is to enhance the performance of GEMM operations for small and irregular matrices on CPUs.

April 7, 2026 · Open Access

AGP-GEMM: Adaptive Grouping and Partitioning Framework for Accelerating Small and Irregular Matrices on CPUs

Key Points

Aimed to improve the performance of GEMM operations on small and irregular matrices on CPUs.
Proposed a core grouping mechanism to balance workloads across the cores of a multi-core CPU.
Developed an adaptive block partitioning algorithm that selects a tiling scheme based on the matrix dimensions.
Conducted experiments on the Kunpeng CPU platform to evaluate performance.
Achieved a peak speedup of 2.1× over the Kunpeng KML math library.
Attained an average speedup of 1.64× over the Kunpeng KML math library.
Demonstrated effectiveness in managing computational tasks with small and irregular matrices.

Abstract

General Matrix Multiplication (GEMM) is a fundamental computational kernel in scientific computing, serving as the foundation for numerous complex tasks. In practice, however, GEMM performance is often constrained by irregular matrix dimensions and the diversity of hardware architectures. In particular, GEMM typically exhibits reduced computational efficiency when processing small and irregular matrices. To address these challenges, this paper proposes a GEMM acceleration method based on an adaptive core grouping strategy. The method consists of two key components: a core grouping mechanism that alleviates workload imbalance among the cores of a multi-core CPU, and an adaptive block partitioning algorithm that dynamically selects a tiling scheme according to the matrix dimensions, achieving both load balance and cache-friendly data access. Experimental results on the Kunpeng CPU platform show that the proposed method significantly outperforms the Kunpeng KML math library, with a peak speedup of 2.1× and an average speedup of 1.64×. These results validate the effectiveness and efficiency of the proposed approach for small and irregular matrix computations.
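The abstract does not spell out how the partitioning is chosen, so the following is only a minimal sketch of the general idea, not the paper's algorithm. It assumes the output matrix is split over a rectangular grid of cores, that each core's tile is padded up to a hypothetical `mr × nr` micro-kernel size, and that the cost of a grid is the area of its largest tile. The names `choose_grid`, `factor_pairs`, `mr`, and `nr` are illustrative and do not come from the paper or from KML.

```python
from math import ceil


def factor_pairs(n):
    """Return every (a, b) with a * b == n, i.e. every rectangular core grid."""
    return [(a, n // a) for a in range(1, n + 1) if n % a == 0]


def choose_grid(m, n, cores, mr=4, nr=4):
    """Pick a (grid_rows, grid_cols) core layout for an m x n output matrix.

    Assumed cost model (not taken from the paper): every core processes a
    tile padded to a multiple of the mr x nr micro-kernel, and the run time
    is governed by the area of that padded tile. Ties are broken in favour
    of the most square tile, a rough proxy for cache-friendly access.
    """
    best_key, best_grid, best_tile = None, None, None
    for grid_rows, grid_cols in factor_pairs(cores):
        # Distribute whole micro-tiles, then convert back to elements.
        tile_m = ceil(ceil(m / mr) / grid_rows) * mr
        tile_n = ceil(ceil(n / nr) / grid_cols) * nr
        key = (tile_m * tile_n, abs(tile_m - tile_n))
        if best_key is None or key < best_key:
            best_key, best_grid, best_tile = key, (grid_rows, grid_cols), (tile_m, tile_n)
    return best_grid, best_tile


if __name__ == "__main__":
    for shape in [(8, 512), (512, 8), (64, 64)]:
        grid, tile = choose_grid(*shape, cores=8 if shape != (64, 64) else 4)
        print(shape, "-> grid", grid, "tile", tile)
```

Under this model a short, wide 8 × 512 output on 8 cores is best served by a 1 × 8 grid, whereas a fixed 4 × 2 grid would leave each core with a padded tile twice as large. That gap is the kind of imbalance that a dimension-aware partitioning is meant to avoid.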
