What type of study is this?

This is a quantitative study.

September 17, 2025 | Open Access

FlashGEMM: Optimizing Sequences of Matrix Multiplication by Exploiting Data Reuse on CPUs

Key Points

FlashGEMM targets sequences of GEMM operations on multi-core x86 CPUs and can outperform state-of-the-art approaches on most of the GEMM workloads evaluated.
The innovative data packing strategy reduces memory access overhead, leading to improved execution efficiency during matrix multiplication tasks.
FlashGEMM employs micro-kernels that utilize Vector Neural Network Instructions, boosting the compute-to-memory ratio in CPU operations.
New loop fusion strategies help reuse intermediate data effectively across consecutive GEMM operations, maximizing computational efficiency.
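
To make the compute-to-memory ratio mentioned above concrete, here is a minimal sketch of how it could be estimated for a register-tiled micro-kernel. The cost model is an assumption made for illustration, not the definition used by FlashGEMM: each step along the shared dimension loads mr elements of A and nr elements of B and performs mr * nr multiply-adds, while the output tile stays in registers. The function name tile_cmr is hypothetical, and the model does not account for VNNI specifically.

```python
def tile_cmr(mr: int, nr: int) -> float:
    """Multiply-adds per element loaded for one k-step of an mr x nr tile.

    Assumed cost model (illustrative only): every k-step loads mr elements
    of A and nr elements of B, and updates all mr * nr accumulators, which
    are kept in registers and therefore cost no memory traffic.
    """
    return (mr * nr) / (mr + nr)


# Larger tiles reuse each loaded element more often, so the ratio grows.
for mr, nr in [(1, 1), (4, 4), (6, 16)]:
    print(f"{mr} x {nr} tile: CMR = {tile_cmr(mr, nr):.2f}")
```

Under this model, the ratio grows with the tile size, which is why micro-kernel design is usually limited by how many accumulators fit in the register file.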

Abstract

General Matrix Multiplication (GEMM) is a fundamental operation in high-performance computing (HPC) and deep learning (DL) applications. Mainstream linear algebra libraries on CPUs, such as MKL and OpenBLAS, achieve high performance for individual, large-scale, and regular-shaped GEMM operations, but they are not designed to optimize sequences of GEMM operations, which are commonly used in emerging HPC and DL applications. We present FlashGEMM, a novel and efficient approach for optimizing sequences of GEMM on x86 CPUs. FlashGEMM introduces a new data packing strategy that reduces the memory access overhead associated with packing operations. It also offers new micro-kernels designed to fully utilize the Vector Neural Network Instructions (VNNI) units of x86 CPUs, thereby increasing the compute-to-memory ratio (CMR). Additionally, FlashGEMM includes new loop fusion strategies to reuse intermediate data across consecutive GEMM operations. Experimental results demonstrate that FlashGEMM can outperform state-of-the-art approaches across most GEMM workloads on multi-core CPUs.
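
The idea of reusing intermediate data across consecutive GEMM operations can be illustrated with a small sketch. It computes (A x B) x C in two ways: once by materializing the full intermediate matrix, and once by fusing the two multiplications over row blocks so that only a small slice of the intermediate exists at a time. This is a generic illustration in pure Python, not FlashGEMM's actual fusion strategy, which the abstract does not detail. The function names and the block size mb are hypothetical.

```python
def matmul(X, Y):
    """Naive reference GEMM on lists of lists (rows of X times columns of Y)."""
    k, n = len(Y), len(Y[0])
    return [[sum(row[p] * Y[p][j] for p in range(k)) for j in range(n)]
            for row in X]


def chained_unfused(A, B, C):
    """Compute (A x B) x C by writing out the whole intermediate first."""
    T = matmul(A, B)  # full M x N intermediate
    return matmul(T, C)


def chained_fused(A, B, C, mb=2):
    """Compute (A x B) x C one row block at a time.

    Each block of the intermediate (at most mb rows) is consumed by the
    second multiplication right after it is produced, so it can stay in
    cache instead of making a round trip through main memory.
    """
    out = []
    for i0 in range(0, len(A), mb):
        T_blk = matmul(A[i0:i0 + mb], B)  # small slice of the intermediate
        out.extend(matmul(T_blk, C))      # reuse it immediately
    return out


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1], [2, 2, 2]]   # 5 x 3
B = [[1, 0, 2, 1], [0, 1, 1, 0], [3, 1, 0, 2]]                # 3 x 4
C = [[1, 2], [0, 1], [2, 0], [1, 1]]                          # 4 x 2
print(chained_fused(A, B, C) == chained_unfused(A, B, C))
```

Both variants perform the same arithmetic. The fused version only changes the order of the work, which is what makes the intermediate reusable while it is still close to the core.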
