General Matrix Multiplication (GEMM) is a fundamental operation in high-performance computing (HPC) and deep learning (DL) applications. Mainstream linear algebra libraries on CPUs, such as MKL and OpenBLAS, achieve high performance for individual, large-scale, and regular-shaped GEMM operations, but they are less effective for sequences of GEMM operations, which are commonly used in emerging HPC and DL applications. We present FlashGEMM, a novel and efficient approach for optimizing sequences of GEMM operations on x86 CPUs. FlashGEMM introduces a new data packing strategy that reduces the memory access overhead associated with packing operations. It also offers new micro-kernels designed to fully utilize the Vector Neural Network Instructions (VNNI) units of x86 CPUs, thereby increasing the compute-to-memory ratio (CMR). Additionally, FlashGEMM includes new loop fusion strategies that reuse intermediate data across consecutive GEMM operations. Experimental results demonstrate that FlashGEMM can outperform state-of-the-art libraries on most GEMM workloads on multi-core CPUs.
Zhang et al. studied this question.
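To make the idea of reusing intermediate data across consecutive GEMM operations concrete, the following is a minimal sketch of generic loop fusion for a chained product (A·B)·C. It is not FlashGEMM's actual fusion strategy, which the abstract does not detail. The function names `matmul`, `chained_unfused`, and `chained_fused`, and the `block_rows` parameter, are hypothetical and chosen only for illustration.

```python
def matmul(a, b):
    """Plain triple-loop GEMM on lists of lists: returns a @ b."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            a_ip = a[i][p]
            for j in range(m):
                out[i][j] += a_ip * b[p][j]
    return out


def chained_unfused(a, b, c):
    """Compute (a @ b) @ c, materializing the full intermediate a @ b."""
    t = matmul(a, b)  # whole intermediate is written out, then read back
    return matmul(t, c)


def chained_fused(a, b, c, block_rows=2):
    """Compute (a @ b) @ c one row block at a time (illustrative fusion).

    Only a slice of the intermediate with at most block_rows rows exists at
    any moment, so the second GEMM consumes it right after it is produced.
    """
    d = []
    for start in range(0, len(a), block_rows):
        a_blk = a[start:start + block_rows]
        t_blk = matmul(a_blk, b)    # small intermediate tile
        d.extend(matmul(t_blk, c))  # consumed immediately, then discarded
    return d
```

Both functions produce the same result. The fused variant differs only in how much of the intermediate matrix is live at once, which is the property a cache-aware implementation would try to exploit.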