What question did this study set out to answer?

The study asks how closely a double-precision matrix multiplication can approach the practical performance limits of modern x86 processors through software-level optimization alone.

June 18, 2026

High-Performance DGEMM Implementation on Modern X86 CPUs

Key Points

Aimed to approach the practical performance limits of double-precision matrix multiplication through software-level optimization.
Developed a high-performance double-precision general matrix multiplication (DGEMM) implementation for modern x86 processors.
Implemented a 6 × 8 micro kernel using vector registers and fused multiply-add operations.
Used a two-level cache blocking strategy to increase data reuse and reduce memory traffic, with manual distribution of work among cores to improve scalability.
Consistently outperformed the NumPy/OpenBLAS backend and reached a substantial fraction of the throughput of highly optimized libraries.
Evaluated the implementation on two contemporary x86 processors with different core counts and cache configurations.
Added benchmarks and a performance model that explains the observed behavior in terms of arithmetic throughput and memory bandwidth.
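
The micro kernel and cache blocking points can be illustrated with a minimal C sketch. It is not the study's code: only the 6 × 8 tile shape comes from the text, while the block sizes `MC` and `KC`, the row-major layout, and the function names are assumptions made for illustration. The kernel is written in portable scalar C; a tuned version would presumably use FMA intrinsics, packed operands, and threads.

```c
#include <assert.h>
#include <stddef.h>

enum { MR = 6, NR = 8 };      /* micro-tile shape taken from the text */
enum { MC = 96, KC = 256 };   /* hypothetical block sizes, not from the source */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Update an mr x nr tile of C (mr <= MR, nr <= NR) with kc rank-1 updates.
   The accumulators sit in a small local array that a compiler can keep in
   vector registers and update with fused multiply-add instructions. */
static void micro_kernel(size_t mr, size_t nr, size_t kc,
                         const double *a, size_t lda,
                         const double *b, size_t ldb,
                         double *c, size_t ldc)
{
    assert(mr <= MR && nr <= NR);
    double acc[MR][NR] = {{0.0}};
    for (size_t p = 0; p < kc; ++p)
        for (size_t i = 0; i < mr; ++i)
            for (size_t j = 0; j < nr; ++j)
                acc[i][j] += a[i * lda + p] * b[p * ldb + j];
    for (size_t i = 0; i < mr; ++i)
        for (size_t j = 0; j < nr; ++j)
            c[i * ldc + j] += acc[i][j];
}

/* C (m x n) += A (m x k) * B (k x n), all row-major and densely stored. */
void dgemm_blocked(size_t m, size_t n, size_t k,
                   const double *A, const double *B, double *C)
{
    for (size_t pc = 0; pc < k; pc += KC) {          /* first blocking level: k */
        size_t kc = min_sz(KC, k - pc);
        for (size_t ic = 0; ic < m; ic += MC) {      /* second blocking level: m */
            size_t mc = min_sz(MC, m - ic);
            for (size_t jr = 0; jr < n; jr += NR) {  /* sweep 6 x 8 tiles */
                size_t nr = min_sz(NR, n - jr);
                for (size_t ir = 0; ir < mc; ir += MR) {
                    size_t mr = min_sz(MR, mc - ir);
                    micro_kernel(mr, nr, kc,
                                 &A[(ic + ir) * k + pc], k,
                                 &B[pc * n + jr], n,
                                 &C[(ic + ir) * n + jr], n);
                }
            }
        }
    }
}
```

Calling `dgemm_blocked(m, n, k, A, B, C)` accumulates the product into `C`, so `C` should be zeroed first when a plain product is wanted. A 6 × 8 tile of doubles needs 48 accumulators, which corresponds to 12 four-wide or 6 eight-wide vector registers; this is one plausible reason for the tile shape, though the text does not state it.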

Abstract

Efficient matrix multiplication remains one of the most important computational tasks in scientific computing, engineering simulations and data analysis. This work presents a high-performance implementation of double-precision general matrix multiplication (DGEMM) designed for modern x86 processors. The main objective is to approach the practical performance limits achievable through software-level optimization by exploiting register-level parallelism, cache hierarchy characteristics and thread-level parallel execution. The proposed method is built around an efficiently designed 6 × 8 micro kernel that uses vector registers and fused multiply-add operations. A two-level cache blocking strategy increases data reuse and reduces memory traffic. The implementation also employs parallel processing with manual control of work distribution among processor cores to improve scalability on systems with different numbers of cores and cache sizes. The study evaluates performance on two contemporary x86 processors with different core counts and cache configurations and compares the implementation to widely used numerical libraries. Results show that the presented approach achieves solid sustained performance and consistently outperforms the NumPy/OpenBLAS backend, while reaching a substantial fraction of the throughput provided by highly optimized libraries. Additional experiments include benchmarking and a performance model that explains the observed behavior in terms of arithmetic throughput and memory bandwidth. The work demonstrates that a combination of vector-optimized micro kernels, cache-aware blocking and multithreading can provide a portable and efficient solution for double-precision matrix multiplication on current x86 architectures.
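
A performance model phrased in terms of arithmetic throughput and memory bandwidth can be sketched as a simple roofline-style bound. This is an assumption about the general form of such a model, not the one used in the study; the function names, the hardware parameters, and the idealized traffic estimate (each matrix crosses the memory bus once, with C both read and written) are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Peak double-precision rate in GFLOP/s.
   lanes = doubles per vector register; each FMA counts as two flops. */
double peak_gflops(int cores, double ghz, int fma_units, int lanes)
{
    return cores * ghz * fma_units * lanes * 2.0;
}

/* Arithmetic intensity (flops per byte) of C += A*B for n x n matrices,
   under the idealized assumption that A and B are read once and C is
   read once and written once. */
double gemm_intensity(size_t n)
{
    assert(n > 0);
    double flops = 2.0 * n * n * n;
    double bytes = 4.0 * n * n * sizeof(double);
    return flops / bytes;
}

/* Attainable rate: the smaller of the compute bound and the memory bound. */
double roofline_gflops(double peak, double bw_gbs, double intensity)
{
    double mem_bound = bw_gbs * intensity;
    return mem_bound < peak ? mem_bound : peak;
}
```

For instance, a hypothetical 8-core, 3.0 GHz processor with two 4-wide FMA units per core gives `peak_gflops(8, 3.0, 2, 4)` = 384 GFLOP/s. Under the idealized traffic estimate, the intensity of a square multiplication is n/16 flops per byte, so large matrices are compute-bound in this model and the memory bound matters mainly when blocking fails to deliver that level of reuse.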
