Abstract
Efficient matrix multiplication remains one of the most important computational tasks in scientific computing, engineering simulation and data analysis. This work presents a high-performance implementation of double-precision general matrix multiplication (DGEMM) for modern x86 processors. The main objective is to approach the practical performance limits achievable through software-level optimization by exploiting register-level parallelism, the cache hierarchy and thread-level parallelism. The proposed method is built around a 6 × 8 micro-kernel that uses vector registers and fused multiply-add operations. A two-level cache blocking strategy increases data reuse and reduces memory traffic. The implementation runs in parallel, with work distributed manually among processor cores to improve scalability on systems with different core counts and cache sizes. Performance is evaluated on two contemporary x86 processors with different core counts and cache configurations, and the implementation is compared with widely used numerical libraries. The results show that the presented approach achieves solid sustained performance and consistently outperforms the NumPy/OpenBLAS backend, while reaching a substantial fraction of the throughput of highly optimized libraries. Additional experiments include benchmarking and a performance model that explains the observed behavior in terms of arithmetic throughput and memory bandwidth. The work demonstrates that a combination of vector-optimized micro-kernels, cache-aware blocking and multithreading can provide a portable and efficient solution for double-precision matrix multiplication on current x86 architectures.
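The loop structure that the abstract describes (cache blocking around a 6 × 8 micro-kernel) can be illustrated with a minimal pure-Python sketch. It is not the author's implementation: the function names `micro_kernel` and `blocked_gemm` and the block sizes `mc` and `kc` are hypothetical, the two blocking levels chosen here (over the k and m dimensions) are an assumption, and the nested list `acc` only stands in for the vector registers that a real kernel would update with fused multiply-add instructions. Threading is omitted.

```python
MR, NR = 6, 8  # micro-tile shape taken from the abstract


def micro_kernel(A, B, C, i0, i_end, j0, n, p0, p1):
    """Accumulate A[i0:i0+MR, p0:p1] @ B[p0:p1, j0:j0+NR] into C.

    `acc` plays the role of the register tile; edge tiles are clipped
    to the current row block (i_end) and to the matrix width (n).
    """
    rows = range(i0, min(i0 + MR, i_end))
    cols = range(j0, min(j0 + NR, n))
    acc = [[0.0] * len(cols) for _ in rows]
    for p in range(p0, p1):
        b_row = B[p]
        for ii, i in enumerate(rows):
            a = A[i][p]
            for jj, j in enumerate(cols):
                acc[ii][jj] += a * b_row[j]  # one multiply-add per element
    # Write the finished tile back to C once per k-block.
    for ii, i in enumerate(rows):
        for jj, j in enumerate(cols):
            C[i][j] += acc[ii][jj]


def blocked_gemm(A, B, mc=48, kc=64):
    """Return C = A @ B using blocking over k (kc) and over m (mc).

    mc and kc are illustrative values, not tuned for any processor.
    """
    m, k, n = len(A), len(A[0]), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for p0 in range(0, k, kc):            # first blocking level: k dimension
        p1 = min(p0 + kc, k)
        for ic in range(0, m, mc):        # second blocking level: m dimension
            i_end = min(ic + mc, m)
            for i0 in range(ic, i_end, MR):
                for j0 in range(0, n, NR):
                    micro_kernel(A, B, C, i0, i_end, j0, n, p0, p1)
    return C
```

Calling `blocked_gemm(A, B)` on two lists of lists returns the product as a new list of lists. In a compiled implementation the same structure would typically also pack the blocks of A and B into contiguous buffers, which this sketch leaves out.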
A. A. Hovhannisyan (Wed,) studied this question.