Sampled Dense-Dense Matrix Multiplication (SDDMM) is a fundamental operation in sparse linear algebra, widely used in graph neural networks and scientific computing. However, accelerating SDDMM on GPUs is challenging because data sparsity and irregular memory access hinder efficient use of Tensor Cores. This paper introduces a Block-Structured Matrix Reordering framework that improves Tensor Core utilization by reorganizing sparse matrices through bi-directional reordering with weighted similarity metrics. We also propose a tile-aware sparse matrix format that improves memory access and task scheduling. To enable adaptive and balanced computation, we employ a dual-path execution strategy: dense matrix blocks are assigned to Tensor Cores, while sparse blocks are handled by CUDA Cores. Experiments on the RTX 4090 demonstrate that our method achieves speedups of up to 10.38× over the best Tensor Core baseline and 7.31× over the best CUDA Core baseline by producing denser block structures and enhancing parallelism.
Zou et al. studied this question.
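To make the operation and the dual-path idea in the abstract concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The function names `sddmm` and `split_tiles`, the tile size, and the density threshold are all hypothetical choices made for illustration, and the sketch omits the reordering step entirely. It shows only (1) what SDDMM computes, namely a dense product evaluated at the nonzero positions of a sparse matrix, and (2) how tiles might be routed by density to a "dense" path (Tensor Cores in the paper) or a "sparse" path (CUDA Cores in the paper).

```python
# Illustrative sketch only: tile size, density threshold, and all names are
# assumptions for this example, not taken from the paper.

def sddmm(mask, a, b):
    """Reference SDDMM: out[i, j] = S[i, j] * dot(a[i], b[j]) at nonzeros of S.

    mask: dict mapping (row, col) -> nonzero value of the sparse matrix S
    a:    list of M rows, each of length K
    b:    list of N rows, each of length K (B stored transposed)
    """
    out = {}
    for (i, j), s in mask.items():
        # Only sampled positions are computed; all other entries stay zero.
        out[(i, j)] = s * sum(x * y for x, y in zip(a[i], b[j]))
    return out


def split_tiles(mask, tile=2, threshold=0.5):
    """Count nonzeros per tile and route each non-empty tile by density.

    Returns (dense_tiles, sparse_tiles) as sorted lists of tile coordinates.
    """
    counts = {}
    for (i, j) in mask:
        key = (i // tile, j // tile)
        counts[key] = counts.get(key, 0) + 1
    dense, sparse = [], []
    for key, nnz in sorted(counts.items()):
        density = nnz / (tile * tile)
        (dense if density >= threshold else sparse).append(key)
    return dense, sparse


if __name__ == "__main__":
    # A 4x4 sparse pattern: one fully populated 2x2 tile plus one stray nonzero.
    mask = {(0, 0): 1.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 1.0, (2, 3): 2.0}
    a = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0], [1.0, 1.0]]
    b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
    print(sddmm(mask, a, b))
    print(split_tiles(mask))
```

In this toy input, the fully populated tile lands on the dense path and the tile holding the single stray nonzero lands on the sparse path. A reordering step such as the one the abstract describes would aim to increase the share of tiles that qualify for the dense path.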