What type of study is this?

This is an experimental study.

October 15, 2025 | Open Access

An End-to-End Framework for Compiling Dense and Sparse Matrix-Vector Multiplications for FPGA-HBM Acceleration

Key Points

The framework enables significant performance improvements for memory-intensive tasks like graph processing and machine learning.
The MATIO compiler detects 90% of MVM and MM kernels in real-world benchmarks collected from GitHub.
The VecMADS architecture achieves 1.5x higher throughput than the cuSPARSE library on GPU and 1.26x higher throughput than the hipBLAS library on dense benchmarks.
The system capitalizes on high-bandwidth memory to mitigate memory bottlenecks while handling both dense and sparse operations.

Abstract

The bandwidth improvement provided by high-bandwidth memory (HBM) and the capability of FPGAs to customize the processing and memory hierarchy result in a considerable performance increase for memory-intensive workloads such as graph processing, sorting, machine learning, and database analytics. Modern systems integrating 3D-stacked DRAM can be leveraged to realize the Near-Memory Computing (NMC) paradigm by offloading some computations to accelerators placed near the HBM. Matrix-vector multiplication (MVM) kernels, which are memory-bound, can benefit significantly from being executed on FPGA-HBM platforms. MVM kernels can be broadly categorized into two types: dense (General Matrix-Vector Multiplication, GEMV) and sparse (Sparse Matrix-Vector Multiplication, SpMV). Recent literature has predominantly focused on optimizing SpMV for FPGA-HBM, leaving a unified solution relatively unexplored. In this work, we introduce an end-to-end framework for compiling MVM kernels for FPGA-HBM acceleration. It consists of a software component and a hardware component. The software component introduces the MATIO compiler, a novel toolflow that detects MVM and matrix multiplication (MM) kernels in C or C++ code and replaces MVM kernels with calls to our FPGA accelerator. MATIO detects 90% of MVM and MM kernels in real-world benchmarks collected from GitHub. Additionally, it is at least 45x faster than state-of-the-art detection methods. On the hardware side, we introduce VecMADS, a novel FPGA architecture designed to efficiently handle both GEMV and SpMV operations. Our architecture leverages the high memory bandwidth of HBM to overcome memory bottlenecks, providing a comprehensive solution for accelerating matrix-vector multiplication on FPGAs. Evaluation results show that VecMADS delivers 1.5x higher throughput and 4.8x higher energy efficiency than the cuSPARSE library on GPU. On dense benchmarks, VecMADS achieves 1.26x higher throughput than the hipBLAS library running on GPU.
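To make the two kernel types named in the abstract concrete, the following minimal C sketch shows loop nests that commonly implement GEMV and SpMV in application code. The function names, the row-major dense layout, and the CSR sparse format are assumptions made for illustration only; the abstract does not state which loop shapes MATIO matches or which sparse format VecMADS consumes.

```c
#include <stddef.h>

/* Dense GEMV: y = A * x.
 * Assumption: A is an m-by-n matrix stored row-major in a flat array. */
void gemv_dense(size_t m, size_t n, const float *A, const float *x, float *y)
{
    for (size_t i = 0; i < m; ++i) {
        float acc = 0.0f;
        /* Every matrix element is read once and used once. */
        for (size_t j = 0; j < n; ++j)
            acc += A[i * n + j] * x[j];
        y[i] = acc;
    }
}

/* Sparse SpMV: y = A * x.
 * Assumption: A is in CSR form. row_ptr has m + 1 entries; the nonzeros of
 * row i are vals[row_ptr[i] .. row_ptr[i + 1] - 1], and col_idx holds the
 * column of each nonzero. */
void spmv_csr(size_t m, const size_t *row_ptr, const size_t *col_idx,
              const float *vals, const float *x, float *y)
{
    for (size_t i = 0; i < m; ++i) {
        float acc = 0.0f;
        /* Only stored nonzeros are visited; x is accessed indirectly. */
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            acc += vals[k] * x[col_idx[k]];
        y[i] = acc;
    }
}
```

In both loops each matrix element contributes one multiply and one add and is never reused, which is why these kernels tend to be limited by memory bandwidth rather than by arithmetic throughput.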
