What question did this study set out to answer?

The research aims to address performance challenges in deep graph neural networks through a high-efficiency 3D-CIM accelerator.

May 22, 2026 · Open Access

A high-efficiency 3D-stacked accelerator for deep graph neural network inference

Key Points

Proposed the G3DMA architecture, designed via hardware-software co-optimization
Implemented differentiated compression techniques for adjacency matrices and node features
Developed a three-stage execution model augmented with sparsity-aware scheduling and zero-skipping
G3DMA achieved speedups of 6007.41× compared to CPUs and 106.28× against GPUs
Outperformed state-of-the-art accelerators—HyGCN, GCIM, GCNim, and SGCN—by up to 26.27×
Consistently improved both performance and energy efficiency compared to existing designs
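
The bitmap-value separated idea behind the node-feature compression can be illustrated with a minimal sketch. It is not the paper's BSR or DBSR format: the block size of 4, the bit ordering, and every function name below are assumptions chosen for illustration. Each fixed-size block of a feature row is described by a small bitmap marking its nonzero positions, and the nonzero values are packed into a separate stream, so all-zero blocks cost only their bitmap.

```python
from typing import List, Tuple


def encode_block(block: List[float]) -> Tuple[int, List[float]]:
    """Encode one block as (bitmap, packed nonzero values).

    Bit i of the bitmap is set when block[i] is nonzero (assumed bit order).
    """
    bitmap = 0
    values = []
    for i, x in enumerate(block):
        if x != 0:
            bitmap |= 1 << i
            values.append(x)
    return bitmap, values


def encode_row(row: List[float], block_size: int = 4) -> Tuple[List[int], List[float]]:
    """Split a feature row into blocks; keep bitmaps and values in separate streams."""
    bitmaps: List[int] = []
    values: List[float] = []
    for start in range(0, len(row), block_size):
        bitmap, packed = encode_block(row[start:start + block_size])
        bitmaps.append(bitmap)
        values.extend(packed)
    return bitmaps, values


def decode_row(bitmaps: List[int], values: List[float], length: int,
               block_size: int = 4) -> List[float]:
    """Rebuild the dense row from the bitmap stream and the value stream."""
    out: List[float] = []
    pos = 0  # read position in the packed value stream
    for bitmap in bitmaps:
        for i in range(block_size):
            if (bitmap >> i) & 1:
                out.append(values[pos])
                pos += 1
            else:
                out.append(0.0)
    return out[:length]  # drop padding from a short final block


row = [0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 3.0]
bitmaps, values = encode_row(row)
restored = decode_row(bitmaps, values, len(row))
```

In this toy layout a decoder can skip a block entirely when its bitmap is zero, which is the property that makes bitmap formats a natural fit for zero-skipping hardware.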

Abstract

Three-dimensional compute-in-memory (3D-CIM) architectures, with their high bandwidth and strong parallelism, provide significant opportunities for accelerating graph neural network (GNN) inference. However, existing 3D-CIM accelerators still face two major challenges when handling deep graph neural networks (DeGNNs): (1) insufficient support for layer-wise sparsity, where zero values lead to redundant memory accesses and ineffective computations, resulting in reduced bandwidth utilization and increased latency; and (2) lack of native support for cross-layer residual dependencies, where frequent data movement incurs additional storage and communication overhead, further exacerbating inference latency. To address these issues, we propose G3DMA, a high-efficiency 3D-CIM accelerator designed through hardware-software co-optimization for DeGNN inference. For sparse encoding, G3DMA employs differentiated compression: adjacency matrices are stored using Dual-Bitmap Sparse Representation (DBSR), while node features adopt a bitmap-value separated Block Sparse Representation (BSR), significantly reducing DRAM access overhead while improving compression ratio and indexing efficiency. At the execution level, we design a three-stage model (combination, intermediate aggregation, and residual accumulation) augmented with sparsity-aware scheduling and zero-skipping, thereby avoiding full materialization of intermediate results and reducing ineffective computations. At the hardware level, G3DMA integrates lightweight compute arrays and dedicated codec units in the near-memory logic layer, efficiently supporting DBSR/BSR processing and block-wise memory accesses; it further implements a three-stage dataflow with pipelined control, residual-friendly accumulation paths, and low-overhead cross-vault routing. Experimental results demonstrate that G3DMA achieves speedups of 6007.41× and 106.28× over advanced CPU and GPU platforms, respectively. Compared with the state-of-the-art (SOTA) accelerators HyGCN, GCIM, GCNim, and SGCN, G3DMA delivers 26.27×, 11.93×, 1.46×, and 2.42× performance improvements, respectively, and consistently outperforms SOTA designs in both performance and energy efficiency.
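
The three execution stages named in the abstract (combination, intermediate aggregation, residual accumulation) can be sketched in software for a single layer. This is a simplified illustration under stated assumptions, not the G3DMA dataflow: the adjacency is given as unweighted neighbour lists, the residual has the same width as the layer output, and the function names are hypothetical. For clarity the sketch stores the combined features in full, whereas the paper's pipelined design is described as avoiding full materialization of intermediate results.

```python
from typing import List

Matrix = List[List[float]]


def combine(features: Matrix, weight: Matrix) -> Matrix:
    """Stage 1 (combination): compute X @ W, skipping zero feature entries."""
    out_dim = len(weight[0])
    combined = []
    for row in features:
        acc = [0.0] * out_dim
        for k, x in enumerate(row):
            if x == 0:  # zero-skipping: no multiply-accumulate for zero inputs
                continue
            for j in range(out_dim):
                acc[j] += x * weight[k][j]
        combined.append(acc)
    return combined


def aggregate_with_residual(adj_lists: List[List[int]], combined: Matrix,
                            residual: Matrix) -> Matrix:
    """Stages 2 and 3: sum neighbour rows, accumulating onto the residual input."""
    out = []
    for v, neighbours in enumerate(adj_lists):
        acc = list(residual[v])  # stage 3: accumulator starts from the residual
        for u in neighbours:     # stage 2: only nonzero adjacency entries are visited
            for j, value in enumerate(combined[u]):
                acc[j] += value
        out.append(acc)
    return out


def layer(adj_lists: List[List[int]], features: Matrix, weight: Matrix,
          residual: Matrix) -> Matrix:
    """One layer of the assumed form A @ (X @ W) + R."""
    return aggregate_with_residual(adj_lists, combine(features, weight), residual)


adj_lists = [[0, 1], [1], [0, 2]]          # neighbours of each node (assumed input form)
features = [[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
weight = [[1.0, 2.0], [3.0, 4.0]]
residual = [[0.5, 0.5], [0.0, 0.0], [1.0, 1.0]]
output = layer(adj_lists, features, weight, residual)
```

Starting each accumulator from the residual row, rather than adding the residual in a separate pass, mirrors the idea of a residual-friendly accumulation path: the skip connection costs one read instead of an extra write and re-read of the intermediate result.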
