Three-dimensional compute-in-memory (3D-CIM) architectures offer high bandwidth and strong parallelism, making them well suited to accelerating graph neural network (GNN) inference. However, existing 3D-CIM accelerators still face two major challenges when handling deep graph neural networks (DeGNNs): (1) insufficient support for layer-wise sparsity, where zero values cause redundant memory accesses and ineffective computations, reducing bandwidth utilization and increasing latency; and (2) no native support for cross-layer residual dependencies, so that frequent data movement incurs additional storage and communication overhead and further increases inference latency. To address these issues, we propose G3DMA, a high-efficiency 3D-CIM accelerator for DeGNN inference designed through hardware-software co-optimization. For sparse encoding, G3DMA employs differentiated compression: adjacency matrices are stored using Dual-Bitmap Sparse Representation (DBSR), while node features adopt a bitmap-value separated Block Sparse Representation (BSR), which reduces DRAM access overhead while improving compression ratio and indexing efficiency. At the execution level, we design a three-stage model (combination, intermediate aggregation, residual accumulation) augmented with sparsity-aware scheduling and zero-skipping, which avoids full materialization of intermediate results and reduces ineffective computations. At the hardware level, G3DMA integrates lightweight compute arrays and dedicated codec units in the near-memory logic layer to support DBSR/BSR processing and block-wise memory accesses; it further implements a three-stage dataflow with pipelined control, residual-friendly accumulation paths, and low-overhead cross-vault routing. Experimental results demonstrate that G3DMA achieves speedups of 6007.41× and 106.28× over advanced CPU and GPU platforms, respectively. Compared with the state-of-the-art (SOTA) accelerators HyGCN, GCIM, GCNim, and SGCN, G3DMA delivers 26.27×, 11.93×, 1.46×, and 2.42× performance improvements, respectively, and consistently outperforms these designs in both performance and energy efficiency.
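To make the idea of bitmap-value separated block encoding with zero-skipping concrete, the following is a minimal software sketch. It is not the DBSR or BSR format used by G3DMA, whose exact layout is not given here. It assumes a generic scheme in which each fixed-width block of a feature row is stored as a presence bitmap plus a packed list of its nonzero values. The block width `BLOCK` and all function names are hypothetical.

```python
from typing import List, Tuple

BLOCK = 8  # hypothetical block width; the accelerator's real block size is not stated

Encoded = Tuple[int, List[float]]  # (presence bitmap, packed nonzero values)


def encode_block(values: List[float]) -> Encoded:
    """Split one block into a bitmap (bit i set if values[i] != 0) and its nonzeros."""
    bitmap = 0
    nonzeros = []
    for i, v in enumerate(values):
        if v != 0:
            bitmap |= 1 << i
            nonzeros.append(v)
    return bitmap, nonzeros


def decode_block(bitmap: int, nonzeros: List[float], width: int) -> List[float]:
    """Rebuild the dense block by walking the bitmap and consuming packed values."""
    packed = iter(nonzeros)
    return [next(packed) if (bitmap >> i) & 1 else 0 for i in range(width)]


def encode_row(row: List[float], block: int = BLOCK) -> List[Encoded]:
    """Encode a feature row block by block; the last block may be shorter."""
    return [encode_block(row[s:s + block]) for s in range(0, len(row), block)]


def sparse_dot(encoded_row: List[Encoded], dense: List[float], block: int = BLOCK) -> float:
    """Dot product against a dense vector that skips zero entries and all-zero blocks."""
    total = 0
    for b, (bitmap, nonzeros) in enumerate(encoded_row):
        if bitmap == 0:
            continue  # zero-skipping at block granularity: no values are read
        base = b * block
        k = 0  # index into the packed nonzero list
        i = 0  # bit position inside the block
        while bitmap:
            if bitmap & 1:
                total += nonzeros[k] * dense[base + i]
                k += 1
            bitmap >>= 1
            i += 1
    return total
```

In this sketch an all-zero block costs a single bitmap check, and within a nonzero block only the set bits trigger a multiply-accumulate. That is the software analogue of avoiding redundant memory accesses and ineffective computations; a hardware codec would perform the bitmap scan in parallel rather than bit by bit.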
Long et al. studied this question.