Three-dimensional compute-in-memory (3D-CIM) architectures, with their high bandwidth and strong parallelism, provide significant opportunities for accelerating graph neural network (GNN) inference. However, existing 3D-CIM accelerators still face two major challenges when handling deep graph neural networks (DeGNNs): (1) insufficient support for layer-wise sparsity, where zero values lead to redundant memory accesses and ineffective computations, reducing bandwidth utilization and increasing latency; and (2) lack of native support for cross-layer residual dependencies, where frequent data movement incurs additional storage and communication overhead, further increasing inference latency. To address these issues, we propose G3DMA, a high-efficiency 3D-CIM accelerator for DeGNN inference designed through hardware-software co-optimization. For sparse encoding, G3DMA employs differentiated compression: adjacency matrices are stored using a Dual-Bitmap Sparse Representation (DBSR), while node features adopt a bitmap-value separated Block Sparse Representation (BSR), significantly reducing DRAM access overhead while improving compression ratio and indexing efficiency. At the execution level, we design a three-stage model (combination, intermediate aggregation, and residual accumulation) augmented with sparsity-aware scheduling and zero-skipping, thereby avoiding full materialization of intermediate results and reducing ineffective computations. At the hardware level, G3DMA integrates lightweight compute arrays and dedicated codec units in the near-memory logic layer, efficiently supporting DBSR/BSR processing and block-wise memory accesses; it further implements a three-stage dataflow with pipelined control, residual-friendly accumulation paths, and low-overhead cross-vault routing. Experimental results demonstrate that G3DMA achieves speedups of 6007.41× and 106.28× over advanced CPU and GPU platforms, respectively.
Compared with the state-of-the-art (SOTA) accelerators HyGCN, GCIM, GCNim, and SGCN, G3DMA delivers 26.27×, 11.93×, 1.46×, and 2.42× performance improvements, respectively, and consistently outperforms these designs in both performance and energy efficiency.
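The abstract does not specify the DBSR or BSR layouts, so the following is only a minimal sketch of the general idea behind bitmap-value separation and zero-skipping: a block is split into a 0/1 bitmap marking nonzero positions and a packed list of the nonzero values, and a matrix-vector product then touches only the positions the bitmap marks. The function names and the row-major packing order are assumptions made for illustration, not the authors' format.

```python
# Illustrative sketch only: generic bitmap-value separation with zero-skipping.
# Names and layout are hypothetical; G3DMA's actual DBSR/BSR formats are not
# described in the abstract.

def encode_bitmap_values(block):
    """Split a dense block (list of rows) into a bitmap and packed nonzeros."""
    bitmap = [[1 if v != 0 else 0 for v in row] for row in block]
    # Nonzero values are packed in row-major order, matching the bitmap scan.
    values = [v for row in block for v in row if v != 0]
    return bitmap, values

def decode_bitmap_values(bitmap, values):
    """Rebuild the dense block by consuming packed values where the bit is set."""
    it = iter(values)
    return [[next(it) if bit else 0 for bit in row] for row in bitmap]

def zero_skipping_matvec(bitmap, values, x):
    """Compute y = block @ x, multiplying only where the bitmap marks a nonzero."""
    y = [0] * len(bitmap)
    k = 0  # index into the packed value list
    for i, row in enumerate(bitmap):
        for j, bit in enumerate(row):
            if bit:
                y[i] += values[k] * x[j]
                k += 1
    return y

block = [[0, 2, 0],
         [0, 0, 0],
         [1, 0, 3]]
bitmap, values = encode_bitmap_values(block)
y = zero_skipping_matvec(bitmap, values, [1, 2, 3])
```

In this toy case only three of the nine entries are stored and multiplied; the all-zero row contributes no work at all, which is the effect that sparsity-aware scheduling aims for at a much larger scale.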
Zhenyu Long
Yu Zhang
Yutao FU
Scientia Sinica Informationis
synapsesocial.com/papers/6a0ff42fd674f7c03778d512 — DOI: https://doi.org/10.1360/ssi-2025-0381