What question did this study set out to answer?

The aim is to enhance the performance and efficiency of training dynamic graph neural networks using multiple computational storage devices.

April 15, 2026 | Open Access

M-DGNN: Accelerating Large-Scale Dynamic Graph Neural Network Training via PCIe-Interconnected Multiple Computational Storage Devices

Key Points

The study aims to improve the performance and efficiency of training Dynamic Graph Neural Networks (DGNNs) using multiple Computational Storage Devices (CSDs).
M-DGNN is a hardware-software co-designed architecture optimized for standard PCIe interconnects.
Direct peer-to-peer (P2P) DMA dataflows handle hidden state exchange between CSDs, bypassing host operating system intervention.
A host-assisted caching strategy uses host-pinned memory as an asynchronous DMA extension pool, shielding CSDs from high-latency flash I/O during subgraph sampling.
M-DGNN achieves up to a 6.2× end-to-end speedup over state-of-the-art DGNN systems.
The evaluation covers seven large-scale dynamic graph datasets.
The design targets two bottlenecks of naive multi-CSD architectures: host-bounced memory copies between devices, and sparse graph sampling that exceeds the limited local DRAM of the CSDs.

Abstract

The explosive growth of temporal graph data has led to significant training overheads for Dynamic Graph Neural Networks (DGNNs), a bottleneck primarily driven by massive data movement between host processors and storage arrays across conventional PCIe I/O buses. While near-data processing with Computational Storage Devices (CSDs) can alleviate this bottleneck, a single CSD is inherently incapable of meeting the terabyte-scale capacity requirements and complex sequence modeling demands of modern large-scale DGNNs. Horizontal scaling with multi-CSD clusters over standard PCIe topologies presents a viable, cost-effective solution, yet our in-depth profiling identifies two critical architectural bottlenecks in naive multi-CSD architectures: host-bounced memory copies significantly compromise inter-device communication efficiency, and sparse graph sampling frequently exceeds the capacity of the tightly constrained local DRAM of CSDs, resulting in excessive flash I/O and performance degradation. To address these interconnected bottlenecks, we propose M-DGNN, a hardware–software co-designed architecture optimized for standard PCIe interconnects. First, M-DGNN orchestrates direct peer-to-peer (P2P) DMA dataflows for inter-CSD hidden state exchange, completely bypassing host operating system intervention and reducing context-switching overhead. Second, we design a host-assisted caching strategy with a Host-Pinned Memory Extension (HPME) mechanism, which leverages host-pinned memory as an asynchronous DMA extension pool to shield resource-constrained CSDs from high-latency flash I/O during structural subgraph sampling. Extensive experimental evaluations across seven large-scale dynamic graph datasets demonstrate that M-DGNN delivers up to a 6.2× end-to-end speedup over state-of-the-art DGNN systems. This work establishes an efficient, scalable near-data computing paradigm for large-scale DGNN training.
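
The host-assisted caching idea in the abstract can be pictured as a two-tier lookup: a small CSD-local DRAM cache, a larger host-pinned pool behind it, and flash as the last resort. The sketch below is a toy software model of that lookup order only. The class name `TieredCache`, the `read_flash` callback, the LRU eviction, and the policy of spilling evicted entries into the host pool are all assumptions made for illustration; the abstract does not specify M-DGNN's actual policy, and real DMA transfers are not modeled.

```python
from collections import OrderedDict


class TieredCache:
    """Toy two-tier cache: CSD-local DRAM backed by a host-pinned pool.

    Hypothetical model: LRU eviction and victim spilling are assumptions,
    not the policy described in the paper.
    """

    def __init__(self, local_capacity, host_capacity):
        self.local = OrderedDict()  # stands in for CSD-local DRAM
        self.host = OrderedDict()   # stands in for the host-pinned extension pool
        self.local_capacity = local_capacity
        self.host_capacity = host_capacity
        self.stats = {"local": 0, "host": 0, "flash": 0}

    def _insert_local(self, key, value):
        # Place the entry in the local tier as the most recently used item.
        self.local[key] = value
        self.local.move_to_end(key)
        if len(self.local) > self.local_capacity:
            # Spill the least recently used local entry into the host pool.
            victim_key, victim_value = self.local.popitem(last=False)
            self.host[victim_key] = victim_value
            self.host.move_to_end(victim_key)
            if len(self.host) > self.host_capacity:
                # The host pool is full too: drop its oldest entry.
                self.host.popitem(last=False)

    def get(self, key, read_flash):
        """Return the value for key, recording which tier served it."""
        if key in self.local:
            self.local.move_to_end(key)
            self.stats["local"] += 1
            return self.local[key]
        if key in self.host:
            # Served from the host pool; promote back into the local tier.
            value = self.host.pop(key)
            self.stats["host"] += 1
        else:
            # Miss in both tiers: fall back to the slow flash read.
            value = read_flash(key)
            self.stats["flash"] += 1
        self._insert_local(key, value)
        return value


if __name__ == "__main__":
    cache = TieredCache(local_capacity=2, host_capacity=2)
    for node_id in [1, 2, 3, 1, 4, 2, 2]:
        cache.get(node_id, read_flash=lambda k: k * 10)
    print(cache.stats)
```

In this model, every access served from the host tier is one that would otherwise have gone to flash, which is the effect the abstract attributes to the extension pool. How much that helps in practice depends on pool size and the sampling access pattern, neither of which this sketch tries to reproduce.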
