The explosive growth of temporal graph data has led to significant training overheads for Dynamic Graph Neural Networks (DGNNs), a bottleneck driven primarily by massive data movement between host processors and storage arrays over conventional PCIe I/O buses. While near-data processing with Computational Storage Devices (CSDs) can alleviate this bottleneck, a single CSD cannot meet the terabyte-scale capacity requirements and complex sequence-modeling demands of modern large-scale DGNNs. Horizontal scaling with multi-CSD clusters over standard PCIe topologies is a viable, cost-effective solution, yet our in-depth profiling identifies two critical bottlenecks in naive multi-CSD architectures: host-bounced memory copies significantly reduce inter-device communication efficiency, and sparse graph sampling frequently exceeds the capacity of the CSDs' tightly constrained local DRAM, resulting in excessive flash I/O and performance degradation. To address these interconnected bottlenecks, we propose M-DGNN, a hardware–software co-designed architecture optimized for standard PCIe interconnects. First, M-DGNN orchestrates direct peer-to-peer (P2P) DMA dataflows for inter-CSD hidden-state exchange, bypassing host operating system intervention and reducing context-switching overhead. Second, we design a host-assisted caching strategy built on a Host-Pinned Memory Extension (HPME) mechanism, which uses host-pinned memory as an asynchronous DMA extension pool to shield resource-constrained CSDs from high-latency flash I/O during structural subgraph sampling. Extensive experimental evaluations on seven large-scale dynamic graph datasets demonstrate that M-DGNN delivers up to a 6.2× end-to-end speedup over state-of-the-art DGNN systems. This work establishes an efficient, scalable near-data computing paradigm for large-scale DGNN training.
Zhu et al. (Mon,) studied this question.