3D parallel training of large-scale deep learning models generates bursty network traffic with hot spots and periodic high-bandwidth phases. This traffic strains network efficiency and calls for a high-performance, flexible optical network. To address this, this paper proposes Mercury, a hybrid optical network built from physical optical components. Its optical timeslot switching (OTS) subnet uses an arrayed waveguide grating router (AWGR) and tunable lasers to carry dynamic traffic, while its optical circuit switching (OCS) subnet relies on wavelength selective switches (WSSs) for low-latency, high-bandwidth transmission. The two subnets are coordinated by the selective Valiant load balancing (S-VLB) and most efficient path configuration (MEPC) mechanisms. In simulations and FPGA-based testbed experiments, Mercury outperforms the Sirius network: it reduces epoch training time (e.g., 179 s with five jobs) and relieves OTS congestion by offloading large flows to the OCS subnet. These results indicate that Mercury provides a flexible, high-performance physical optical solution for 3D parallel training of large-scale deep learning models.
Feng et al. studied this question.