ABSTRACT Large-scale model training across distributed data centers is central to deep learning, but it faces significant challenges: resource fragmentation, low bandwidth utilization, and complex task flow management. High-speed, high-capacity parameter synchronization, which often exceeds several hundred Gbps, exacerbates these problems and reduces throughput and computational efficiency. To address them, this paper proposes an approach that combines data parallelism, model parallelism, and dynamic load balancing. It integrates a Gray Markov Chain (GMC) model and a Markov Decision Process (MDP) model to schedule resources dynamically and balance computational loads. The GMC model predicts future node loads, which guides the decomposition of weight matrices, while the MDP model adjusts data transmission paths to manage network traffic. Together, the two models improve both resource allocation and data flow. Experimental results show that the integrated approach significantly improves throughput, resource utilization, and computational efficiency compared with traditional methods. The findings suggest that the hybrid approach is well suited to large-scale distributed training in multi-data-center environments, where it improves the scalability and performance of deep learning workloads. The results have promising implications for distributed training systems in high-demand applications.
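The abstract does not give the GMC formulation, so the following is only a minimal sketch of the load-prediction idea, assuming the standard grey GM(1,1) forecasting model as the base predictor. The function name `gm11_forecast` and the sample load values are hypothetical, and the Markov-chain correction of residuals that a full Gray Markov Chain model would add is omitted here.

```python
import math


def gm11_forecast(loads, steps=1):
    """Forecast the next `steps` values of a positive series with GM(1,1).

    Illustrative only: this is the textbook grey model, not the paper's
    GMC implementation, and it has no Markov residual correction.
    """
    n = len(loads)
    if n < 4:
        raise ValueError("GM(1,1) needs at least 4 observations")

    # Accumulated generating operation (1-AGO): running sums of the series.
    x1 = []
    total = 0.0
    for value in loads:
        total += value
        x1.append(total)

    # Background values: means of consecutive accumulated points.
    z = [0.5 * (x1[k] + x1[k - 1]) for k in range(1, n)]
    y = loads[1:]

    # Closed-form least squares for y = slope * z + b, where slope = -a.
    m = n - 1
    sz, sy = sum(z), sum(y)
    szz = sum(v * v for v in z)
    szy = sum(zi * yi for zi, yi in zip(z, y))
    slope = (m * szy - sz * sy) / (m * szz - sz * sz)
    b = (sy - slope * sz) / m
    a = -slope

    # A flat series gives a ~ 0; the forecast is then simply the level b.
    if abs(a) < 1e-12:
        return [b] * steps

    def x1_hat(k):
        # Time response of the accumulated series at 0-based index k.
        return (loads[0] - b / a) * math.exp(-a * k) + b / a

    # Difference the accumulated predictions to recover the original scale.
    return [x1_hat(k) - x1_hat(k - 1) for k in range(n, n + steps)]


# Hypothetical per-node load samples (fraction of capacity), growing ~10% per step.
history = [0.50 * 1.1 ** k for k in range(5)]
next_two = gm11_forecast(history, steps=2)
```

A scheduler could use forecasts like `next_two` to decide how large a share of a decomposed weight matrix to assign to each node; how the paper maps predicted load to the decomposition is not stated in the abstract.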
Yonggang Li
Rui Ji
Yaotong Su
Concurrency and Computation Practice and Experience
Chongqing University of Posts and Telecommunications
Li et al. studied this question.
www.synapsesocial.com/papers/6932311e8e51979591dce49a — DOI: https://doi.org/10.1002/cpe.70456