December 5, 2025

Dynamic Load Balancing for Distributed Large Model Training: A Hybrid Framework of Gray Markov Chain and MDP

Key Points

Dynamic load balancing improves resource allocation and throughput in distributed training environments.
Experimental results show improved scalability and efficiency for deep learning workloads.
The approach combines dynamic load balancing with model parallelism to reduce inefficient resource usage.
The findings have implications for optimizing high-demand applications within data centers.

Abstract

Large-scale model training in distributed data centers plays a crucial role in deep learning, but it faces significant challenges, including resource fragmentation, low bandwidth utilization, and complex task flow management. The problem is exacerbated by high-speed, high-capacity parameter synchronization, often exceeding several hundred Gbps, which reduces throughput and causes computational inefficiencies. To address these challenges, this paper proposes an approach that combines data parallelism, model parallelism, and dynamic load balancing. By integrating a Gray Markov Chain (GMC) model and a Markov Decision Process (MDP) model, the approach dynamically schedules resources and balances computational loads. The GMC model predicts future node loads, which guides the decomposition of weight matrices, while the MDP model adjusts data transmission paths to optimize network traffic management. Together, the two models improve both resource allocation and data flow. Experimental results show that the integrated approach improves throughput, resource utilization, and computational efficiency compared to traditional methods. The findings suggest that the hybrid approach is well suited to large-scale distributed training tasks in multi-data-center environments, where it improves the scalability and performance of deep learning workloads. The research has promising implications for the efficiency and effectiveness of distributed training systems in high-demand applications.
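
To make the load-prediction idea concrete, the following is a minimal sketch of one common grey-Markov forecasting variant: a GM(1,1) grey model fitted to a node's recent load history, followed by a Markov-chain correction based on the model's past relative errors. It is an illustration only and is not taken from the paper. The function names, the three error states, the `threshold` value, the add-one smoothing, and the sample load series are all assumptions made for this example.

```python
import math


def gm11_fit(series):
    """Fit a GM(1,1) grey model and return (a, b).

    a is the development coefficient and b the grey input.
    Assumes at least 3 positive samples.
    """
    # Accumulated (1-AGO) series damps random fluctuation in the raw loads.
    x1, total = [], 0.0
    for value in series:
        total += value
        x1.append(total)
    # Background values: means of adjacent accumulated points.
    z = [0.5 * (x1[k] + x1[k - 1]) for k in range(1, len(series))]
    y = series[1:]
    m = len(z)
    sz, sy = sum(z), sum(y)
    szz = sum(v * v for v in z)
    szy = sum(zv * yv for zv, yv in zip(z, y))
    # Least-squares line y = slope * z + b, where a = -slope.
    slope = (m * szy - sz * sy) / (m * szz - sz * sz)
    return -slope, (sy - slope * sz) / m


def gm11_predict(series, a, b, k):
    """GM(1,1) estimate of the original series at index k (k >= 1)."""
    if abs(a) < 1e-12:
        return b  # flat series: the model degenerates to a constant
    c = series[0] - b / a
    return c * (math.exp(-a * k) - math.exp(-a * (k - 1)))


def grey_markov_forecast(series, threshold=0.005):
    """Return (raw, corrected) one-step-ahead forecasts.

    The error states and threshold are illustrative assumptions.
    """
    a, b = gm11_fit(series)
    n = len(series)
    fitted = [gm11_predict(series, a, b, k) for k in range(1, n)]
    # Relative residuals of the in-sample fit.
    resid = [(obs - fit) / fit for obs, fit in zip(series[1:], fitted)]
    # State 0: model overshot, 1: roughly right, 2: model undershot.
    states = [0 if r < -threshold else 2 if r > threshold else 1 for r in resid]
    # Transition counts with add-one smoothing for unseen transitions.
    counts = [[1.0] * 3 for _ in range(3)]
    for s, t in zip(states, states[1:]):
        counts[s][t] += 1.0
    row = counts[states[-1]]
    probs = [c / sum(row) for c in row]
    # Mean residual observed in each state (0 if the state never occurred).
    means = []
    for j in range(3):
        vals = [r for r, s in zip(resid, states) if s == j]
        means.append(sum(vals) / len(vals) if vals else 0.0)
    expected = sum(p * mu for p, mu in zip(probs, means))
    raw = gm11_predict(series, a, b, n)
    return raw, raw * (1.0 + expected)


# Hypothetical utilization history for one node (fraction of capacity).
loads = [0.50, 0.55, 0.61, 0.66, 0.73, 0.80]
raw, corrected = grey_markov_forecast(loads)
print(f"raw={raw:.3f} corrected={corrected:.3f}")
```

A scheduler could compare such forecasts across nodes and assign smaller partitions to nodes whose predicted load is highest. How the paper maps predictions to weight matrix decomposition is not specified in the abstract.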

Cite This Study

Li et al. studied this question.

synapsesocial.com/papers/6932311e8e51979591dce49a https://doi.org/10.1002/cpe.70456
