June 27, 2024 · Open Access

Enhancing Stability for Large Models Training in Constrained Bandwidth Networks

Abstract

Training extremely large language models with billions of parameters is computationally intensive and pushes the limits of current data-parallel training systems. While techniques like ZeRO++ have enabled efficient distributed training of such giant models on inexpensive low-bandwidth clusters, they can suffer from convergence issues due to potential race conditions in the hierarchical partitioning (hpZ) scheme employed to reduce cross-machine communication. In this work, we first show how these race conditions cause instability when training models with billions of parameters. We then propose a modification to the partitioning algorithm that addresses these convergence challenges while maintaining competitive training efficiency. Empirical evaluation on training the multi-billion-parameter Falcon and Llama-2 models demonstrates the updated algorithm's ability to achieve reliable convergence on these massive models, where stock ZeRO++ hpZ fails to converge. The updated algorithm enables robust training of larger models with 98% throughput and model training speed improvement without sacrificing the quality of convergence.

Cite This Study

Dai et al. (2024) studied this question.

synapsesocial.com/papers/68e63276b6db6435875c3c58 https://doi.org/10.48550/arxiv.2407.01614
