May 22, 2024 · Open Access

Hierarchical Heterogeneous Cluster Systems for Scalable Distributed Deep Learning

Abstract

Distributed deep learning frameworks should make training and inference of exascale deep learning algorithms highly efficient. Three major challenges stand in the way: scalability, adaptivity, and efficiency. A future framework must adapt to a variety of heterogeneous hardware and network environments, and must therefore scale from a single compute node up to large clusters. It should also integrate efficiently with popular frameworks such as TensorFlow and PyTorch. This paper proposes a dynamic hybrid (hierarchical) distribution structure for distributed deep learning that takes advantage of flexible synchronization on both centralized and decentralized architectures and implements multi-level, fine-grained parallelism on distributed platforms. The structure scales as the number of compute nodes increases and adapts to varying compute capabilities, memory structures, and communication costs.
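
The abstract does not spell out the synchronization scheme, so the following is only an illustrative sketch of what a two-level hybrid structure could look like, not the authors' method. It assumes workers are grouped by node, that each group averages gradients at a local leader (the centralized level), and that leaders then combine their group means weighted by group size (a stand-in for a decentralized all-reduce among leaders). The function names, group names, and gradient values are all hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def weighted_average(vectors: List[Vector], weights: List[float]) -> Vector:
    """Return the weighted mean of equally sized vectors."""
    total = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]


def hierarchical_average(groups: Dict[str, List[Vector]]) -> Vector:
    """Two-level gradient averaging over groups of workers.

    Level 1 (centralized): each group's leader averages its workers' gradients.
    Level 2 (decentralized stand-in): leaders combine the group means,
    weighted by group size, as an all-reduce among leaders would.
    """
    group_means: List[Vector] = []
    group_sizes: List[float] = []
    for workers in groups.values():
        group_means.append(weighted_average(workers, [1.0] * len(workers)))
        group_sizes.append(float(len(workers)))
    return weighted_average(group_means, group_sizes)


# Hypothetical heterogeneous cluster: one node with three workers, one with a single worker.
cluster = {
    "gpu_node": [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    "cpu_node": [[7.0, 8.0]],
}
result = hierarchical_average(cluster)

# Reference: a flat average over all workers, ignoring the hierarchy.
all_workers = [g for workers in cluster.values() for g in workers]
flat = weighted_average(all_workers, [1.0] * len(all_workers))
```

Weighting the second level by group size is the design choice that keeps the hierarchical result equal to a flat average over all workers even when groups differ in size; an unweighted mean of group means would over-represent the single-worker node.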

Authors

Yibo Wang

University of Michigan

Tongsheng Geng

University of California, Irvine

E. L. R. da Silva

Pontifícia Universidade Católica de Minas Gerais

Institutions

University of California, Irvine

Pontifícia Universidade Católica de Minas Gerais
