April 17, 2024

Sampling-Based Multi-Job Placement for Heterogeneous Deep Learning Clusters

Abstract

Heterogeneous deep learning clusters commonly host a variety of distributed learning jobs. In such scenarios, the training efficiency of learning models is negatively affected by the slowest worker. To accelerate the training process, multiple learning jobs may compete for limited computational resources, posing significant challenges to multi-job placement among heterogeneous workers. This paper presents a heterogeneity-aware scheduler that solves the multi-job placement problem while taking into account job sizing and load balancing, with the goal of minimizing the average Job Completion Time (JCT) of deep learning jobs. A novel scheme based on proportional training workload assignment, feasible solution categorization, and matching markets is proposed with theoretical guarantees. To further reduce the computational complexity for low-latency decision-making and improve scheduling fairness, we propose to sparsify the feasible solution categories through sampling, which incurs negligible performance loss in JCT. We evaluate the performance of our design with real-world deep neural network benchmarks on heterogeneous computing clusters. Experimental results show that, compared to existing solutions, the proposed sampling-based scheme can achieve 1) results within 2.04% of the optimal JCT with orders-of-magnitude improvements in algorithm running time, and 2) high scheduling fairness among learning jobs.
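The abstract's point that the slowest worker limits training efficiency, and the idea of proportional workload assignment, can be illustrated with a small sketch. This is not the paper's scheduler: it only assumes that each worker has a known, fixed throughput in samples per second and that a synchronous iteration ends when the last worker finishes. The function names and the example speeds are hypothetical.

```python
def equal_split(batch_size, speeds):
    """Give every worker (almost) the same number of samples."""
    n = len(speeds)
    base, extra = divmod(batch_size, n)
    return [base + (1 if i < extra else 0) for i in range(n)]


def proportional_split(batch_size, speeds):
    """Split the batch in proportion to each worker's throughput.

    Fractional shares are rounded down, and the leftover samples go to
    the workers with the largest fractional remainders.
    """
    total = sum(speeds)
    raw = [batch_size * s / total for s in speeds]
    shares = [int(r) for r in raw]
    leftover = batch_size - sum(shares)
    by_remainder = sorted(
        range(len(speeds)), key=lambda i: raw[i] - shares[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        shares[i] += 1
    return shares


def iteration_time(shares, speeds):
    """A synchronous iteration takes as long as the slowest worker."""
    return max(share / speed for share, speed in zip(shares, speeds))


# Hypothetical cluster: three workers with different throughputs (samples/s).
speeds = [100.0, 200.0, 400.0]
batch = 700

equal = equal_split(batch, speeds)
prop = proportional_split(batch, speeds)

print("equal split:       ", equal, iteration_time(equal, speeds))
print("proportional split:", prop, iteration_time(prop, speeds))
```

Under these assumptions the equal split is held back by the 100 samples/s worker, while the proportional split lets all three workers finish at about the same time. The paper's scheme additionally handles job sizing, multiple competing jobs, and matching, none of which this sketch attempts.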

Authors

Kaiyang Liu

Jingrong Wang

Zhiming Huang

Journals

IEEE Transactions on Parallel and Distributed Systems

Institutions

University of Toronto

University of Victoria

Memorial University of Newfoundland

Cite this study

Liu et al. (2024) studied this question.

www.synapsesocial.com/papers/68e6ebe4b6db643587666ede — DOI: https://doi.org/10.1109/tpds.2024.3390109

Also consider

Synapse has enriched 3 closely related papers on similar questions. Consider them for comparative context:

Resource Allocation Problems · 2013 · 29 citations
Optimal static load balancing in distributed computer systems · 1985 · 393 citations
Networks, Crowds, and Markets: Reasoning About a Highly Connected World · 2,741 citations
