August 7, 2024

SLoB: Suboptimal Load Balancing Scheduling in Local Heterogeneous GPU Clusters for Large Language Model Inference

Abstract

Large language models (LLMs) are becoming powerful engines for social productivity in the manufacturing lifecycle. Existing application-level LLM inference services focus on large datacenter and small edge intelligence (EI) scenarios, adopting iteration-level batch schedulers to address resource utilization and inference speed. However, these services are ill-suited to medium-sized local heterogeneous graphics processing unit (GPU) clusters with specific usage patterns, whose scale lies between the two aforementioned scenarios. This setting poses tradeoff problems between inference resources and speed, as well as user satisfaction problems arising from the semisparse frequency of queries with streaming responses. We propose suboptimal load balancing (SLoB), a distributed LLM inference service scheduler for medium-sized local heterogeneous GPU clusters. SLoB leverages a multilevel adapter to accommodate the LLM usage patterns of a given scenario and to balance resource utilization with inference efficiency. For semisparse problems, it adopts a mixed-priority pipeline scheduler with the least-padding principle to improve user satisfaction, a metric that considers the weights of different tokens in streaming responses. Based on the system prototype, our experiments under simulated workloads demonstrate that SLoB gains a maximum improvement of 29.4 under the satisfaction metric compared with the traditional run-to-completion scheduling solution, while improving by up to 3.0 compared with the state-of-the-art (SOTA) solution Orca.
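The abstract names a least-padding principle but does not spell out how SLoB applies it. As a rough illustration only, and not the authors' algorithm, the sketch below assumes that padding is the number of filler tokens needed to bring every request in a batch up to that batch's longest request, and that grouping requests of similar length is one simple way to reduce it. The function names and the example lengths are hypothetical.

```python
from typing import List


def arrival_order_batches(lengths: List[int], batch_size: int) -> List[List[int]]:
    """Baseline: cut the request stream into batches in arrival order."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]


def length_sorted_batches(lengths: List[int], batch_size: int) -> List[List[int]]:
    """Sort by token length first, so each batch holds requests of similar length.

    Assumption: this is a stand-in for a least-padding policy, not SLoB itself.
    """
    ordered = sorted(lengths)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def padding_tokens(batches: List[List[int]]) -> int:
    """Total pad tokens when each request is padded to its batch's longest request."""
    return sum(max(batch) - n for batch in batches for n in batch)


if __name__ == "__main__":
    # Hypothetical token lengths of six queued requests.
    queue = [5, 120, 7, 118, 6, 119]
    print(padding_tokens(arrival_order_batches(queue, 3)))
    print(padding_tokens(length_sorted_batches(queue, 3)))
```

With short and long requests interleaved, the length-sorted grouping pads far fewer tokens than arrival-order batching. A real scheduler would also have to weigh waiting time and request priority, which this sketch ignores.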

Authors

Peiwen Jiang

H Wang

Zinuo Cai

Journals

IEEE Transactions on Computational Social Systems

Institutions

Shanghai Jiao Tong University

China University of Petroleum, East China

Kansai University
