Large language models (LLMs) are becoming powerful engines for social productivity in the manufacturing lifecycle. Existing application-level LLM inference services focus on large-datacenter and small edge intelligence (EI) scenarios, adopting iteration-level batch schedulers to address resource utilization and inference speed. However, these services are ill-suited to medium-sized local heterogeneous graphics processing unit (GPU) clusters with specific usage patterns, whose scale lies between the two aforementioned scenarios. This setting poses a tradeoff between inference resources and speed, as well as user satisfaction problems arising from the semisparse frequency of queries with streaming responses. We propose suboptimal load balancing (SLoB), a distributed LLM inference service scheduler for medium-sized local heterogeneous GPU clusters. SLoB leverages a multilevel adapter to accommodate the LLM usage patterns of such settings and to balance resource utilization with inference efficiency. For the semisparse problem, it adopts a mixed-priority pipeline scheduler with the least-padding principle to improve user satisfaction, a metric that considers the weights of different tokens in streaming responses. Based on a system prototype, our experiments under simulated workloads demonstrate that SLoB achieves a maximum improvement of 29.4 under the satisfaction metric compared with the traditional run-to-completion scheduling solution, and of up to 3.0 compared with the state-of-the-art (SOTA) solution Orca.
Peiwen Jiang
H Wang
Zinuo Cai
IEEE Transactions on Computational Social Systems
Shanghai Jiao Tong University
China University of Petroleum, East China
Kansai University
Jiang et al. studied this question.
www.synapsesocial.com/papers/68e5d24cb6db6435875688b8 — DOI: https://doi.org/10.1109/tcss.2024.3423749