The immense size of large language models (LLMs) has led to high resource demand and cost for running them. Although these models are now largely served on uniform, high-end GPUs, a heterogeneous cluster that mixes available high- and low-capacity GPUs can potentially reduce the serving cost substantially. This paper proposes LLM-PQ, a system that applies adaptive model quantization and phase-aware partitioning to improve LLM serving efficiency on heterogeneous GPU clusters. Extensive experiments on production inference workloads demonstrate improved inference throughput over state-of-the-art systems.
Zhao et al. studied this question.
www.synapsesocial.com/papers/68e786ffb6db6435876f9b4b — DOI: https://doi.org/10.1145/3627535.3638480
Juntao Zhao
Borui Wan
Chuan Wu
University of Hong Kong
Chinese University of Hong Kong
Seattle University