Large language models (LLMs) have achieved exceptional performance in many artificial intelligence tasks, and modern LLMs pre-trained on large datasets have demonstrated promising utility for diverse applications. However, to achieve state-of-the-art accuracy, such models often need more than 50 billion parameters, which makes them costly to work with because they require high-end GPUs. To expand access to LLMs, several research groups have open-sourced their pre-trained models. However, the sheer size of these models makes them difficult to adopt, even when no training is needed. Prior studies have shown that the bottleneck in running LLMs is not computation speed but GPU memory. A 100B-parameter model requires 200 GB of GPU memory just to load its parameters at the standard half precision (2 bytes per parameter), and many LLMs are even larger. Providing this much GPU memory is financially challenging for many users.
Recently, a model-parallel approach has been proposed to address this challenge through resource pooling across distributed devices. This approach distributes a model across multiple devices at the granularity of transformer blocks (a.k.a. pipeline parallelism) or neurons (a.k.a. tensor parallelism), so that large models can run on devices with small GPUs. In particular, prior work has shown that pipeline parallelism can be used to run LLM inference tasks on geographically distributed servers, each with only a few GB of GPU memory, at a much faster rate than local parameter offloading. This is achieved by letting each server host a subset of consecutive blocks and routing each inference request through a chain of servers that collectively host the entire model. In this work, we consider LLM inference over geographically distributed servers using pipeline parallelism and client-centric communication, as illustrated in Fig. 1.
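The memory figure and the server-chain idea described above can be illustrated with a minimal Python sketch. It is not the PETALS implementation or the formulation developed in this work: the function names, the half-open `(start, end)` block ranges, and the simplifying assumption that each server in a chain processes one contiguous, non-overlapping range are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def param_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed only to hold the parameters, in GB (10^9 bytes).

    bytes_per_param=2 corresponds to half precision (16 bits per parameter).
    Activations and attention caches are ignored in this estimate.
    """
    return num_params * bytes_per_param / 1e9


def is_valid_chain(chain: List[Tuple[int, int]], num_blocks: int) -> bool:
    """Check that a chain of servers collectively hosts the entire model.

    `chain` lists, in routing order, the half-open range (start, end) of
    consecutive transformer blocks that each server processes.
    Assumption (illustrative): ranges must be contiguous and non-overlapping.
    """
    next_block = 0
    for start, end in chain:
        # Each hop must begin exactly where the previous one stopped
        # and must process at least one block.
        if start != next_block or end <= start:
            return False
        next_block = end
    return next_block == num_blocks


if __name__ == "__main__":
    # 100B parameters at 2 bytes each.
    print(param_memory_gb(100e9))
    # Three hypothetical servers covering an 8-block model.
    print(is_valid_chain([(0, 3), (3, 6), (6, 8)], num_blocks=8))
    # A gap at block 3 means this chain cannot serve a request.
    print(is_valid_chain([(0, 3), (4, 8)], num_blocks=8))
```

The first function reproduces the parameter-memory arithmetic, and the second checks the one property the text requires of a route: that the servers along it jointly cover every block in order.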
Although the feasibility of such systems has been validated in prior work, there is a lack of fundamental understanding of how to optimally manage their performance. The current solution is based on heuristics for resource allocation. It remains open how to optimize the performance of such systems while taking into account the unique characteristics of GPUs and LLM inference tasks. This work aims to fill this gap by rigorously formulating and addressing the block placement and request routing problem, using PETALS as a concrete example.
Tingyang Sun
Ting He
Bo Ji
ACM SIGMETRICS Performance Evaluation Review
Pennsylvania State University
Virginia Tech
Indian Institute of Science Bangalore
Sun et al. studied this question.
www.synapsesocial.com/papers/69cf5f645a333a821460e926 — DOI: https://doi.org/10.1145/3797823.3797828