What question did this study set out to answer?

This work aims to enhance the performance of large language model inference on geographically distributed servers by optimizing resource allocation.

April 3, 2026

Optimizing Resource Allocation for Geographically-Distributed Inference by Large Language Models

Key Points

Investigated the block placement and request routing problems in distributed LLM systems.
Utilized pipeline parallelism to manage model distribution across devices with limited GPU memory.
Developed a rigorous formulation to optimize performance based on unique GPU characteristics.
Built on prior findings that pipeline parallelism can run LLM inference much faster than local parameter offloading.
Proposed a novel resource allocation framework that improves performance in distributed settings.
Found that optimizing block placement and request routing enhances overall efficiency in large model operations.

Abstract

Large language models (LLMs) have achieved exceptional performance in many artificial intelligence tasks. Modern LLMs pre-trained on large datasets have demonstrated promising utility for diverse applications. However, to achieve state-of-the-art accuracy, such models often need to contain more than 50 billion parameters, which makes them costly to work with because they require expensive hardware such as high-end GPUs. To expand access to LLMs, several research groups have open-sourced their pre-trained LLMs. However, the sheer size of these models makes it difficult to adopt them, even if no training is needed. Prior studies have shown that the bottleneck of running LLMs is not the computation speed but the GPU memory. A 100B-parameter model requires 200 GB of GPU memory just to load the model parameters at the standard half precision, and many LLMs are even larger. Providing this much GPU memory is financially challenging for many users. Recently, a model-parallel approach has been proposed to address this challenge through resource pooling across distributed devices. This approach distributes a model to multiple devices at the granularity of transformer blocks (a.k.a. pipeline parallelism) or neurons (a.k.a. tensor parallelism) to run large models on devices with small GPUs. In particular, prior work has shown that pipeline parallelism can be used to run LLM inference tasks on geographically distributed servers, each with only a few GB of GPU memory, at a much faster rate than local parameter offloading. This is achieved by letting each server host a subset of consecutive blocks and routing each inference request through a chain of servers that collectively host the entire model. In this work, we consider LLM inference over geographically distributed servers using pipeline parallelism and client-centric communication, as illustrated in Fig. 1.
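
The chain-of-servers idea described above can be sketched in a few lines of Python. This is only an illustration, not the routing method of this work or of PETALS: the server names, the block ranges, and the helper `find_chain` are hypothetical, and the greedy rule (always hand off to the server that reaches furthest) minimizes the number of hops while ignoring latency and server load.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical placement: server name -> half-open range [start, end) of the
# consecutive transformer blocks it hosts. All values are illustrative.
Placement = Dict[str, Tuple[int, int]]


def find_chain(placement: Placement, num_blocks: int) -> Optional[List[Tuple[str, int, int]]]:
    """Return a chain of (server, first_block, end_block) hops that covers
    blocks 0..num_blocks-1 in order, or None if some block is hosted nowhere.

    Greedy rule (an assumption of this sketch): at each step, pick the server
    that hosts the current block and extends furthest.
    """
    chain: List[Tuple[str, int, int]] = []
    block = 0
    while block < num_blocks:
        # Servers that can process the current block.
        candidates = [
            (end, name)
            for name, (start, end) in placement.items()
            if start <= block < end
        ]
        if not candidates:
            return None  # no server hosts this block, so no chain exists
        end, name = max(candidates)
        end = min(end, num_blocks)
        chain.append((name, block, end))
        block = end
    return chain


placement = {"s1": (0, 4), "s2": (2, 7), "s3": (4, 6), "s4": (6, 8)}
print(find_chain(placement, num_blocks=8))
```

In this toy placement the request passes through `s1`, `s2`, and `s4`, and `s3` is skipped because `s2` already covers its blocks. A performance-aware router would also weigh per-hop network delay and per-server compute time, which this sketch leaves out.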
Although the feasibility of such systems has been validated in prior work, there is a lack of fundamental understanding of how to optimally manage their performance. The current solution is based on heuristics for resource allocation. It remains open how to optimize the performance of such systems while taking into account the unique characteristics of GPUs and LLM inference tasks. This work aims to fill this gap by rigorously formulating and addressing the block placement and request routing problem, using PETALS as a concrete example.
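
The memory figure quoted in the abstract follows from simple arithmetic: half precision stores each parameter in 2 bytes, so 100 billion parameters occupy 200 GB. The short sketch below reproduces that number and gives a rough lower bound on how many small-GPU servers would be needed to hold the weights. The function names and the 8 GB per-server figure are assumptions for illustration, and the estimate ignores memory for activations and attention caches.

```python
import math


def param_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes).

    bytes_per_param=2 corresponds to half precision.
    """
    return num_params * bytes_per_param / 1e9


def min_servers(total_gb: float, per_server_gb: float) -> int:
    """Lower bound on the server count, assuming (optimistically) that every
    byte of GPU memory on each server can be used for weights."""
    return math.ceil(total_gb / per_server_gb)


weights_gb = param_memory_gb(100e9)        # 100B parameters at half precision
print(weights_gb)                          # 200.0
print(min_servers(weights_gb, 8.0))        # hypothetical 8 GB GPUs
```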
