February 25, 2026 | Open Access

Dynamic Latency-Throughput Balancing in Distributed Large Model Inference with Interleaved Parallelism

Key Points

Aims to develop a system that dynamically balances latency and throughput in distributed large model inference.
Implements dynamic interleaved parallelism, which interleaves computation and communication across requests.
Introduces a task-aware batch management module that organizes batches by the features of discriminative and generative tasks.
Uses a distributed runtime module to schedule computation and communication kernels across multiple streams on multiple GPUs.
On a 4-device discriminative task, reduces P90 latency by 43.8% at the same throughput as pipeline parallelism.
On the same discriminative task, improves throughput by 1.53× with better P90 latency than tensor parallelism.
On a 4-device generative task, improves throughput by 1.15× on average and reduces P90 latency by 26.2% compared to tensor parallelism.

Abstract

Distributed large model inference still faces a dilemma in balancing cost and effect. Online scenarios require tensor parallelism to attain low latency, but the intensive communication it introduces increases the cost. In contrast, pipeline parallelism enables high throughput with significantly reduced communication requirements, but it cannot improve each request's effectiveness. Once a parallelism strategy is selected, the performance metrics become fixed, making it challenging to balance competing objectives. In this paper, we present Liger+, a distributed large model inference system capable of achieving a dynamic balance between latency and throughput on multi-GPU architectures. The key idea is a novel interleaved parallelism, which interleaves computation and communication across requests. Liger+ includes a task-aware batch management module and a distributed runtime module. The batch management module organizes batches based on the features of discriminative and generative tasks and feeds them to the runtime module. The distributed runtime module strategically schedules computation and communication kernels from multiple requests onto multiple streams of multiple GPUs, which enables the interleaved parallelism. First, it achieves precise and efficient control of kernel execution order by combining CPU-GPU synchronization with inter-stream synchronization. Second, it introduces a fine-grained resource mapping strategy and a contention factor strategy to anticipate the penalty arising from resource contention. Third, it enables a higher degree of overlap by decomposing kernels into smaller, more manageable units at runtime. Extensive evaluations show that, compared to fixed parallelism strategies, Liger+ can in most cases dynamically fit higher throughput demand while achieving better latency across models and devices.
On a 4-device discriminative task, Liger+ reduces the P90 latency by 43.8% while maintaining the same throughput compared to the pipeline parallelism. Meanwhile, it improves the throughput by 1.53× with improved P90 latency compared to the tensor parallelism. For a 4-device generative task, Liger+ achieves an average 1.15× improvement in throughput and a 26.2% reduction in P90 latency compared to the tensor parallelism.
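
The idea of interleaving computation and communication across requests can be illustrated with a toy timing model. The sketch below is not the Liger+ scheduler: the `Request` type, the two-resource model (one compute engine, one interconnect), the greedy earliest-start policy, and all durations are assumptions made for illustration, and resource contention and kernel decomposition are not modeled. It only shows why letting one request's communication overlap with another request's computation can shorten the total time.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request: each layer is one compute step, then one comm step."""
    layers: int
    compute_ms: float
    comm_ms: float


def serial_makespan(requests):
    """Total time when requests run back to back with no overlap at all."""
    return sum(r.layers * (r.compute_ms + r.comm_ms) for r in requests)


def interleaved_makespan(requests):
    """Total time when compute and comm steps of different requests may overlap.

    Greedy list scheduling on two exclusive resources. Within a request the
    steps form a dependency chain: compute, comm, compute, comm, ...
    """
    free_at = {"compute": 0.0, "comm": 0.0}   # when each resource becomes idle
    ready_at = [0.0] * len(requests)          # when each request's next step may start
    next_op = [0] * len(requests)             # position in each request's chain
    total_ops = [2 * r.layers for r in requests]

    while any(done < total for done, total in zip(next_op, total_ops)):
        best = None
        for i in range(len(requests)):
            if next_op[i] >= total_ops[i]:
                continue
            kind = "compute" if next_op[i] % 2 == 0 else "comm"
            start = max(ready_at[i], free_at[kind])
            # Pick the pending step that can start earliest (ties: lowest index).
            if best is None or start < best[0]:
                best = (start, i, kind)
        start, i, kind = best
        duration = requests[i].compute_ms if kind == "compute" else requests[i].comm_ms
        end = start + duration
        free_at[kind] = end
        ready_at[i] = end
        next_op[i] += 1

    return max(ready_at)


if __name__ == "__main__":
    reqs = [Request(layers=2, compute_ms=2.0, comm_ms=2.0) for _ in range(2)]
    print("serial:", serial_makespan(reqs))            # 16.0
    print("interleaved:", interleaved_makespan(reqs))  # 10.0
```

In this toy setting the second request computes while the first one communicates, so the two requests finish in 10 ms instead of 16 ms. The numbers are properties of the assumed durations only and say nothing about the measured results reported in the abstract.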
