Distributed large model inference still faces a dilemma in balancing cost and effectiveness. Online scenarios require tensor parallelism to attain low latency, but the intensive communication it introduces increases cost. In contrast, pipeline parallelism enables high throughput with significantly reduced communication requirements, but it cannot reduce the latency of individual requests. Once a parallelism strategy is selected, the achievable latency-throughput trade-off is fixed, making it challenging to balance competing objectives. In this paper, we present Liger+, a distributed large model inference system that dynamically balances latency and throughput on multi-GPU architectures. The key idea is a novel interleaved parallelism, which interleaves computation and communication across requests. Liger+ comprises a task-aware batch management module and a distributed runtime module. The batch management module organizes batches based on the features of discriminative and generative tasks and feeds them to the runtime module. The distributed runtime module strategically schedules computation and communication kernels from multiple requests onto multiple streams across multiple GPUs, thereby enabling interleaved parallelism. First, it efficiently achieves precise control of kernel execution order by combining CPU-GPU synchronization with inter-stream synchronization. Second, it introduces a fine-grained resource mapping strategy and a contention factor strategy to anticipate the penalty arising from resource contention. Third, it enables a higher degree of overlap by decomposing kernels into smaller units at runtime. Extensive evaluations show that, in most cases, Liger+ can dynamically adapt to higher throughput demand while achieving better latency across models and devices than fixed parallelism strategies.
On a 4-device discriminative task, Liger+ reduces P90 latency by 43.8% while maintaining the same throughput as pipeline parallelism. Meanwhile, it improves throughput by 1.53× with improved P90 latency compared to tensor parallelism. For a 4-device generative task, Liger+ achieves an average 1.15× improvement in throughput and a 26.2% reduction in P90 latency compared to tensor parallelism.
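To illustrate why interleaving computation and communication across requests can help, the following is a minimal timeline sketch in Python. It is not the Liger+ scheduler: it assumes a toy model with one compute stream and one communication stream, requests that alternate compute and communication kernels of fixed duration, and no resource contention between the two streams. All names (`Kernel`, `make_request`, `serial_makespan`, `interleaved_makespan`) and all durations are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Kernel:
    kind: str        # "compute" or "comm" (hypothetical labels)
    duration: float  # arbitrary time units


def make_request(layers: int, compute: float, comm: float) -> List[Kernel]:
    """Build one request as alternating compute and communication kernels."""
    kernels = []
    for _ in range(layers):
        kernels.append(Kernel("compute", compute))
        kernels.append(Kernel("comm", comm))
    return kernels


def serial_makespan(requests: List[List[Kernel]]) -> float:
    """Run every kernel back to back, with no overlap at all."""
    return sum(k.duration for r in requests for k in r)


def interleaved_makespan(requests: List[List[Kernel]]) -> float:
    """Greedy schedule on two streams (assumed contention-free).

    Kernels of one request stay in order; each stream runs one kernel at
    a time, so one request's communication can overlap another request's
    computation.
    """
    stream_free = {"compute": 0.0, "comm": 0.0}
    request_ready = [0.0] * len(requests)
    next_index = [0] * len(requests)
    remaining = sum(len(r) for r in requests)
    finish = 0.0
    while remaining:
        best = None
        # Pick the request whose next kernel can start earliest.
        for i, r in enumerate(requests):
            if next_index[i] < len(r):
                k = r[next_index[i]]
                start = max(request_ready[i], stream_free[k.kind])
                if best is None or start < best[0]:
                    best = (start, i, k)
        start, i, k = best
        end = start + k.duration
        stream_free[k.kind] = end
        request_ready[i] = end
        next_index[i] += 1
        remaining -= 1
        finish = max(finish, end)
    return finish


if __name__ == "__main__":
    reqs = [make_request(layers=3, compute=2.0, comm=2.0) for _ in range(2)]
    print("serial:", serial_makespan(reqs))
    print("interleaved:", interleaved_makespan(reqs))
```

In this toy setting the interleaved schedule finishes two requests sooner than running them back to back, because one request's communication hides behind the other's computation. Real kernels do contend for GPU resources when overlapped, which is the penalty that the resource mapping and contention factor strategies described above are meant to anticipate.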
Wei et al. (Mon,) studied this question.