High-throughput inference serving is important for applications built around large language models (LLMs). However, traditional inference frameworks suffer from bubbles that pervade the inference pipeline. Prior work groups multiple requests into batches and schedules these batches efficiently to reduce request-level and batch-level bubbles, but rarely addresses the bubbles within each decode iteration. In fact, tokens generated in the same iteration can have different costs depending on the KVCache they rely on: a token that relies on a very long KVCache is likely to be the bottleneck of the iteration, so iteration-level bubbles arise because the other tokens must wait a long time before entering the next iteration. In this work, we propose a novel prefix-aware batching policy that groups requests whose KVCaches are of similar length into the same batch, guaranteeing that bubbles within each iteration are eliminated. To support prefix-aware batching efficiently, we design a new inference framework called AlignedServe, which uses the large CPU memory to hold enough in-flight requests to form such batches. Batches formed in CPU memory are then scheduled by a batch-level scheduling policy that significantly reduces batch-level bubbles. To reduce the latency of transferring KVCache from CPU memory to GPU HBM, we propose using one GPU to prefetch KVCache for another. To the best of our knowledge, this is the first work to employ the GPU-Prefetch-For-GPU architecture. We evaluate AlignedServe through extensive experiments driven by both synthetic and application workloads. The experimental results demonstrate that AlignedServe improves the decoding throughput by up to 1.98× and reduces the latency by up to 7.4× compared to state-of-the-art systems.
Bai et al. (Mon,) studied this question.
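To make the idea of prefix-aware batching concrete, the following is a minimal sketch and not AlignedServe's actual policy. It assumes that the per-token decode cost grows with the length of the KVCache the token relies on, and it uses a simple sort-and-chunk grouping as a stand-in for the real batching algorithm. The names `Request`, `prefix_aware_batches`, and `iteration_bubble` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    req_id: int
    kv_len: int  # tokens currently held in this request's KVCache (assumed cost proxy)


def prefix_aware_batches(requests: List[Request], batch_size: int) -> List[List[Request]]:
    """Group requests with similar KVCache lengths (illustrative sort-and-chunk)."""
    ordered = sorted(requests, key=lambda r: r.kv_len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def fcfs_batches(requests: List[Request], batch_size: int) -> List[List[Request]]:
    """Baseline: batch requests in arrival order, ignoring KVCache length."""
    return [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]


def iteration_bubble(batch: List[Request]) -> int:
    """Proxy for idle work in one decode iteration: every request waits for
    the one with the longest KVCache (assumes cost proportional to kv_len)."""
    longest = max(r.kv_len for r in batch)
    return sum(longest - r.kv_len for r in batch)


if __name__ == "__main__":
    reqs = [Request(0, 100), Request(1, 4000), Request(2, 120), Request(3, 3900)]
    for name, policy in [("fcfs", fcfs_batches), ("prefix-aware", prefix_aware_batches)]:
        batches = policy(reqs, batch_size=2)
        total = sum(iteration_bubble(b) for b in batches)
        print(name, [[r.req_id for r in b] for b in batches], "bubble:", total)
```

In this toy workload, arrival-order batching pairs a short request with a long one in every batch, whereas grouping by KVCache length keeps each batch nearly uniform, which is the effect the policy described above aims for.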