What question did this study set out to answer?

This work aims to improve LLM serving efficiency by eliminating iteration-level bubbles with a novel prefix-aware batching policy and by reducing batch-level bubbles with a batch-level scheduling policy.

May 20, 2026

AlignedServe: Orchestrating Prefix-aware Batching to Build a High-throughput and Computing-efficient LLM Serving System

Key Points

Developed the AlignedServe framework, which uses prefix-aware batching to group requests whose KVCaches are of similar length.
Implemented a batch-level scheduling policy for batches formed in CPU memory, reducing batch-level bubbles.
Used one GPU to prefetch KVCache for another, cutting the latency of transferring KVCache from CPU memory to GPU HBM.
Achieved up to a 1.98x improvement in decoding throughput compared to state-of-the-art systems.
Reduced latency by up to 7.4x relative to state-of-the-art systems.

Abstract

High-throughput inference serving is important for applications built around large language models (LLMs). However, traditional inference frameworks mostly suffer from bubbles that pervade the inference pipeline. Prior work groups multiple requests into batches and schedules those batches efficiently, thereby reducing request-level and batch-level bubbles, but rarely addresses the bubbles within each decode iteration. In fact, tokens generated in the same iteration can have different costs depending on the KVCache they rely on: a token that relies on a very long KVCache is likely to be the bottleneck of the iteration, and iteration-level bubbles arise because the other tokens must wait a long time before entering the next iteration. In this work, we propose a novel prefix-aware batching policy that groups requests whose KVCaches are of similar length into a batch, guaranteeing that bubbles within each iteration are eliminated. To support prefix-aware batching efficiently, we design a new inference framework called AlignedServe, which leverages large CPU memory to hold enough in-flight requests ready to be batched. Batches generated in CPU memory are then scheduled by a batch-level scheduling policy that significantly reduces batch-level bubbles. To reduce the latency of transferring KVCache from CPU memory to GPU HBM, we propose using one GPU to prefetch KVCache for another. To the best of our knowledge, this is the first work to employ the GPU-Prefetch-For-GPU architecture. We evaluate AlignedServe through extensive experiments driven by both synthetic and application workloads. The experimental results demonstrate that AlignedServe improves decoding throughput by up to 1.98× and reduces latency by up to 7.4× compared to state-of-the-art systems.
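
The idea behind grouping requests by KVCache length can be illustrated with a toy model. The sketch below is not AlignedServe's implementation. It assumes, purely for illustration, that a token's cost in a decode iteration is proportional to its KVCache length and that every token in a batch waits for the slowest one. The names `Request`, `fifo_batches`, `length_aligned_batches`, and `iteration_bubble` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    req_id: int
    kv_len: int  # tokens currently held in this request's KVCache (assumed known)


def fifo_batches(requests, batch_size):
    """Baseline: batch requests in the order given."""
    return [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]


def length_aligned_batches(requests, batch_size):
    """Group requests of similar KVCache length by sorting, then chunking."""
    ordered = sorted(requests, key=lambda r: r.kv_len)
    return fifo_batches(ordered, batch_size)


def iteration_bubble(batch):
    """Idle work in one decode iteration under the toy cost model:
    each token waits for the token with the longest KVCache."""
    longest = max(r.kv_len for r in batch)
    return sum(longest - r.kv_len for r in batch)


def total_bubble(batches):
    return sum(iteration_bubble(b) for b in batches)


# Illustrative workload: short and long contexts interleaved.
kv_lens = [100, 4000, 120, 3900, 110, 4100, 90, 4050]
requests = [Request(i, n) for i, n in enumerate(kv_lens)]

fifo = fifo_batches(requests, batch_size=4)
aligned = length_aligned_batches(requests, batch_size=4)

print("FIFO bubble:   ", total_bubble(fifo))     # 15930
print("Aligned bubble:", total_bubble(aligned))  # 410
```

In this toy workload, mixing short and long contexts in a batch leaves most tokens waiting on the longest KVCache, while aligning lengths shrinks the idle work sharply. Sorting needs a large pool of candidate requests to find good matches, which is consistent with the abstract's motivation for holding many in-flight requests in CPU memory.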
