Large Language Model (LLM) inference serving faces a fundamental challenge due to the distinct characteristics of its two phases: compute-intensive prefill and memory-intensive decode. Existing scheduling strategies often prioritize one phase over the other, forcing a difficult tradeoff between system throughput and request latency. Prefill-prioritizing schedulers improve throughput but introduce significant latency jitter (generation stalls) by interfering with ongoing decodes. Conversely, decode-prioritizing schedulers maintain low latency but underutilize GPU resources, resulting in low throughput. This paper revisits the technique of chunked prefills and demonstrates its efficacy in mitigating this tradeoff. By splitting large prefill computations into smaller chunks and interleaving them with decode operations using stall-free batching, we can leverage the compute slack inherent in the decode phase. This approach significantly improves serving capacity under strict latency constraints, minimizes generation stalls, and reduces pipeline bubbles in distributed deployments, enabling efficient and responsive inference.
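The interleaving described in the abstract can be illustrated with a minimal scheduling sketch. It is not the authors' implementation: the `Request` record, the `build_batch` function, and the per-iteration `token_budget` used to size chunks are all assumptions introduced for illustration, since the abstract does not say how chunk sizes are chosen. The sketch admits every ongoing decode first (one token each), so no decode is delayed, and then spends the remaining budget on a chunk of a pending prefill.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative."""
    rid: str
    prompt_len: int      # total prompt tokens to prefill
    prefilled: int = 0   # prompt tokens already processed

    @property
    def in_decode(self) -> bool:
        # A request starts decoding once its whole prompt is prefilled.
        return self.prefilled >= self.prompt_len


def build_batch(requests, token_budget):
    """Form one iteration's batch as a list of (rid, num_tokens) pairs.

    token_budget is an assumed cap on tokens processed per iteration.
    """
    batch = []
    budget = token_budget
    # 1. Ongoing decodes go first, one token each, so they never stall.
    for r in requests:
        if r.in_decode and budget > 0:
            batch.append((r.rid, 1))
            budget -= 1
    # 2. Leftover budget is spent on prefill chunks, in arrival order.
    for r in requests:
        if budget == 0:
            break
        if not r.in_decode:
            chunk = min(budget, r.prompt_len - r.prefilled)
            batch.append((r.rid, chunk))
            r.prefilled += chunk
            budget -= chunk
    return batch


if __name__ == "__main__":
    # "a" is already decoding; "b" has a 10-token prompt still to prefill.
    reqs = [Request("a", 4, prefilled=4), Request("b", 10)]
    for step in range(4):
        print(step, build_batch(reqs, token_budget=5))
```

With a budget of 5 tokens, request "b" is prefilled over three iterations in chunks of 4, 4, and 2 tokens while "a" keeps producing one token per iteration; from the fourth iteration on, both requests decode together.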
Ashutosh Agrawal
University of Houston
Nitin Kedia
Cisco College
Ashish Panwar
Microsoft Research (India)
ACM SIGOPS Operating Systems Review
Georgia Institute of Technology
Microsoft Research (India)
Agrawal et al. (Mon,) studied this question.
synapsesocial.com/papers/68c1b81f54b1d3bfb60ec619 — DOI: https://doi.org/10.1145/3759441.3759444