September 10, 2025

Efficient LLM Inference via Chunked Prefills

Key Points

Chunked prefilling significantly improves throughput while maintaining low latency, enabling efficient LLM inference.
The approach mitigates latency jitter and minimizes generation stalls by interleaving prefill computations with decoding.
Existing scheduling strategies often lead to tradeoffs between system throughput and request latency, which chunked prefills aim to resolve.
In distributed deployments, the method reduces pipeline bubbles, which improves pipeline efficiency and keeps inference responsive.

Abstract

Large Language Model (LLM) inference serving faces a fundamental challenge due to the distinct characteristics of its two phases: compute-intensive prefill and memory-intensive decode. Existing scheduling strategies often prioritize one phase over the other, leading to a difficult tradeoff between system throughput and request latency. Prefill-prioritizing schedulers improve throughput but introduce significant latency jitter (generation stalls) by interfering with ongoing decodes. Conversely, decode-prioritizing schedulers maintain low latency but underutilize GPU resources, resulting in low throughput. This paper revisits the technique of chunked prefills, demonstrating its efficacy in mitigating this tradeoff. By splitting large prefill computations into smaller, manageable chunks and interleaving them with decode operations using stall-free batching, we can leverage the compute slack inherent in the decode phase. This approach significantly improves serving capacity under strict latency constraints, minimizes generation stalls, and reduces pipeline bubbles in distributed deployments, enabling efficient and responsive inference.
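
The batching idea described in the abstract can be illustrated with a minimal sketch. It assumes a fixed per-iteration token budget; the names `Request`, `build_batch`, and `token_budget` are hypothetical and introduced here for illustration only, not taken from the paper. Ongoing decodes are admitted first so that no generation is paused, and whatever budget remains is filled with a chunk of a pending prefill.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request state; field names are illustrative only."""
    rid: str
    prompt_len: int      # total prompt tokens that need prefill
    prefilled: int = 0   # prompt tokens already processed

    @property
    def in_decode(self) -> bool:
        # A request starts decoding once its whole prompt is prefilled.
        return self.prefilled >= self.prompt_len


def build_batch(requests, token_budget):
    """Form one iteration's batch as a list of (rid, phase, num_tokens).

    token_budget is an assumed cap on tokens processed per iteration.
    """
    batch = []
    budget = token_budget

    # 1. Admit decodes first: each decoding request adds exactly one token,
    #    so an ongoing generation is never stalled by a new prefill.
    for r in requests:
        if r.in_decode and budget > 0:
            batch.append((r.rid, "decode", 1))
            budget -= 1

    # 2. Fill the leftover budget with prefill chunks, splitting a long
    #    prompt across iterations when it does not fit.
    for r in requests:
        if budget == 0:
            break
        if not r.in_decode:
            chunk = min(budget, r.prompt_len - r.prefilled)
            batch.append((r.rid, "prefill", chunk))
            r.prefilled += chunk
            budget -= chunk

    return batch


if __name__ == "__main__":
    reqs = [Request("a", prompt_len=4, prefilled=4), Request("b", prompt_len=10)]
    for _ in range(3):
        print(build_batch(reqs, token_budget=6))
```

In this toy run, request `a` emits one decode token in every iteration while the 10-token prompt of request `b` is prefilled in two chunks of 5 tokens; from the third iteration on, both requests decode together. A real scheduler would also need to handle admission control, KV-cache memory limits, and request completion, all of which are omitted here.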

Authors

Ashutosh Agrawal

University of Houston

Nitin Kedia

Cisco College

Ashish Panwar

Microsoft Research (India)

Journals

ACM SIGOPS Operating Systems Review

Institutions

Georgia Institute of Technology

Microsoft Research (India)
