What question did this study set out to answer?

The aim is to optimize batching efficiency in embedding and matrix-multiplication workloads while managing latency.

April 3, 2026 · Open Access

Razor's Edge: Throughput-Optimized Dynamic Batching with Latency Objectives

Key Points

Developed a batching scheduler that uses dynamic programming over sorted requests to optimize throughput.
Implemented production-oriented next-batch selection strategies: FIFO, MINMAX, and GUARDED BATCH SIZE.
Built batch timing estimators by benchmarking the model directly on the hardware where it is deployed.
Visualized, as contour plots, the batching-efficiency gain from raising the allowed maximum batch size from N-1 to N.
Achieved a 47% throughput increase on a CPU embedding workload (jina-embeddings-v2-base-en).
Obtained a 26% throughput increase on a GPU embedding workload (BAAI/bge-m3).
Demonstrated controllable latency/throughput trade-offs across the final strategy set.
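
As a rough illustration of the direct-measurement point above, the sketch below times a callable over a grid of (batch size, padded length) pairs and records the median wall-clock time for each. Everything in it is a hypothetical stand-in: `build_timing_table`, `run_batch`, the grid, and the use of the median are assumptions made for illustration, not the paper's benchmarking procedure.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, Tuple


def build_timing_table(
    run_batch: Callable[[int, int], None],
    batch_sizes: Iterable[int],
    seq_lens: Iterable[int],
    repeats: int = 3,
) -> Dict[Tuple[int, int], float]:
    """Measure run_batch(batch_size, seq_len) and return median seconds per grid cell.

    run_batch is a hypothetical hook that executes one forward pass of the
    deployed model on a batch of the given shape.
    """
    sizes = list(batch_sizes)   # materialize so the grids can be iterated repeatedly
    lens = list(seq_lens)
    table: Dict[Tuple[int, int], float] = {}
    for b in sizes:
        for s in lens:
            samples = []
            for _ in range(repeats):
                start = time.perf_counter()
                run_batch(b, s)
                samples.append(time.perf_counter() - start)
            table[(b, s)] = statistics.median(samples)
    return table


# Usage with a placeholder workload; a real run would call the model instead.
if __name__ == "__main__":
    fake_model = lambda b, s: sum(range(b * s))
    timings = build_timing_table(fake_model, [1, 2, 4], [16, 64])
    for shape, seconds in sorted(timings.items()):
        print(shape, f"{seconds:.6f}s")
```

A table of this kind could then serve as the cost lookup for a scheduler, with interpolation for shapes that were not measured; how the paper handles unmeasured shapes is not stated here.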

Abstract

Serving systems for embedding, LLM, and other matrix-multiplication-dominated inference workloads rely on batching for efficient hardware utilization. We observe that batching efficiency exhibits a sharp input-size-dependent structure driven by the transition between memory-bound and compute-bound regimes: small inputs can be batched flexibly across heterogeneous sizes, while large inputs require near-uniformity, leading to a rapid collapse in batching efficiency. This produces a characteristic blade-like ("razor's edge") shape in the batch performance landscape.

We present the Razor's Edge batching scheduler, a practical framework that combines (i) dynamic-programming-based throughput optimization over sorted requests, (ii) production-oriented next-batch selection strategies (FIFO, MINMAX, and GUARDED BATCH SIZE), and (iii) startup-time-efficient model benchmarking that builds batch timing estimators from direct measurements on the same hardware where the model is deployed. A central novelty claim in this paper is this measurement-to-optimizer bridge: instead of relying on analytic proxy cost models, we benchmark the deployed model/hardware pair and feed those empirical timings directly into the DP cost table used for scheduling decisions. We also introduce a practical visualization method for quantifying batching efficiency improvements when expanding the allowed maximum batch size from (N-1) to (N), producing the characteristic "razor's edge" contour plots. The approach is designed for real-time online serving with queueing. Our claims are scoped to "ahead-of-time variable-size batching for encoder-style inference" evaluated in this paper, not to universal superiority across all serving stacks.

We demonstrate the scheduler's efficacy through a 47% throughput increase on a CPU embedding workload (jina-embeddings-v2-base-en), a 26% throughput increase on a GPU embedding workload (BAAI/bge-m3), and controllable latency/throughput trade-offs across the final strategy set.
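
To make the idea of dynamic programming over sorted requests concrete, here is a minimal sketch of a generic contiguous-partition DP. It sorts request lengths, then chooses batch boundaries that minimize total estimated time, where each batch costs `cost(batch_size, longest_length_in_batch)`. The function `plan_batches`, the `cost` signature, and `toy_cost` are all assumptions made for illustration; the abstract does not give the paper's actual recurrence, cost-table layout, or latency handling.

```python
from typing import Callable, List, Sequence, Tuple


def plan_batches(
    lengths: Sequence[int],
    cost: Callable[[int, int], float],
    max_batch_size: int,
) -> Tuple[float, List[List[int]]]:
    """Split sorted request lengths into contiguous batches of minimal total cost.

    cost(batch_size, padded_len) is a hypothetical estimator, e.g. a lookup
    into empirically measured timings.
    """
    order = sorted(lengths)
    n = len(order)
    best = [0.0] + [float("inf")] * n   # best[i]: minimal time for the first i requests
    prev = [0] * (n + 1)                # prev[i]: start index of the last batch in that plan
    for i in range(1, n + 1):
        for j in range(max(0, i - max_batch_size), i):
            # Batch order[j:i] is padded to its longest member, order[i - 1].
            total = best[j] + cost(i - j, order[i - 1])
            if total < best[i]:
                best[i] = total
                prev[i] = j
    # Walk the back-pointers to recover the chosen batches.
    batches: List[List[int]] = []
    i = n
    while i > 0:
        j = prev[i]
        batches.append(order[j:i])
        i = j
    batches.reverse()
    return best[n], batches


def toy_cost(batch_size: int, padded_len: int) -> float:
    """Illustrative cost only: fixed launch overhead plus work proportional to padded tokens."""
    return 1.0 + 0.01 * batch_size * padded_len


if __name__ == "__main__":
    total, plan = plan_batches([512, 8, 8, 512], toy_cost, max_batch_size=4)
    print(total, plan)
```

Under this toy cost, padding short requests up to a long one is expensive, so the plan keeps short and long requests in separate batches. With measured timings in place of `toy_cost`, the same recurrence would reflect whatever structure the hardware actually exhibits.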

Authors

Arrman Anicket Saha
