What question did this study set out to answer?

The aim is to optimize batching efficiency in embedding and matrix-multiplication workloads while managing latency.

April 3, 2026 · Open Access

Razor's Edge: Throughput-Optimized Dynamic Batching with Latency Objectives

Key Points

Aims to optimize batching efficiency in embedding and other matrix-multiplication-dominated inference workloads while managing latency.
Develops a batching scheduler that uses dynamic programming over sorted requests to optimize throughput.
Implements three next-batch selection strategies: FIFO, MINMAX, and GUARDED BATCH SIZE.
Builds batch timing estimators by benchmarking the model directly on the hardware where it is deployed.
Visualizes batching efficiency with contour plots that show the gain from raising the maximum batch size from (N-1) to (N).
Achieves a 47% throughput increase on a CPU embedding workload (jina-embeddings-v2-base-en).
Achieves a 26% throughput increase on a GPU embedding workload (BAAI/bge-m3).
Demonstrates controllable latency/throughput trade-offs across the final strategy set.

Abstract

Serving systems for embedding, LLM, and other matrix-multiplication-dominated inference workloads rely on batching for efficient hardware utilization. We observe that batching efficiency exhibits a sharp input-size-dependent structure driven by the transition between memory-bound and compute-bound regimes: small inputs can be batched flexibly across heterogeneous sizes, while large inputs require near-uniformity, leading to a rapid collapse in batching efficiency. This produces a characteristic blade-like ("razor's edge") shape in the batch performance landscape.

We present the Razor's Edge batching scheduler, a practical framework that combines (i) dynamic-programming-based throughput optimization over sorted requests, (ii) production-oriented next-batch selection strategies (FIFO, MINMAX, and GUARDED BATCH SIZE), and (iii) startup-time-efficient model benchmarking that builds batch timing estimators from direct measurements on the same hardware where the model is deployed. A central novelty claim in this paper is this measurement-to-optimizer bridge: instead of relying on analytic proxy cost models, we benchmark the deployed model/hardware pair and feed those empirical timings directly into the DP cost table used for scheduling decisions. We also introduce a practical visualization method for quantifying batching efficiency improvements when expanding the allowed maximum batch size from (N-1) to (N), producing the characteristic "razor's edge" contour plots. The approach is designed for real-time online serving with queueing. Our claims are scoped to "ahead-of-time variable-size batching for encoder-style inference" evaluated in this paper, not to universal superiority across all serving stacks. We demonstrate the scheduler's efficacy through a 47% throughput increase on a CPU embedding workload (jina-embeddings-v2-base-en), a 26% throughput increase on a GPU embedding workload (BAAI/bge-m3), and controllable latency/throughput trade-offs across the final strategy set.
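To make the idea of dynamic programming over sorted requests concrete, the following is a minimal sketch and not the paper's implementation. It assumes that batches are contiguous runs in length-sorted order and that a batch's cost depends only on its size and its longest request (as with padding to the maximum length). The names `plan_batches` and `toy_cost` are hypothetical; `toy_cost` is a made-up stand-in for the empirical timing table that the abstract says is measured on the deployed hardware.

```python
from typing import Callable, List, Tuple


def plan_batches(
    lengths: List[int],
    cost: Callable[[int, int], float],
    max_batch: int,
) -> Tuple[float, List[List[int]]]:
    """Split requests into batches that minimize total estimated time.

    Assumption (not from the paper): batches are contiguous in sorted
    order and cost(batch_size, longest_length) estimates one batch's time.
    Returns the total estimated time and batches of original indices.
    """
    # Sort request indices by input length so similar sizes are adjacent.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    sorted_lens = [lengths[i] for i in order]
    n = len(sorted_lens)

    # best[end] = minimal total time to serve the first `end` sorted requests.
    best = [0.0] + [float("inf")] * n
    cut = [0] * (n + 1)  # cut[end] = start position of the last batch

    for end in range(1, n + 1):
        longest = sorted_lens[end - 1]  # last item is the longest in the batch
        for start in range(max(0, end - max_batch), end):
            candidate = best[start] + cost(end - start, longest)
            if candidate < best[end]:
                best[end] = candidate
                cut[end] = start

    # Walk the cut points backwards to recover the chosen batches.
    batches: List[List[int]] = []
    end = n
    while end > 0:
        start = cut[end]
        batches.append([order[i] for i in range(start, end)])
        end = start
    batches.reverse()
    return best[n], batches


def toy_cost(batch_size: int, longest: int) -> float:
    # Hypothetical cost model: fixed launch overhead plus padded token work.
    # A real deployment would look this up in a measured timing table.
    return 1.0 + 0.01 * batch_size * longest


request_lengths = [10, 12, 11, 500, 510]
total_time, batches = plan_batches(request_lengths, toy_cost, max_batch=4)
print(total_time, batches)
```

Under this toy cost, the planner keeps the three short requests together and pairs the two long ones, rather than padding a short request up to a long one. Swapping `toy_cost` for a lookup into measured timings is the only change needed to drive the same recurrence from benchmark data.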
