Serving systems for embedding, LLM, and other matrix-multiplication-dominated inference workloads relyon batching for efficient hardware utilization. We observe that batching efficiency exhibits a sharp input-size-dependent structure driven by the transition between memory-bound and compute-bound regimes: small inputscan be batched flexibly across heterogeneous sizes, while large inputs require near-uniformity, leading to arapid collapse in batching efficiency. This produces a characteristic blade-like (”razor's edge”) shape in thebatch performance landscape.We present the Razor's Edge batching scheduler, a practical framework that combines (i) dynamic-programming-based throughput optimization over sorted requests, (ii) production-oriented next-batch selection strategies(FIFO, MINMAX, and GUARDED BATCH SIZE), and (iii) startup-time-efficient model benchmarking thatbuilds batch timing estimators from direct measurements on the same hardware where the model is deployed.A central novelty claim in this paper is this measurement-to-optimizer bridge: instead of relying on ana-lytic proxy cost models, we benchmark the deployed model/hardware pair and feed those empirical timingsdirectly into the DP cost table used for scheduling decisions. We also introduce a practical visualizationmethod for quantifying batching efficiency improvements when expanding the allowed maximum batch sizefrom (N-1) to (N), producing the characteristic ”razor's edge” contour plots. The approach is designed forreal-time online serving with queueing. Our claims are scoped to ”ahead-of-time variable-size batchingfor encoder-style inference” evaluated in this paper, not to universal superiority across all serving stacks.We demonstrate the scheduler's efficacy through a 47% throughput increase on a CPU embedding work-load (jina-embeddings-v2-base-en), a 26% throughput increase on a GPU embedding workload(BAAI/bge-m3), and controllable latency/throughput trade-offs across the final strategy set.
Arrman Anicket Saha (Tue,) studied this question.