Serving systems for embedding, LLM, and other matrix-multiplication-dominated inference workloads rely on batching for efficient hardware utilization. We observe that batching efficiency exhibits a sharp input-size-dependent structure driven by the transition between memory-bound and compute-bound regimes: small inputs can be batched flexibly across heterogeneous sizes, while large inputs require near-uniformity, leading to a rapid collapse in batching efficiency. This produces a characteristic blade-like ("razor's edge") shape in the batch performance landscape. We present the Razor's Edge batching scheduler, a practical framework that combines (i) dynamic-programming-based throughput optimization over sorted requests, (ii) production-oriented next-batch selection strategies (FIFO, MINMAX, and GUARDED BATCH SIZE), and (iii) startup-time-efficient model benchmarking that builds batch timing estimators from direct measurements on the same hardware where the model is deployed. A central novelty claim in this paper is this measurement-to-optimizer bridge: instead of relying on analytic proxy cost models, we benchmark the deployed model/hardware pair and feed those empirical timings directly into the DP cost table used for scheduling decisions. We also introduce a practical visualization method for quantifying batching efficiency improvements when expanding the allowed maximum batch size from (N-1) to (N), producing the characteristic "razor's edge" contour plots. The approach is designed for real-time online serving with queueing. Our claims are scoped to "ahead-of-time variable-size batching for encoder-style inference" evaluated in this paper, not to universal superiority across all serving stacks. We demonstrate the scheduler's efficacy through a 47% throughput increase on a CPU embedding workload (jina-embeddings-v2-base-en), a 26% throughput increase on a GPU embedding workload (BAAI/bge-m3), and controllable latency/throughput trade-offs across the final strategy set.
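The abstract does not spell out the DP formulation, so the following is only a minimal sketch of one way a dynamic program over sorted requests could consume a timing table. It assumes that each batch is padded to its longest member, that batches are contiguous runs in the sorted order, and that minimizing the total estimated time for a fixed set of requests stands in for maximizing throughput. The names `plan_batches` and `toy_batch_time_us`, and the toy cost numbers, are hypothetical; in the setting described above the cost function would be replaced by timings measured on the deployed model/hardware pair.

```python
from typing import Callable, List, Tuple


def toy_batch_time_us(batch_size: int, padded_len: int) -> int:
    """Hypothetical stand-in for a measured timing table (microseconds).

    A fixed floor mimics the memory-bound regime; above it, cost grows
    with batch_size * padded_len, mimicking the compute-bound regime.
    """
    return max(1000, batch_size * padded_len * 10)


def plan_batches(
    lengths: List[int],
    batch_time: Callable[[int, int], int],
    max_batch: int,
) -> Tuple[int, List[List[int]]]:
    """Split sorted request lengths into contiguous batches of at most
    max_batch items so that the summed estimated batch time is minimal."""
    order = sorted(lengths)
    n = len(order)
    best = [0] + [float("inf")] * n  # best[i]: min time for first i requests
    cut = [0] * (n + 1)              # cut[i]: start index of the last batch

    for end in range(1, n + 1):
        for start in range(max(0, end - max_batch), end):
            # The batch order[start:end] is padded to its longest member.
            cost = best[start] + batch_time(end - start, order[end - 1])
            if cost < best[end]:
                best[end] = cost
                cut[end] = start

    # Walk the cut points backwards to recover the batches.
    batches: List[List[int]] = []
    end = n
    while end > 0:
        start = cut[end]
        batches.append(order[start:end])
        end = start
    batches.reverse()
    return best[n], batches


if __name__ == "__main__":
    total_us, batches = plan_batches(
        [500, 8, 510, 12, 10], toy_batch_time_us, max_batch=4
    )
    print(total_us, batches)
```

Under this toy cost model the three short requests share one batch at the floor cost, while the two long requests are each run alone, because grouping them (padded to 510) would cost slightly more than running them separately.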
Arrman Anicket Saha
Arrman Anicket Saha (Tue,) studied this question.
synapsesocial.com/papers/69cf5ede5a333a821460d811 — DOI: https://doi.org/10.5281/zenodo.19360137