April 27, 2026 · Open Access

Where the Time Goes: Analysis of a Public LLM Serving System

Abstract

In this study, we characterize serving traces collected from Public AI's deployment of Apertus, an open-source Large Language Model (LLM). The trace spans roughly five months (September 2025 to January 2026) and contains 337K requests. We analyze request sizes, token and timing behaviour, latency, model-size effects, and temporal patterns. Several findings contradict common assumptions: (1) time-to-first-token is often driven by queuing rather than prefill compute, especially for small requests; (2) the 8B and 70B models show nearly the same user-perceived latency despite a 9× parameter gap; (3) a substantial fraction of requests are prefill/queuing-dominated rather than decode-dominated; and (4) observable input features are weak predictors of output size, which makes size-aware scheduling difficult at arrival time. As a contribution to the research community, we will publish the anonymized trace along with its analysis.
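To make the latency terms in findings (1) and (3) concrete, the sketch below splits one request's end-to-end latency into queuing, prefill, and decode time, then labels the request by its larger component. It is only an illustration: the `RequestTiming` fields, the `breakdown` helper, and the "decode longer than time-to-first-token" rule are assumptions made here, not the trace's actual schema or the authors' definitions.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Hypothetical per-request timestamps in seconds (not the trace's real schema)."""
    arrival: float        # request received by the server
    prefill_start: float  # request leaves the queue and prefill begins
    first_token: float    # first output token emitted
    completion: float     # last output token emitted


def breakdown(t: RequestTiming) -> dict:
    """Split latency into queue, prefill, and decode time for one request."""
    queue = t.prefill_start - t.arrival
    prefill = t.first_token - t.prefill_start
    decode = t.completion - t.first_token
    ttft = queue + prefill  # time-to-first-token as seen by the user
    return {
        "queue": queue,
        "prefill": prefill,
        "decode": decode,
        "ttft": ttft,
        "total": ttft + decode,
        # Share of time-to-first-token spent waiting rather than computing.
        "queue_share_of_ttft": queue / ttft if ttft > 0 else 0.0,
        # Assumed rule: a request is decode-dominated only if decode exceeds TTFT.
        "dominant_phase": "decode" if decode > ttft else "prefill/queuing",
    }


if __name__ == "__main__":
    example = RequestTiming(arrival=0.0, prefill_start=0.75,
                            first_token=1.0, completion=1.5)
    print(breakdown(example))
```

In this made-up example, three quarters of the time-to-first-token is queuing, and the request counts as prefill/queuing-dominated because decoding takes less time than reaching the first token.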
