In this study, we present a characterization of serving traces collected from Public AI's serving of Apertus, an open-source Large Language Model (LLM). The trace spans roughly five months (September 2025 to January 2026) and contains 337K requests. We analyzed request sizes, token and timing behaviour, latency, model-size effects, and temporal patterns. Several of our findings do not align with common assumptions: (1) time-to-first-token is often driven by queuing rather than prefill compute, especially for small requests; (2) the 8B and 70B models show nearly the same user-perceived latency despite a 9× parameter gap; (3) a substantial fraction of requests are prefill/queuing-dominated rather than decode-dominated; and (4) observable input features are weak predictors of output, which makes size-aware scheduling difficult at arrival time. As a contribution to the research community, we will publish this anonymized trace along with its analysis.
Demiray et al. studied this question.
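To make the latency terms in the abstract concrete, the following is a minimal sketch of how a single request's time-to-first-token might be split into queuing and prefill time and compared against decode time. The record layout (`RequestTiming` and its four timestamps) is hypothetical and is not the schema of the published trace, and the rule used to label a request "decode-dominated" is an illustrative assumption rather than the authors' definition.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Hypothetical per-request timestamps in seconds (not the trace schema)."""
    arrival: float        # request received by the serving system
    prefill_start: float  # request leaves the queue and prefill begins
    first_token: float    # first output token is emitted
    last_token: float     # final output token is emitted


def breakdown(r: RequestTiming) -> dict:
    """Split end-to-end latency into queue, prefill, and decode phases."""
    queue = r.prefill_start - r.arrival
    prefill = r.first_token - r.prefill_start
    decode = r.last_token - r.first_token
    return {
        "queue": queue,
        "prefill": prefill,
        "decode": decode,
        "ttft": queue + prefill,  # time-to-first-token
    }


def ttft_queue_share(r: RequestTiming) -> float:
    """Fraction of time-to-first-token spent waiting in the queue."""
    b = breakdown(r)
    return b["queue"] / b["ttft"] if b["ttft"] > 0 else 0.0


def dominant_phase(r: RequestTiming) -> str:
    """Assumed rule: a request is decode-dominated when decode time
    exceeds time-to-first-token, otherwise prefill/queuing-dominated."""
    b = breakdown(r)
    return "decode" if b["decode"] > b["ttft"] else "prefill/queuing"


if __name__ == "__main__":
    short_request = RequestTiming(0.0, 0.75, 1.0, 1.5)
    print(breakdown(short_request))
    print(ttft_queue_share(short_request), dominant_phase(short_request))
```

Under these assumptions, a request that waits 0.75 s in the queue, spends 0.25 s in prefill, and decodes for 0.5 s has three quarters of its time-to-first-token attributable to queuing and is labeled prefill/queuing-dominated.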