April 19, 2024

The Importance of Workload Choice in Evaluating LLM Inference Systems

Abstract

The success of Large Language Models (LLMs) across a wide range of applications and use cases has created the need for faster and more scalable systems for LLM inference. These systems speed up LLM inference by optimizing scheduling decisions or efficiently managing the available memory. However, most of them use synthetic datasets and target latency-critical scenarios in their evaluation, thereby overlooking a considerable part of real-world use cases and workloads. As a response, this paper presents an extensive experimental evaluation that aims to capture the impact of the workload used for evaluation and quantify the benefit derived from higher memory availability. Our analysis shows that LLMs can achieve 3× higher throughput for text generation and question-answering use cases compared to text summarization and conversational ones. The latter ones seem to exhibit low levels of performance due to their demanding input sizes. In addition, non-latency-critical inference services achieve 2.3× higher throughput when 4× more memory is available. In conclusion, this paper aims to highlight the importance and impact of the chosen workloads in the evaluation of systems for LLM inference.

Cite This Study

Papaioannou et al. studied this question.

synapsesocial.com/papers/68e6e666b6db643587661ae5 https://doi.org/10.1145/3642970.3655823
