The success of Large Language Models (LLMs) across a wide range of applications and use cases has created the need for faster and more scalable systems for LLM inference. These systems speed up LLM inference by optimizing scheduling decisions or efficiently managing the available memory. However, most of them use synthetic datasets and target latency-critical scenarios in their evaluation, thereby overlooking a considerable part of real-world use cases and workloads. In response, this paper presents an extensive experimental evaluation that aims to capture the impact of the workload used for evaluation and to quantify the benefit derived from higher memory availability. Our analysis shows that LLMs can achieve 3× higher throughput for text generation and question-answering use cases compared to text summarization and conversational ones. The latter seem to exhibit lower performance due to their demanding input sizes. In addition, non-latency-critical inference services achieve 2.3× higher throughput when 4× more memory is available. In conclusion, this paper aims to highlight the importance and impact of the chosen workloads in the evaluation of systems for LLM inference.
Papaioannou et al. studied this question.
www.synapsesocial.com/papers/68e6e666b6db643587661ae5 — DOI: https://doi.org/10.1145/3642970.3655823
Konstantinos Papaioannou
Thaleia Dimitra Doudali
Universidad Politécnica de Madrid
IMDEA Software