Large Language Models (LLMs) have made remarkable advances in recent years and have revolutionized the field of natural language processing. To reduce latency and improve inference throughput, many inference engines have been proposed, such as vLLM, TensorRT-LLM, and DeepSpeed. However, there is no comprehensive analysis of the power consumption and energy efficiency of these inference engines. In this paper, we benchmark the power consumption of LLM inference engines on a single GPU node with two H100 GPUs and provide a fine-grained analysis by decomposing the inference lifecycle into two stages: the setup stage, which includes engine initialization and model loading, and the token generation stage. For each stage, we further measure power consumption across key system components, including the GPU, CPU, and DRAM. This breakdown allows us to identify energy bottlenecks in the inference lifecycle and gain deeper insight into the energy efficiency of modern inference engines.
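The per-stage, per-component breakdown described in the abstract can be illustrated with a minimal sketch. It assumes that timestamped power readings (in watts) are already available for each component and that each reading is tagged with its lifecycle stage. The `Sample` type, the `energy_by_stage` function, and the numbers below are hypothetical and are not taken from the paper; the abstract does not state how the authors convert power readings into energy, so trapezoidal integration is used here only as one common choice.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One hypothetical power reading (not the paper's data format)."""
    t: float      # timestamp in seconds
    stage: str    # e.g. "setup" or "generation"
    watts: dict   # component name -> instantaneous power in watts


def energy_by_stage(samples):
    """Integrate power over time (trapezoidal rule) per stage and component.

    Returns {stage: {component: joules}}. Intervals whose two endpoints
    belong to different stages are skipped, so no energy is attributed
    across a stage boundary.
    """
    totals = {}
    for prev, cur in zip(samples, samples[1:]):
        if prev.stage != cur.stage:
            continue
        dt = cur.t - prev.t
        stage_totals = totals.setdefault(cur.stage, {})
        for comp, w in cur.watts.items():
            avg_power = 0.5 * (prev.watts[comp] + w)
            stage_totals[comp] = stage_totals.get(comp, 0.0) + avg_power * dt
    return totals


# Synthetic, made-up readings purely for illustration.
samples = [
    Sample(0.0, "setup", {"gpu": 100.0, "cpu": 50.0, "dram": 10.0}),
    Sample(1.0, "setup", {"gpu": 100.0, "cpu": 50.0, "dram": 10.0}),
    Sample(2.0, "setup", {"gpu": 100.0, "cpu": 50.0, "dram": 10.0}),
    Sample(3.0, "generation", {"gpu": 300.0, "cpu": 60.0, "dram": 20.0}),
    Sample(4.0, "generation", {"gpu": 300.0, "cpu": 60.0, "dram": 20.0}),
]

result = energy_by_stage(samples)
for stage, per_component in result.items():
    print(stage, per_component)
```

Dividing each stage's total by the number of tokens generated would give an energy-per-token figure, which is one way to compare engines; whether the paper reports that metric is not stated in the abstract.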
Chenxu Niu
Wei Zhang
Yongjian Zhao
ACM SIGEnergy Energy Informatics Review
Lawrence Berkeley National Laboratory
Texas Tech University
Niu et al. studied this question.
www.synapsesocial.com/papers/68c1b81f54b1d3bfb60ec7d1 — DOI: https://doi.org/10.1145/3757892.3757900