September 10, 2025

Energy Efficient or Exhaustive? Benchmarking Power Consumption of LLM Inference Engines

Key Points

Benchmarking power consumption during large language model inference reveals differences in energy efficiency among inference engines.
Power consumption is measured across GPU, CPU, and DRAM in two lifecycle stages: setup (engine initialization and model loading) and token generation.
The per-stage, per-component breakdown identifies energy bottlenecks in the inference lifecycle.
The analysis offers deeper insight into the energy efficiency of modern LLM inference engines.

Abstract

Large Language Models (LLMs) have made remarkable advances in recent years and have revolutionized the field of natural language processing. To reduce latency and improve inference throughput, many inference engines have been proposed, such as vLLM, TensorRT-LLM, and DeepSpeed. However, there is no comprehensive analysis of the power consumption and energy efficiency of these inference engines. In this paper, we benchmark the power consumption of LLM inference engines on a single GPU node with two H100 GPUs and provide a fine-grained analysis by decomposing the inference lifecycle into two stages: the setup stage, which includes engine initialization and model loading, and the token generation stage. For each stage, we further measure power consumption across key system components, including GPU, CPU, and DRAM. This breakdown allows us to identify energy bottlenecks in the inference lifecycle and gain deeper insight into the energy efficiency of modern inference engines.
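
The per-stage, per-component breakdown described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's measurement harness. It assumes power readings in watts have already been collected at known timestamps and tagged with a lifecycle stage, and it converts them to energy in joules with the trapezoidal rule. The names `PowerSample`, `energy_by_stage`, and `joules_per_token`, as well as all sample values and the token count, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PowerSample:
    t: float      # timestamp in seconds
    stage: str    # "setup" or "generation" (labels assumed for this sketch)
    watts: dict   # component name -> instantaneous power in watts


def energy_by_stage(samples):
    """Integrate power over time with the trapezoidal rule.

    Each interval between consecutive samples is attributed to the stage
    of the sample that starts it. Returns {(stage, component): joules}.
    """
    totals = {}
    for prev, cur in zip(samples, samples[1:]):
        dt = cur.t - prev.t
        for comp, p0 in prev.watts.items():
            p1 = cur.watts[comp]
            key = (prev.stage, comp)
            totals[key] = totals.get(key, 0.0) + 0.5 * (p0 + p1) * dt
    return totals


def joules_per_token(totals, stage, num_tokens):
    """Sum energy over all components of one stage and divide by token count."""
    stage_energy = sum(e for (s, _), e in totals.items() if s == stage)
    return stage_energy / num_tokens


# Synthetic readings (made up for illustration, not measured data).
samples = [
    PowerSample(0.0, "setup", {"gpu": 100.0, "cpu": 50.0, "dram": 10.0}),
    PowerSample(1.0, "setup", {"gpu": 100.0, "cpu": 50.0, "dram": 10.0}),
    PowerSample(2.0, "generation", {"gpu": 300.0, "cpu": 60.0, "dram": 20.0}),
    PowerSample(3.0, "generation", {"gpu": 300.0, "cpu": 60.0, "dram": 20.0}),
    PowerSample(4.0, "generation", {"gpu": 300.0, "cpu": 60.0, "dram": 20.0}),
]

totals = energy_by_stage(samples)
for (stage, comp), joules in sorted(totals.items()):
    print(f"{stage:>10} {comp:>4}: {joules:.1f} J")

# Hypothetical token count for the generation stage.
print(f"generation energy per token: {joules_per_token(totals, 'generation', 100):.2f} J")
```

In a real setup the readings would come from hardware counters or vendor tools for each component; the sketch leaves that out and focuses only on how a stage-by-component energy table can be derived from sampled power.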
