October 12, 2025 | Open Access

Characterization of Machine Learning Compilers for LLM inference on NVIDIA GPUs

Key Points

Findings show architecture-specific tools like TensorRT-LLM are essential for peak performance of state-of-the-art LLMs.
Performance evaluation indicates that Just-In-Time solutions like torch.compile offer portability but lack consistent acceleration.
Study uses synthetic PyTorch models to isolate optimizations and end-to-end benchmarks with TinyLlama-1.1B and Llama-2-7B to measure real-world performance.
Choice of machine learning compiler strategies should align with performance, productivity, and portability considerations.

Abstract

AI inference must balance three competing goals: Performance, developer Productivity, and device Portability (the P3 problem). Machine Learning Compilers (MLCs) aim to resolve this tension, but their ecosystem is fragmented, with each tool prioritizing a different concern. This paper evaluates the deployment trade-offs of PyTorch-based LLMs on NVIDIA GPUs using four prominent, intertwined MLC tools: torch.compile, TensorRT, XLA, and ONNX Runtime. A dual methodology is used: synthetic PyTorch models isolate individual optimizations, and end-to-end benchmarks with State-of-The-Art (SoTA) models (TinyLlama-1.1B, Llama-2-7B) measure real-world performance. The findings reveal that reaching the peak performance of Ahead-Of-Time (AOT) compilation requires architecture-specific tools like TensorRT-LLM, which are necessary for SoTA LLMs but unusable for PyTorch models. Just-In-Time (JIT) solutions like torch.compile and its backends prove flexible and portable: they are compatible with all tested models but unable to accelerate LLMs consistently. The choice of MLC therefore depends on P3 considerations and model architecture.
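To make the benchmarking idea concrete, the following is a minimal sketch of how a baseline and a compiled variant of the same model could be compared. It is not the harness used in the study: the function names `median_latency` and `speedup`, the warm-up count, and the iteration count are all illustrative assumptions. Warm-up calls are excluded from the timing because a JIT compiler typically pays its compilation cost on the first invocations.

```python
import statistics
import time


def median_latency(fn, warmup=3, iters=20):
    """Return the median wall-clock latency of fn() in seconds.

    The warm-up calls are not timed, so one-off costs such as JIT
    compilation do not distort the steady-state measurement.
    (Hypothetical helper; defaults are arbitrary.)
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_fn, candidate_fn, warmup=3, iters=20):
    """Ratio of baseline to candidate latency; values above 1 mean
    the candidate is faster than the baseline."""
    base = median_latency(baseline_fn, warmup=warmup, iters=iters)
    cand = median_latency(candidate_fn, warmup=warmup, iters=iters)
    return base / cand
```

With PyTorch, one might build `compiled = torch.compile(model)` once and then pass `lambda: model(x)` and `lambda: compiled(x)` as the two callables. On a GPU, each callable would also need to call `torch.cuda.synchronize()` before returning, since CUDA kernels launch asynchronously and wall-clock timing would otherwise capture only the launch overhead.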
