What question did this study set out to answer?

This research aims to evaluate the trade-offs in performance, productivity, and portability of various machine learning compilers for large language models on NVIDIA GPUs.

May 17, 2026 · Open Access

Characterization of machine learning compilers for LLM inference on NVIDIA GPUs

Key Points

This research aims to evaluate the trade-offs in performance, productivity, and portability of various machine learning compilers for large language models on NVIDIA GPUs.
Evaluated PyTorch-based LLMs using prominent machine learning compilers including TensorRT, XLA, and ONNX Runtime.
Used synthetic PyTorch models for optimization isolation and end-to-end benchmarks with state-of-the-art models (TinyLlama-1.1B, Llama-2-7B).
Found that AOT compilation with architecture-specific tools like TensorRT-LLM offers peak performance for SOTA LLMs.
Noted that JIT solutions provide flexibility and compatibility with all tested models but don't consistently enhance performance.

Abstract

AI inference is torn between Performance, developer Productivity, and device Portability: the P3 problem. Machine learning compilers (MLCs) aim to address this, but their ecosystem is fragmented, with each tool prioritizing a different concern. This paper evaluates the deployment trade-offs of PyTorch-based LLMs on NVIDIA GPUs using four prominent, intertwined MLC tools: […], TensorRT, XLA, and ONNX Runtime. A dual methodology is used: synthetic PyTorch models isolate individual optimizations, and end-to-end benchmarks with state-of-the-art (SOTA) models (TinyLlama-1.1B, Llama-2-7B) measure real-world performance. The findings show that the peak performance of Ahead-Of-Time (AOT) compilation requires architecture-specific tools such as TensorRT-LLM, which are necessary for SOTA LLMs but unusable for PyTorch models. Just-In-Time (JIT) solutions such as […] and its backends are flexible and portable, and compatible with all tested models, but they do not consistently accelerate LLMs. The choice of MLC therefore depends on P3 considerations and model architecture.
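When comparing JIT and AOT compilers as the abstract describes, one practical concern is keeping one-time compilation cost separate from steady-state latency. The following is a minimal sketch of such a timing harness using only the Python standard library; the `benchmark` and `speedup` helpers and their parameters are illustrative assumptions, not the paper's actual methodology.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], warmup: int = 3, iters: int = 20) -> Dict[str, float]:
    """Time fn, reporting first-call and steady-state latency separately.

    Hypothetical helper for illustration. On a GPU, fn would also need to
    block until the device has finished (for example by synchronizing)
    so that the wall-clock time covers the actual kernel execution.
    """
    # First call: for a JIT-compiled callable this includes compilation time.
    start = time.perf_counter()
    fn()
    first_call_s = time.perf_counter() - start

    # Remaining warm-up calls are executed but not recorded.
    for _ in range(max(warmup - 1, 0)):
        fn()

    # Steady-state samples.
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)

    return {
        "first_call_s": first_call_s,
        "median_s": statistics.median(samples),
    }


def speedup(baseline: Dict[str, float], candidate: Dict[str, float]) -> float:
    """Steady-state speedup of candidate over baseline (> 1 means faster)."""
    return baseline["median_s"] / candidate["median_s"]
```

Reporting `first_call_s` alongside `median_s` makes the trade-off visible: a JIT backend can show a competitive median while still paying a large first-call cost, which matters for short-lived inference jobs.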
