Abstract: AI inference faces competing demands for Performance, developer Productivity, and device Portability, known as the P3 problem. Machine learning compilers (MLCs) aim to address this problem, but their ecosystem is fragmented, with each tool prioritizing a different concern. This paper evaluates the deployment trade-offs of PyTorch-based LLMs on NVIDIA GPUs using four prominent, intertwined MLC tools: torch.compile, TensorRT, XLA, and ONNX Runtime. A dual methodology is used: synthetic PyTorch models isolate individual optimizations, and end-to-end benchmarks with state-of-the-art (SOTA) models (TinyLlama-1.1B, Llama-2-7B) measure real-world performance. The findings reveal that reaching the peak performance of Ahead-Of-Time (AOT) compilation requires architecture-specific tools such as TensorRT-LLM, which are necessary for SOTA LLMs but unusable for PyTorch models. Just-In-Time (JIT) solutions such as torch.compile and its backends are flexible and portable, and are compatible with all tested models, but they do not consistently accelerate LLMs. The choice of MLC therefore depends on P3 considerations and model architecture.
Carmona-Martínez et al. (Fri,) studied this question.