Abstract
AI inference faces a three-way tension between Performance, developer Productivity, and device Portability (the P3 problem). Machine Learning Compilers (MLCs) aim to resolve it, but their ecosystem is fragmented, with each tool prioritizing a single concern. This paper evaluates the trade-offs of deploying PyTorch-based LLMs on NVIDIA GPUs using four prominent, intertwined MLC tools: torch.compile, TensorRT, XLA, and ONNX Runtime. It uses a dual methodology: synthetic PyTorch models isolate individual optimizations, and end-to-end benchmarks with State-of-The-Art (SoTA) models (TinyLlama-1.1B, Llama-2-7B) measure real-world performance. The findings reveal that peak performance from Ahead-Of-Time (AOT) compilation requires architecture-specific tools like TensorRT-LLM, which are necessary for SoTA LLMs but unusable for PyTorch models. Just-In-Time (JIT) solutions like torch.compile and its backends prove flexible and portable: they are compatible with all tested models but unable to accelerate LLMs consistently. The choice of MLC therefore depends on P3 considerations and model architecture.
Carmona et al. (Fri,) studied this question.