What question did this study set out to answer?

The aim is to diagnose and suggest repairs for defects in large language model inference engines using a static approach.

February 16, 2026 | Open Access

Reliability of LLM Inference Engines from a Static Perspective: Root Cause Analysis and Repair Suggestion via Natural Language Reports

Key Points

Developed a real-world defect dataset from issue reports and developer discussions.
Annotated issues with semantic root cause categories and affected modules.
Implemented a framework for root cause classification and module localization without running code.
Integrated structured repair patterns with a large language model to suggest repairs.
Achieved effective root cause identification and module localization even with limited data.
Showed promising generalization to TensorRT-LLM in a cross-engine evaluation.
Confirmed by human evaluation that repair suggestions are correct and useful.

Abstract

Large Language Model (LLM) inference engines are becoming critical system infrastructure, yet their increasing architectural complexity makes defects difficult to diagnose and repair. Existing reliability studies predominantly focus on model behavior or training frameworks, leaving inference engine bugs underexplored, especially in settings where execution-based debugging is impractical. We present a static, issue-centric approach for automated root cause analysis and repair suggestion generation for LLM inference engines. Based solely on issue reports and developer discussions, we construct a real-world defect dataset and annotate each issue with a semantic root cause category and affected system module. Leveraging text-based representations, our framework performs root cause classification and coarse-grained module localization without requiring code execution or specialized runtime environments. We further integrate structured repair patterns with a large language model to generate interpretable and actionable repair suggestions. Experiments on real-world vLLM issues demonstrate that our approach achieves effective root cause identification and module localization under limited and imbalanced data. A cross-engine evaluation further shows promising generalization to TensorRT-LLM. Human evaluation confirms that the generated repair suggestions are correct, useful, and clearly expressed. Our results indicate that static, issue-level defect analysis is feasible for complex LLM inference engines and is a viable foundation for scalable, automated debugging assistance. The dataset and implementation will be publicly released to facilitate future research.
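
To make the idea of static, text-only root cause classification concrete, the following is a minimal sketch rather than the authors' framework. It assumes a multinomial naive Bayes classifier over bag-of-words tokens, which the abstract does not name; the issue texts, the category labels (`memory_management`, `configuration`, `concurrency`), and the class `NaiveBayesIssueClassifier` are all hypothetical and chosen only for illustration. The point is that a label can be predicted from the issue report alone, without running the engine.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


class NaiveBayesIssueClassifier:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.label_counts = Counter()            # label -> number of issues
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.label_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        # Iterate labels in sorted order so ties resolve deterministically.
        for label in sorted(self.label_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(self.label_counts[label] / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical issue reports and root cause labels, not taken from the paper.
TRAIN = [
    ("CUDA out of memory when KV cache grows during long generation", "memory_management"),
    ("OOM crash after allocating KV cache blocks for large batch", "memory_management"),
    ("Server fails to start because config flag is parsed as string", "configuration"),
    ("Wrong default value for tensor parallel size in config file", "configuration"),
    ("Scheduler deadlock when requests are preempted concurrently", "concurrency"),
    ("Race condition between worker threads causes hang in scheduler", "concurrency"),
]

texts, labels = zip(*TRAIN)
clf = NaiveBayesIssueClassifier().fit(texts, labels)

if __name__ == "__main__":
    print(clf.predict("Engine hangs: deadlock between worker threads"))
    print(clf.predict("Out of memory while allocating KV cache"))
```

The same pattern could in principle be reused for coarse-grained module localization by training a second classifier whose labels are module names instead of root cause categories; whether the paper does this with one model or two is not stated in the abstract.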
