What question did this study set out to answer?

The research aims to enhance the diagnostic accuracy of incident triage using large language models and retrieval-augmented generation.

April 13, 2026 · Open Access

Title: Enhancing Diagnostic Accuracy and Explainability of Large Language Model Agents in Production Incident Triage through Retrieval-Augmented Generation

Key Points

Integrated large language models with multi-source diagnostic information.
Utilized retrieval-augmented generation to ground outputs in domain-specific data.
Conducted an empirical evaluation in a reinforcement learning environment simulating production incident triage across three scenarios of increasing diagnostic complexity.
Demonstrated improved root cause analysis accuracy in incident diagnosis.
Reduced cognitive load for on-call engineers during troubleshooting tasks.
Enhanced explainability of language model outputs while minimizing hallucinations.

Abstract

This work addresses challenges faced by on-call engineers in diagnosing cloud service incidents, focusing on limitations of traditional manual troubleshooting guides and single-source data reliance. It explores the integration of large language models (LLMs) with automated workflows that collect multi-source diagnostic information to improve root cause analysis accuracy and reduce cognitive load. Retrieval-augmented generation (RAG) is presented as a method to combine LLM generative capabilities with external knowledge retrieval, grounding outputs in up-to-date, domain-specific data to reduce hallucinations and improve explainability. An empirical evaluation is conducted using a reinforcement learning environment simulating production incident triage across three scenarios of increasing diagnostic complexity.
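The retrieval-augmented generation idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the runbook snippets in `KNOWLEDGE_BASE`, the helper names `retrieve` and `build_prompt`, and the bag-of-words cosine similarity standing in for whatever retriever the authors used. The LLM call itself is left out; the sketch only shows how retrieved, domain-specific context might be attached to an incident description so the model's answer can cite it.

```python
import math
import re
from collections import Counter

# Hypothetical runbook snippets standing in for a domain-specific knowledge base.
KNOWLEDGE_BASE = [
    "High API latency is often caused by database connection pool exhaustion.",
    "Disk full alerts on log volumes are resolved by rotating or pruning old logs.",
    "Certificate expiry causes TLS handshake failures between services.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, k=2):
    """Return the k documents most similar to the query (toy retriever)."""
    query_vec = Counter(tokenize(query))
    ranked = sorted(
        docs,
        key=lambda doc: cosine(query_vec, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(incident, docs, k=2):
    """Assemble a prompt that grounds the model in retrieved context."""
    context = retrieve(incident, docs, k)
    numbered = "\n".join(f"[{i + 1}] {text}" for i, text in enumerate(context))
    return (
        "Context:\n"
        f"{numbered}\n\n"
        f"Incident: {incident}\n"
        "Identify the likely root cause and cite the context entries that support it."
    )


if __name__ == "__main__":
    print(build_prompt("TLS handshake failures after certificate rotation", KNOWLEDGE_BASE))
```

Asking the model to cite numbered context entries is one simple way to make its reasoning traceable. A production system would more plausibly use embedding-based retrieval over logs, metrics, and troubleshooting guides rather than word overlap.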
