This work addresses the challenges on-call engineers face when diagnosing cloud service incidents, focusing on the limitations of traditional manual troubleshooting guides and of relying on a single data source. It explores integrating large language models (LLMs) with automated workflows that collect diagnostic information from multiple sources, with the aim of improving root cause analysis accuracy and reducing cognitive load. Retrieval-augmented generation (RAG) is presented as a way to combine an LLM's generative capabilities with external knowledge retrieval, grounding outputs in up-to-date, domain-specific data to reduce hallucinations and improve explainability. An empirical evaluation uses a reinforcement learning environment that simulates production incident triage across three scenarios of increasing diagnostic complexity.
Samnit Mehandiratta studied this question.
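
To make the RAG idea described above concrete, the following is a minimal sketch of the retrieve-then-ground step, not the pipeline evaluated in this work. It assumes diagnostic evidence arrives as short text snippets tagged by source, and it uses simple word overlap as a stand-in for a real embedding-based retriever. All names (`Evidence`, `retrieve`, `build_prompt`, `EXAMPLE_CORPUS`) and the sample snippets are hypothetical, and the call to an LLM is deliberately left out.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Evidence:
    source: str  # hypothetical source tag, e.g. "logs", "metrics", "runbook"
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for an embedding model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: list[Evidence], k: int = 2) -> list[Evidence]:
    """Rank evidence by Jaccard overlap with the query; keep the top k non-zero hits."""
    q = tokenize(query)

    def score(ev: Evidence) -> float:
        t = tokenize(ev.text)
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    ranked = sorted(corpus, key=score, reverse=True)
    return [ev for ev in ranked[:k] if score(ev) > 0]


def build_prompt(incident: str, retrieved: list[Evidence]) -> str:
    """Assemble a prompt that restricts the model to the retrieved evidence."""
    context = "\n".join(f"[{ev.source}] {ev.text}" for ev in retrieved)
    return (
        "Answer using only the evidence below and cite the source tags.\n"
        f"Evidence:\n{context}\n"
        f"Incident: {incident}\n"
        "Likely root cause:"
    )


# Illustrative, made-up snippets from three different sources.
EXAMPLE_CORPUS = [
    Evidence("logs", "checkout API returned 503 errors after the 14:02 deploy"),
    Evidence("metrics", "p99 latency spike on checkout API, database connection pool saturated"),
    Evidence("runbook", "rotate TLS certificates before expiry on the edge proxy"),
]
```

Calling `retrieve("checkout API latency spike", EXAMPLE_CORPUS, k=3)` keeps the metrics and logs snippets and drops the unrelated runbook entry, so the prompt built by `build_prompt` contains only evidence relevant to the incident. Tagging each snippet with its source is what lets a generated answer point back to the data it relied on, which is the explainability benefit the abstract attributes to grounding.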