Modern cloud systems generate vast amounts of operational data, yet triaging incidents and identifying root causes remains a manual and time-consuming task. This article proposes a novel approach to automating incident diagnosis and resolution using Retrieval-Augmented Generation (RAG), which combines Large Language Models (LLMs) with a domain-specific knowledge base built from code artifacts, logs, documentation, and historical tickets. Our system indexes these heterogeneous data sources into a vector database, allowing the LLM to retrieve semantically relevant context before generating a response. This architecture enables the LLM to interpret new backend system errors as they occur and to provide actionable, context-aware remediation suggestions. By continuously ingesting updated artifacts, such as deployment logs, API traces, and recently resolved incidents, the knowledge base evolves in real time, improving the accuracy and relevance of automated recommendations. We demonstrate how this system reduces mean time to resolution by preemptively identifying root causes and offering fixes without requiring human escalation. In controlled experiments in production environments, our prototype autonomously resolved a significant portion of recurring incident types. This work bridges software engineering, operations, and applied AI, providing a blueprint for deploying LLM-powered observability tools that significantly enhance reliability engineering and reduce customer impact in critical cloud services.
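The retrieve-then-generate flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Artifact`, `KnowledgeBase`, `embed`, and `build_prompt` names are hypothetical, the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector database, and the sample artifacts are invented for illustration. The sketch stops at prompt construction and does not call an LLM.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Artifact:
    source: str  # illustrative label such as "log", "ticket", or "doc"
    text: str


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (assumption, replaces a real model)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class KnowledgeBase:
    """In-memory stand-in for a vector database (hypothetical)."""

    def __init__(self) -> None:
        self._items: list[tuple[Counter, Artifact]] = []

    def ingest(self, artifact: Artifact) -> None:
        # Continuous ingestion: new artifacts become retrievable immediately.
        self._items.append((embed(artifact.text), artifact))

    def retrieve(self, query: str, k: int = 2) -> list[Artifact]:
        # Rank all stored artifacts by similarity to the query, keep the top k.
        q = embed(query)
        ranked = sorted(self._items, key=lambda it: cosine(q, it[0]), reverse=True)
        return [artifact for _, artifact in ranked[:k]]


def build_prompt(error: str, context: list[Artifact]) -> str:
    """Assemble the retrieved context and the new error into one LLM prompt."""
    lines = [f"[{a.source}] {a.text}" for a in context]
    return (
        "Context:\n" + "\n".join(lines)
        + f"\n\nNew error:\n{error}\n\n"
        + "Suggest a likely root cause and a remediation."
    )


# Invented sample data, for illustration only.
kb = KnowledgeBase()
kb.ingest(Artifact("ticket", "Checkout API returned 502 after deploy; fixed by rolling back the payment service"))
kb.ingest(Artifact("doc", "Rotate TLS certificates for the ingress gateway every 90 days"))
kb.ingest(Artifact("log", "payment service connection pool exhausted, upstream timeout"))

error = "checkout API 502 upstream timeout from payment service"
hits = kb.retrieve(error, k=2)
prompt = build_prompt(error, hits)
print(prompt)
```

In this toy setup the unrelated certificate document shares no terms with the error and is ranked last, so only the ticket and the log reach the prompt. A production system would presumably swap in a learned embedding model and an approximate nearest-neighbor index, but the control flow would stay the same.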
Akash Goel (Thu,) studied this question.