September 10, 2025 | Open Access

Accelerating Incident Response Using LLM-Based Retrieval-Augmented Generation Systems

Key Points

Automated recommendations from LLMs enhance incident response, reducing mean time to resolution significantly.
The integration of diverse data sources into a knowledge base provides context-aware remediation suggestions.
In controlled experiments in production environments, the prototype autonomously resolved a significant portion of recurring incident types.
The system bridges software engineering and applied AI, aiming to enhance reliability engineering in cloud services.

Abstract

Modern cloud systems generate vast amounts of operational data, yet triaging incidents and identifying root causes remain manual and time-consuming tasks. This article proposes a novel approach to automating incident diagnosis and resolution using Retrieval-Augmented Generation (RAG), which combines Large Language Models (LLMs) with a domain-specific knowledge base built from code artifacts, logs, documentation, and historical tickets. Our system indexes these heterogeneous data sources into a vector database, allowing the LLM to retrieve semantically relevant context before generating a response. This architecture enables the LLM to interpret new backend system errors as they occur and to provide actionable, context-aware remediation suggestions. By continuously ingesting updated artifacts, such as deployment logs, API traces, and recently resolved incidents, the knowledge base evolves in real time, improving the accuracy and relevance of automated recommendations. We demonstrate how this system reduces mean time to resolution by preemptively identifying root causes and offering fixes without requiring human escalation. In controlled experiments in production environments, our prototype autonomously resolved a significant portion of recurring incident types. This work bridges software engineering, operations, and applied AI, providing a blueprint for deploying LLM-powered observability tools that enhance reliability engineering and reduce customer impact in critical cloud services.
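
The retrieve-then-generate flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names Artifact, KnowledgeBase, embed, and build_prompt are hypothetical, the sample tickets and logs are invented for illustration, and a toy bag-of-words vector stands in for a real embedding model and vector database. The sketch only shows the shape of the pipeline: ingest heterogeneous artifacts, rank them by similarity to a new incident, and assemble the retrieved context into a prompt that would then be sent to an LLM.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; an assumed stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


@dataclass
class Artifact:
    source: str  # hypothetical labels such as "ticket", "log", "doc", "code"
    text: str


@dataclass
class KnowledgeBase:
    """In-memory stand-in for the vector database."""
    entries: list = field(default_factory=list)

    def ingest(self, artifact: Artifact) -> None:
        # Can be called at any time, so newly resolved incidents become searchable.
        self.entries.append((embed(artifact.text), artifact))

    def retrieve(self, query: str, k: int = 2) -> list:
        # Rank every stored artifact by similarity to the query and keep the top k.
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [artifact for _, artifact in ranked[:k]]


def build_prompt(incident: str, context: list) -> str:
    """Assemble retrieved context and the incident into one LLM prompt."""
    lines = [f"[{a.source}] {a.text}" for a in context]
    return (
        "Context:\n" + "\n".join(lines)
        + f"\n\nIncident:\n{incident}\n\nSuggest a remediation."
    )


# Invented sample data, for illustration only.
kb = KnowledgeBase()
kb.ingest(Artifact("ticket", "Checkout API returned 502 after deploy; "
                             "fixed by rolling back the payment service config"))
kb.ingest(Artifact("doc", "Runbook: rotate database credentials every 90 days"))
kb.ingest(Artifact("log", "payment service timeout connecting to upstream "
                          "gateway, 502 returned to checkout"))

incident = "Checkout requests failing with 502 from payment service"
top = kb.retrieve(incident, k=2)
prompt = build_prompt(incident, top)
print(prompt)
```

In this sketch the unrelated runbook entry is not retrieved, so the prompt carries only the ticket and log that share terms with the incident. A production system would presumably replace embed with a learned embedding model and KnowledgeBase with a real vector store, and would pass the prompt to an LLM rather than printing it.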
