What question did this study set out to answer?

The aim is to develop an autonomous root cause analysis (RCA) system that leverages both deterministic algorithms and large language models (LLMs) for effective signal processing and hypothesis validation.

May 22, 2026 · Open Access

Automating Root Cause Analysis: An Agentic Framework for Evidence-Led Reasoning over Distributed System Observability

Key Points

The system uses a BFS-driven architecture for signal fetching and correlation.
An agentic loop validates hypotheses, capped at 5 iterations to avoid runaway inference.
Evidence-led reasoning ensures that conclusions rest on verifiable causal relationships.
The RCA system generates and validates hypotheses with LLMs, while deterministic scoring handles the majority of RCA decisions to keep results reliable.
Investigation depth scales with incident size rather than total graph size, which keeps the approach practical on large systems.
The architecture integrates with observability tools such as Prometheus and Jaeger through the Model Context Protocol (MCP).
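
To make the adaptive-scope idea concrete, here is a minimal Python sketch of a breadth-first traversal that stops expanding at healthy services. It is an illustration under stated assumptions, not the authors' implementation: the function name `adaptive_bfs`, the dictionary-based dependency graph, the `is_healthy` callback, and the service names are all hypothetical.

```python
from collections import deque


def adaptive_bfs(graph, start, is_healthy):
    """Breadth-first walk that expands only through unhealthy (suspect) services.

    graph: dict mapping a service name to the services it depends on (assumed shape).
    is_healthy: callable returning True when a service's signals look normal.
    Returns (suspects, pruned), each in visit order.
    """
    suspects, pruned = [], []
    seen = {start}
    queue = deque([start])
    while queue:
        service = queue.popleft()
        if is_healthy(service):
            # Healthy branch: record it and do not visit its dependencies.
            pruned.append(service)
            continue
        suspects.append(service)
        for dep in graph.get(service, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return suspects, pruned


# Hypothetical topology: the alert fires on "checkout".
graph = {
    "checkout": ["payments", "catalog"],
    "payments": ["payments-db"],
    "catalog": ["catalog-db", "search"],
    "payments-db": [],
    "catalog-db": [],
    "search": ["search-index"],
    "search-index": [],
}
unhealthy = {"checkout", "payments", "payments-db"}

suspects, pruned = adaptive_bfs(graph, "checkout", lambda s: s not in unhealthy)
print(suspects)  # ['checkout', 'payments', 'payments-db']
print(pruned)    # ['catalog']
```

In this sketch the three services beneath the healthy `catalog` node are never fetched, so the work done tracks the three-service incident rather than the seven-service graph.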

Abstract

Overview

A reference architecture and working implementation for autonomous root cause analysis (RCA) that combines deterministic signal processing with budgeted LLM reasoning. The system ingests Prometheus alerts, fetches metrics, logs, and traces in parallel, correlates signals deterministically, generates ranked hypotheses via LLM, and validates them with a budgeted agent loop (max 5 iterations).

Core Innovation: The Evidence-Led Reasoning Engine

BFS-Driven Adaptive Scope: The engine dynamically determines investigation depth by pruning healthy dependency branches and expanding only along suspect paths. This ensures investigation scales with the size of the incident, not the total graph size, mirroring expert SRE triage behavior.
Hybrid Reasoning Architecture: Combines Deterministic Correlation (for auditable, hallucination-free signal analysis) with Budgeted LLM Reasoning (for flexible, interpretable hypothesis validation). By using deterministic scoring for the majority of RCA decisions, the system maintains high reliability while leveraging LLMs for complex synthesis.
Observability Abstraction via MCP: Built on the Model Context Protocol (MCP), the system decouples reasoning from specific backends. Signal retrieval is handled via standard tool schemas, enabling modular integration with Prometheus, Jaeger, OpenSearch, and other observability providers.

Key Technical Features

No LLM in the Signal Path: To eliminate hallucination risks and ensure auditability, all signal fetching, normalization, and correlation are handled by deterministic algorithms.
Budgeted Agentic Loop: LLM involvement is strictly confined to hypothesis generation and a bounded (max 5 iterations) validation loop. This prevents "runaway inference" and ensures predictable token costs and latency.
Evidence-Led Correlation: Utilizes timing-based and topological scoring to build auditable causal relationships across microservices, ensuring every conclusion is backed by a verifiable chain of evidence.

Architecture

9-node LangGraph pipeline:
Nodes 1–3: Alert ingestion, LLM analysis, context loading (deterministic)
Node 4: Graph traversal engine (BFS) with parallel signal fetching, normalization, health evaluation
Node 5: Deterministic correlation engine (timing-based causal scoring)
Node 6: LLM hypothesis generation
Node 7: LLM hypothesis validator (budgeted: max 5 iterations)
Nodes 8–9: Deterministic scoring + LLM narrative output

Target Audience

SRE & Platform Engineers: For building auditable, automated triage systems.
AI Researchers: For exploring hybrid agentic architectures in high-stakes operational domains.
Observability Practitioners: For implementing evidence-based RCA that scales with system complexity rather than graph size.

Quick Start

See Appendix B in the paper for a "How to Build Your Own" implementation guide.
Full working implementation: https://github.com/soul-bits/rca-agent

Keywords

Observability, Root Cause Analysis, SRE, AI Agents, Distributed Systems, Model Context Protocol (MCP), AIOps, LangGraph.

Citation

@techreport{gupta2026rca,
  author = {Gupta, Achin and Mahajan, Divya},
  title = {Automating Root Cause Analysis: How to Build an Observability System That Reasons},
  year = {2026},
  month = {April},
  url = {https://zenodo.org/record/19720128},
  doi = {10.5281/zenodo.19720128},
  version = {1.0}
}

License & Attribution

This work is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0). You may share and adapt for any purpose, including commercially, provided appropriate credit is given. If you build on these concepts in academic or professional work, please cite this paper. For details: https://creativecommons.org/licenses/by/4.0/

Authors & Contact

Achin Gupta
ORCID: https://orcid.org/0009-0000-4268-9668
Email: guptaachin01@gmail.com
LinkedIn: https://linkedin.com/in/guptaachin

Divya Mahajan
ORCID: https://orcid.org/0009-0000-1363-481X
Email: dm.divya.mahajan@gmail.com
LinkedIn: https://linkedin.com/in/dm-divyamahajan

For questions about the architecture, implementation, or evaluation, reach out to the authors directly.
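
As a rough illustration of the budgeted validation loop described in the abstract, the following Python sketch caps a hypothesis check at five rounds and returns an explicit "inconclusive" verdict when the budget runs out. Only the limit of 5 comes from the abstract. The names `validate`, `Verdict`, and `step`, the return shapes, and the stub functions standing in for an LLM-plus-tool round are assumptions made for this example and are not taken from the paper or its repository.

```python
from dataclasses import dataclass, field

MAX_ITERATIONS = 5  # budget stated in the abstract


@dataclass
class Verdict:
    hypothesis: str
    status: str  # "confirmed", "rejected", or "inconclusive"
    iterations_used: int
    evidence: list = field(default_factory=list)


def validate(hypothesis, step, max_iterations=MAX_ITERATIONS):
    """Call `step` until it reaches a decision or the budget is spent.

    step(hypothesis, evidence) is a hypothetical stand-in for one validation
    round. It returns (decision, new_evidence), where decision is
    "confirmed", "rejected", or None when more evidence is needed.
    """
    evidence = []
    for i in range(1, max_iterations + 1):
        decision, new_evidence = step(hypothesis, evidence)
        evidence.extend(new_evidence)
        if decision is not None:
            return Verdict(hypothesis, decision, i, evidence)
    # Budget exhausted: stop rather than keep querying.
    return Verdict(hypothesis, "inconclusive", max_iterations, evidence)


# Stub rounds for illustration only; a real round would call an LLM and tools.
def never_decides(hypothesis, evidence):
    return None, [f"probe {len(evidence) + 1}"]


def decides_on_third(hypothesis, evidence):
    if len(evidence) >= 2:
        return "confirmed", ["latency spike precedes errors"]
    return None, [f"probe {len(evidence) + 1}"]


stalled = validate("payments-db connection pool exhausted", never_decides)
resolved = validate("payments-db connection pool exhausted", decides_on_third)
print(stalled.status, stalled.iterations_used)    # inconclusive 5
print(resolved.status, resolved.iterations_used)  # confirmed 3
```

Because the loop is a plain `for` over a fixed range, the number of rounds (and with it the worst-case token cost) is bounded regardless of what each round returns.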
