Overview

A reference architecture and working implementation for autonomous root cause analysis (RCA) that combines deterministic signal processing with budgeted LLM reasoning. The system ingests Prometheus alerts, fetches metrics, logs, and traces in parallel, correlates signals deterministically, generates ranked hypotheses via LLM, and validates them with a budgeted agent loop (max 5 iterations).

Core Innovation: The Evidence-Led Reasoning Engine

BFS-Driven Adaptive Scope: The engine dynamically determines investigation depth by pruning healthy dependency branches and expanding only along suspect paths. This ensures investigation scales with the size of the incident, not the total graph size, mirroring expert SRE triage behavior.

Hybrid Reasoning Architecture: Combines Deterministic Correlation (for auditable, hallucination-free signal analysis) with Budgeted LLM Reasoning (for flexible, interpretable hypothesis validation). By using deterministic scoring for the majority of RCA decisions, the system maintains high reliability while leveraging LLMs for complex synthesis.

Observability Abstraction via MCP: Built on the Model Context Protocol (MCP), the system decouples reasoning from specific backends. Signal retrieval is handled via standard tool schemas, enabling modular integration with Prometheus, Jaeger, OpenSearch, and other observability providers.

Key Technical Features

No LLM in the Signal Path: To eliminate hallucination risks and ensure auditability, all signal fetching, normalization, and correlation are handled by deterministic algorithms.

Budgeted Agentic Loop: LLM involvement is strictly confined to hypothesis generation and a bounded (max 5 iterations) validation loop. This prevents "runaway inference" and ensures predictable token costs and latency.

Evidence-Led Correlation: Utilizes timing-based and topological scoring to build auditable causal relationships across microservices, ensuring every conclusion is backed by a verifiable chain of evidence.

Architecture

9-node LangGraph pipeline:

Nodes 1–3: Alert ingestion, LLM analysis, context loading (deterministic)
Node 4: Graph traversal engine (BFS) with parallel signal fetching, normalization, health evaluation
Node 5: Deterministic correlation engine (timing-based causal scoring)
Node 6: LLM hypothesis generation
Node 7: LLM hypothesis validator (budgeted: max 5 iterations)
Nodes 8–9: Deterministic scoring + LLM narrative output

Target Audience

SRE & Platform Engineers: For building auditable, automated triage systems.
AI Researchers: For exploring hybrid agentic architectures in high-stakes operational domains.
Observability Practitioners: For implementing evidence-based RCA that scales with system complexity rather than graph size.

Quick Start

See Appendix B in the paper for a "How to Build Your Own" implementation guide.
Full working implementation: https://github.com/soul-bits/rca-agent

Keywords

Observability, Root Cause Analysis, SRE, AI Agents, Distributed Systems, Model Context Protocol (MCP), AIOps, LangGraph.

Citation

@techreport{gupta2026rca,
  author  = {Gupta, Achin and Mahajan, Divya},
  title   = {Automating Root Cause Analysis: How to Build an Observability System That Reasons},
  year    = {2026},
  month   = {April},
  url     = {https://zenodo.org/record/19720128},
  doi     = {10.5281/zenodo.19720128},
  version = {1.0}
}

License & Attribution

This work is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0). You may share and adapt for any purpose, including commercially, provided appropriate credit is given. If you build on these concepts in academic or professional work, please cite this paper. For details: https://creativecommons.org/licenses/by/4.0/

Authors & Contact

Achin Gupta
ORCID: https://orcid.org/0009-0000-4268-9668
Email: guptaachin01@gmail.com
LinkedIn: https://linkedin.com/in/guptaachin

Divya Mahajan
ORCID: https://orcid.org/0009-0000-1363-481X
Email: dm.divya.mahajan@gmail.com
LinkedIn: https://linkedin.com/in/dm-divyamahajan

For questions about the architecture, implementation, or evaluation, reach out to the authors directly.
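
Illustrative Sketch: Adaptive BFS Scope

The BFS-driven adaptive scope described under Core Innovation can be illustrated with a minimal sketch. This is not the authors' implementation: the function name adaptive_bfs, the adjacency-list graph shape, the is_healthy predicate, and the example service names are all assumptions made for illustration. It shows only the pruning idea, where a healthy dependency is not expanded, so the work done tracks the unhealthy region rather than the whole graph.

```python
from collections import deque
from typing import Callable, Dict, List, Set


def adaptive_bfs(
    graph: Dict[str, List[str]],
    start: str,
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Return suspect services reachable from `start` via suspect paths only.

    Hypothetical helper: `graph` maps a service to its direct dependencies,
    and `is_healthy` stands in for whatever health evaluation is available.
    """
    suspects: List[str] = []
    visited: Set[str] = {start}
    queue = deque([start])

    while queue:
        service = queue.popleft()
        # The alerting service is always investigated; any other service
        # is investigated only if it looks unhealthy.
        if service != start and is_healthy(service):
            continue  # prune: never expand below a healthy branch
        suspects.append(service)
        for dep in graph.get(service, []):
            if dep not in visited:
                visited.add(dep)
                queue.append(dep)

    return suspects


# Assumed example topology: checkout depends on payments and cart, and so on.
graph = {
    "checkout": ["payments", "cart"],
    "payments": ["db"],
    "cart": ["cache"],
    "db": [],
    "cache": [],
}
unhealthy = {"checkout", "payments", "db"}

result = adaptive_bfs(graph, "checkout", lambda s: s not in unhealthy)
print(result)  # ['checkout', 'payments', 'db']
```

In this example "cart" is healthy, so its dependency "cache" is never visited. The traversal cost depends on the three suspect services, not on the five services in the graph.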