Modern cloud-native systems generate massive volumes of heterogeneous logs across services, containers, and infrastructure layers, making production incident investigation increasingly time-consuming and error-prone. Traditional log analysis workflows rely on keyword search, dashboards, and manual correlation, which often fail to capture semantic relationships across distributed components and lead to prolonged Mean Time to Resolution (MTTR). This paper presents an industry-deployed, agentic large language model (LLM)-assisted log analysis system designed to reduce MTTR in large-scale production environments. The system combines structured logging, semantic vector embeddings, and Retrieval-Augmented Generation (RAG) with an iterative agentic reasoning loop that models incident investigation as a hypothesis-driven process. Rather than performing one-shot inference, the system generates hypotheses, issues targeted follow-up queries, refines evidence, and produces grounded root-cause explanations with human-in-the-loop oversight. We describe the end-to-end architecture, including log ingestion and normalization, correlation-aware indexing, JSON-path flattening for structured payloads, semantic retrieval using approximate nearest-neighbor search, and guardrails for cost control and hallucination mitigation. Through representative production case studies, we demonstrate reductions in manual log inspection effort and significant improvements in time to initial hypothesis and overall MTTR compared to traditional workflows. This work highlights that effective operational use of LLMs depends less on model novelty and more on system design choices such as data quality, retrieval grounding, and constrained agentic reasoning. The paper concludes with lessons learned from production deployment and outlines future directions toward performance optimization, incremental domain adaptation, and safe extensions toward autonomous remediation.
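The abstract names JSON-path flattening for structured payloads but does not say how it is done. The following is a minimal sketch of one common approach, not the authors' implementation. The function names `flatten_json` and `to_embedding_text`, the `$`-rooted path syntax, and the sample log event are all hypothetical and chosen for illustration only.

```python
from typing import Any, Dict


def flatten_json(payload: Any, prefix: str = "$") -> Dict[str, Any]:
    """Flatten nested dicts and lists into {json_path: leaf_value} pairs.

    Hypothetical helper: the path syntax ($.a.b[0]) is an assumption,
    not something specified by the paper.
    """
    flat: Dict[str, Any] = {}
    if isinstance(payload, dict):
        if not payload:
            flat[prefix] = {}  # keep empty objects visible
        for key, value in payload.items():
            flat.update(flatten_json(value, f"{prefix}.{key}"))
    elif isinstance(payload, list):
        if not payload:
            flat[prefix] = []  # keep empty arrays visible
        for index, value in enumerate(payload):
            flat.update(flatten_json(value, f"{prefix}[{index}]"))
    else:
        # Scalar leaf (str, int, float, bool, None): record it under its path.
        flat[prefix] = payload
    return flat


def to_embedding_text(flat: Dict[str, Any]) -> str:
    """Render flattened pairs as sorted 'path = value' lines for an embedder."""
    return "\n".join(f"{path} = {value}" for path, value in sorted(flat.items()))


# Illustrative log event (invented for this sketch).
log_event = {
    "service": "checkout",
    "level": "ERROR",
    "error": {"code": 504, "upstream": ["payments", "ledger"]},
}

flat_event = flatten_json(log_event)
embedding_text = to_embedding_text(flat_event)
```

Flattening in this style gives every nested field a stable, addressable key, so that a field such as `$.error.code` can be indexed or embedded alongside the free-text message rather than being buried inside an opaque payload string.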
Saptarshi Niyogi
Saptarshi Niyogi studied this question.
www.synapsesocial.com/papers/6a0ff3d9d674f7c03778cb20 — DOI: https://doi.org/10.5281/zenodo.20301048