What question did this study set out to answer?

This work aims to develop a log analysis system that leverages large language models to streamline incident investigation and reduce MTTR.

May 22, 2026 | Open Access

An Agentic LLM-Assisted Log Analysis System for MTTR Reduction in Cloud-Native Production Environments


Key Points

Developed an agentic LLM-assisted log analysis system deployed in production environments.
Utilized semantic vector embeddings and Retrieval-Augmented Generation for effective log analysis.
Incorporated human-in-the-loop oversight for refining hypotheses and explanations.
Achieved significant reductions in manual log inspection effort compared to traditional workflows.
Demonstrated improvement in time to initial hypothesis generation and overall MTTR.
The MTTR reductions underline the value of structured logging and systematic, hypothesis-driven incident investigation.
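
As a rough illustration of the retrieval step named in the key points, the sketch below ranks log lines by similarity to a query and assembles a grounded prompt. It is not the paper's implementation: `embed` is a toy bag-of-words stand-in for a learned embedding model, the search is exact brute force rather than approximate nearest-neighbor, and the names `embed`, `cosine`, `retrieve`, and `build_prompt` are all hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumption standing in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, logs, k=2):
    """Return the k log lines most similar to the query (exact search, not ANN)."""
    q = embed(query)
    ranked = sorted(logs, key=lambda line: cosine(q, embed(line)), reverse=True)
    return ranked[:k]


def build_prompt(query, evidence):
    """Assemble a prompt that restricts the model to the retrieved evidence."""
    lines = "\n".join(f"- {line}" for line in evidence)
    return f"Question: {query}\nAnswer using only these log lines:\n{lines}"


logs = [
    "payment service timeout calling ledger",
    "user login succeeded",
    "ledger database connection timeout",
]
top = retrieve("ledger timeout", logs, k=2)
prompt = build_prompt("ledger timeout", top)
```

In a production setting the brute-force sort would be replaced by a vector index, and the prompt would be sent to an LLM whose answer is then reviewed by an engineer.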

Abstract

Modern cloud-native systems generate massive volumes of heterogeneous logs across services, containers, and infrastructure layers, making production incident investigation increasingly time-consuming and error-prone. Traditional log analysis workflows rely on keyword search, dashboards, and manual correlation, which often fail to capture semantic relationships across distributed components and lead to prolonged Mean Time to Resolution (MTTR). This paper presents an industry-deployed, agentic large language model (LLM)-assisted log analysis system designed to reduce MTTR in large-scale production environments. The system combines structured logging, semantic vector embeddings, and Retrieval-Augmented Generation (RAG) with an iterative agentic reasoning loop that models incident investigation as a hypothesis-driven process. Rather than performing one-shot inference, the system generates hypotheses, issues targeted follow-up queries, refines evidence, and produces grounded root-cause explanations with human-in-the-loop oversight. We describe the end-to-end architecture, including log ingestion and normalization, correlation-aware indexing, JSON-path flattening for structured payloads, semantic retrieval using approximate nearest-neighbor search, and guardrails for cost control and hallucination mitigation. Through representative production case studies, we demonstrate reductions in manual log inspection effort and significant improvements in time to initial hypothesis and overall MTTR compared to traditional workflows. This work highlights that effective operational use of LLMs depends less on model novelty and more on system design choices such as data quality, retrieval grounding, and constrained agentic reasoning. The paper concludes with lessons learned from production deployment and outlines future directions toward performance optimization, incremental domain adaptation, and safe extensions toward autonomous remediation.
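
The abstract mentions JSON-path flattening for structured payloads without giving details. One plausible reading, offered here only as a minimal sketch under that assumption, is to turn each nested log record into flat path/value pairs so that every leaf field can be indexed on its own. The function name `flatten` and the `$`-rooted path notation are illustrative choices, not taken from the paper.

```python
import json


def flatten(value, prefix="$"):
    """Yield (path, scalar) pairs for every leaf of a parsed JSON value.

    Empty objects and arrays contribute no pairs in this sketch.
    """
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}")
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{prefix}[{index}]")
    else:
        yield prefix, value


# Hypothetical log payload, invented for illustration.
record = json.loads(
    '{"service": "checkout", "error": {"code": 503, "tags": ["retry", "upstream"]}}'
)
flat = dict(flatten(record))
```

Each flattened pair, such as `$.error.code` with value `503`, can then be indexed or embedded separately, which keeps deeply nested fields searchable.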


Authors

Saptarshi Niyogi
