With the rapid evolution of cloud-native platforms, microservice-based systems have become increasingly large-scale and complex, making fast and accurate root cause localization and recovery a critical challenge. Runtime signals in such systems are inherently multimodal—combining metrics, logs, and traces—and are intertwined through deep, dynamic service dependencies, which often leads to noisy alerts, ambiguous fault propagation paths, and brittle, manually curated recovery playbooks. To address these issues, we propose GALR, a graph- and LLM-based framework for root cause localization and recovery in microservice-based business middle platforms. GALR first constructs a multimodal service call graph by fusing time-series metrics, structured logs, and trace-derived topology, and employs a GAT-based root cause analysis module with temporal-aware edge attention to model failure propagation. On top of this, an LLM-based node enhancement mechanism infers anomaly, normal, and uncertainty scores from log contexts and injects them into node representations and attention bias terms, improving robustness under noisy or incomplete signals. Finally, GALR integrates a retrieval-augmented LLM agent that retrieves similar historical cases and generates executable recovery strategies, with consistency checking against expert-standard playbooks to ensure safety and reproducibility. Extensive experiments on three representative microservice datasets demonstrate that GALR consistently achieves superior Top-k accuracy and mean reciprocal rank for root cause localization, while the retrieval-augmented agent yields substantially more accurate and actionable recovery plans compared with graph-only and LLM-only baselines, providing a practical closed-loop solution from anomaly perception to recovery execution.
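The two localization metrics named above, Top-k accuracy and mean reciprocal rank (MRR), can be illustrated with a minimal sketch. It assumes each incident yields a ranked list of candidate services and has exactly one ground-truth root cause; the function names and the service names (`cart`, `db`, `auth`) are hypothetical and are not taken from GALR or its datasets.

```python
from typing import Sequence


def top_k_accuracy(rankings: Sequence[Sequence[str]],
                   truths: Sequence[str], k: int) -> float:
    """Fraction of incidents whose true root cause is among the first k candidates."""
    if len(rankings) != len(truths):
        raise ValueError("rankings and truths must have the same length")
    if not truths:
        return 0.0
    hits = sum(1 for ranked, truth in zip(rankings, truths) if truth in ranked[:k])
    return hits / len(truths)


def mean_reciprocal_rank(rankings: Sequence[Sequence[str]],
                         truths: Sequence[str]) -> float:
    """Average of 1/rank of the true root cause; an incident contributes 0 if it is absent."""
    if len(rankings) != len(truths):
        raise ValueError("rankings and truths must have the same length")
    if not truths:
        return 0.0
    total = 0.0
    for ranked, truth in zip(rankings, truths):
        if truth in ranked:
            total += 1.0 / (list(ranked).index(truth) + 1)  # ranks are 1-based
    return total / len(truths)


# Hypothetical example: three incidents, all caused by the "cart" service,
# which the ranker places at positions 1, 2, and 3 respectively.
rankings = [
    ["cart", "db", "auth"],
    ["auth", "cart", "db"],
    ["db", "auth", "cart"],
]
truths = ["cart", "cart", "cart"]

top1 = top_k_accuracy(rankings, truths, k=1)   # 1/3
top3 = top_k_accuracy(rankings, truths, k=3)   # 1.0
mrr = mean_reciprocal_rank(rankings, truths)   # (1 + 1/2 + 1/3) / 3 = 11/18
```

Top-k accuracy rewards any placement within the first k positions equally, whereas MRR also distinguishes how high the true root cause is ranked, which is why the two are commonly reported together.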
Zhang et al. (Mon,) studied this question.