This article presents an autonomous resilience framework for cloud computing environments that leverages reinforcement learning (RL) to enable self-healing capabilities. The framework embeds intelligent agents throughout the cloud stack to continuously monitor system health, detect anomalies, and automatically implement remediation actions without human intervention. Drawing inspiration from biological self-healing systems, the approach creates a distributed intelligence architecture that transforms cloud management from reactive to proactive operations. The system employs a comprehensive simulation environment for training RL agents, a carefully engineered multi-dimensional reward function, and a hierarchical decision-making framework. Extensive evaluation through both simulation and real-world testbed experiments demonstrates significant improvements in incident detection and recovery times, root cause identification accuracy, service availability during attacks, and overall operational efficiency. The framework exhibits emergent adaptive behaviors, including anticipatory actions that preemptively address potential failures before they impact service delivery, representing a paradigm shift in cloud infrastructure resilience.
Rakesh Kumar Gouri Neni studied this question.