Enterprise process automation systems are increasingly exposed to internal control deficiencies caused by misconfigurations, resource bottlenecks, and software anomalies. Traditional fault detection and recovery systems lack scalability, interpretability, and real-time responsiveness, so they cannot meet the requirements of current cloud-based environments. The proposed methodology consists of three components: fault prediction with bidirectional long short-term memory (Bi-LSTM) networks, interpretable root cause analysis (RCA) based on SHapley Additive exPlanations (SHAP), and a hybrid self-healing engine that combines rule-based logic with reinforcement learning (RL). The framework is trained and tested on the Aliyun Cloud Fault Dataset, which provides detailed temporal and structural fault traces from large-scale enterprise cloud clusters. Preprocessing comprises KNN imputation, Min–Max scaling, and one-hot encoding, followed by feature extraction along statistical, temporal, event-pattern, and graph-based dimensions. The Bi-LSTM model captures both forward and backward temporal dependencies, enabling precise fault classification and severity scoring. SHAP then ranks the features by their contribution to the prediction, and the self-healing engine executes corrective actions automatically while learning to handle new fault conditions through RL feedback loops. Experiments show a fault detection accuracy of 98%, precision of 97%, recall of 99%, and a healing success rate of 90%. The RL agent converges rapidly and generalizes across different fault episodes. The system offers an auditable, adaptive, and scalable enterprise fault management solution that significantly reduces downtime and human effort.
Future work will extend the solution to multi-cloud environments and implement TinyML agents for edge deployment.
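The hybrid self-healing idea described above (fixed rules for known faults, learned behaviour for new ones) can be illustrated with a minimal sketch. This is not the authors' implementation: the fault names, action names, rule table, reward values, and the toy environment are all hypothetical, and the learning step is a simplified one-step, bandit-style Q-update rather than a full RL formulation with state transitions.

```python
import random
from collections import defaultdict

# Hypothetical corrective actions and rule table (illustrative, not from the paper).
ACTIONS = ["restart_service", "scale_out", "rollback_config"]
RULES = {"config_error": "rollback_config"}  # known faults handled by fixed rules


class HybridHealer:
    """Rule lookup first; tabular epsilon-greedy learning for unknown faults."""

    def __init__(self, alpha=0.5, epsilon=0.1, seed=0):
        self.alpha = alpha            # learning rate
        self.epsilon = epsilon        # exploration probability
        self.rng = random.Random(seed)
        self.q = defaultdict(float)   # (fault, action) -> estimated reward

    def choose(self, fault):
        # Rule-based path: deterministic action for faults with a known fix.
        if fault in RULES:
            return RULES[fault]
        # Learned path: explore occasionally, otherwise take the best-known action.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(fault, a)])

    def learn(self, fault, action, reward):
        # One-step update: each healing attempt is treated as a single-step episode.
        key = (fault, action)
        self.q[key] += self.alpha * (reward - self.q[key])


def simulate(healer, episodes=200):
    # Toy environment: assumed ground truth for which action heals which fault.
    truth = {"memory_leak": "restart_service", "cpu_saturation": "scale_out"}
    for _ in range(episodes):
        fault = healer.rng.choice(list(truth))
        action = healer.choose(fault)
        reward = 1.0 if action == truth[fault] else -1.0  # healed vs. not healed
        healer.learn(fault, action, reward)
    return healer
```

Calling `simulate(HybridHealer())` trains the agent on the toy faults; setting `epsilon` to `0.0` afterwards makes `choose` return the greedy action only. Keeping the rule table in front of the learner means known faults always get a predictable, auditable response, while exploration is confined to faults that have no rule.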
Wang et al. (Wed,) studied this question.