Traditional Cloud Disaster Recovery (CDR) adopts a “fail-first, then recover” paradigm, leading to prolonged Recovery Time Objective (RTO) and Recovery Point Objective (RPO) values that no longer meet enterprises’ stringent service continuity requirements. To address this gap, this paper proposes CDR-TSP, a CDR framework based on time series prediction and built around an Encoder-Decoder Long Short-Term Memory (ED-LSTM) model. The ED-LSTM model captures nonlinear characteristics and long-term dependencies in time-series monitoring data (e.g., Central Processing Unit (CPU) utilization, network latency) to enable proactive fault perception. CDR-TSP integrates a primary-standby architecture, a fault perception system, and a programmable workflow engine to realize automated pre-failure protection and rapid resource orchestration. Experiments are conducted on five open-source datasets and two real-world cloud production datasets. Results show that ED-LSTM outperforms both traditional statistical models and other deep learning methods in prediction accuracy, achieving the lowest Mean Squared Error (MSE) and Mean Absolute Error (MAE) across all datasets. In disaster recovery tests covering four typical fault scenarios (downtime, network disconnection, resource exhaustion, and periodic downtime), CDR-TSP reduces RTO by 65%–90% compared with manual recovery and achieves near-zero RTO for periodic faults. This work shifts CDR from passive to active protection, significantly enhancing the disaster recovery resilience of business systems.
Meng et al. (Thu,) studied this question.
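
To make the evaluation setup described above concrete, the following is a minimal sketch of how a monitoring series might be cut into (input window, forecast horizon) pairs for an encoder-decoder forecaster, and how MSE and MAE are computed over the resulting forecasts. It is not the authors' implementation: the function names, the window sizes, the CPU utilization values, and the persistence predictor (which merely stands in for a trained ED-LSTM) are all illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def make_windows(
    series: Sequence[float], n_in: int, n_out: int
) -> List[Tuple[List[float], List[float]]]:
    """Slide over the series and return (history, future) pairs.

    `history` plays the role of the encoder input and `future` the
    decoder target. Window sizes are assumptions, not paper settings.
    """
    pairs = []
    for start in range(len(series) - n_in - n_out + 1):
        history = list(series[start:start + n_in])
        future = list(series[start + n_in:start + n_in + n_out])
        pairs.append((history, future))
    return pairs


def mse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Squared Error over paired values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Error over paired values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def persistence_forecast(history: Sequence[float], n_out: int) -> List[float]:
    """Hypothetical placeholder predictor: repeat the last observed value.

    A trained forecasting model would replace this function.
    """
    return [history[-1]] * n_out


# Made-up CPU utilization samples (percent), for illustration only.
cpu_util = [40.0, 42.0, 45.0, 50.0, 58.0, 70.0, 85.0, 95.0]
pairs = make_windows(cpu_util, n_in=4, n_out=2)

# Flatten all targets and forecasts, then score them together.
actual: List[float] = []
predicted: List[float] = []
for history, future in pairs:
    actual.extend(future)
    predicted.extend(persistence_forecast(history, n_out=len(future)))

mse_value = mse(actual, predicted)
mae_value = mae(actual, predicted)
print(f"MSE={mse_value:.2f}  MAE={mae_value:.2f}")
```

In this sketch the persistence predictor lags badly on a rising series, which is the kind of trend a sequence model is meant to anticipate; lower MSE and MAE on the same windows would indicate a better forecaster.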