Abstract

This paper reviews an advanced single-version fault-tolerant computing framework that leverages artificial intelligence (AI) to significantly boost fault resilience in distributed systems. Expanding on the foundational software redundancy approach [1], the new architecture integrates agent-based monitoring, predictive machine learning, reinforcement learning, and federated fault modeling. The fault detection and recovery processes are detailed through algorithms and flowcharts. Experimental results in smart grids, robotics, and edge computing showcase enhanced prediction accuracy, autonomous fault recovery, and greater scalability. This solution provides a cost-efficient, intelligent alternative to conventional multiversion and hardware-dependent fault tolerance methods.
Goutam Saha (Wed,) studied this question.