Many computing workloads now run in loosely coupled, geographically distributed environments whose resources are owned by different organizations. Examples include inter-institutional research infrastructures, community-operated clusters, and edge deployments. Because disconnections are frequent in such environments, ensuring reliable task execution remains a fundamental challenge. Kubernetes, the de facto standard for cluster orchestration, provides centralized control and strong consistency, but recovers slowly when node failures occur frequently. At the opposite extreme, blockchain-based orchestration removes centralized control but incurs substantial latency due to global consensus, making it unsuitable for time-sensitive task scheduling. This paper presents Mutual Cloud, a decentralized orchestration framework that operates between these two extremes. Mutual Cloud adopts a hybrid architecture: task admission and queue management are handled centrally, as in conventional public clouds, whereas most scheduling functions, including execution-node selection and failure handling, are performed in a decentralized manner by autonomous agents using a distributed hash table. We implement a prototype of Mutual Cloud and evaluate its performance through large-scale simulations. The results show that Mutual Cloud maintains stable performance comparable to centralized baselines under normal conditions, while achieving recovery latency on the order of five seconds under substantial node failures.
Keum et al. (Thu,) studied this question.
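To make the idea of hash-table-based node selection with local failure handling concrete, the following is a minimal sketch and not the Mutual Cloud implementation. It assumes a consistent-hash ring as a stand-in for the distributed hash table; the names `HashRing`, `ring_position`, `select_node`, and `mark_failed`, as well as the node and task identifiers, are hypothetical and do not come from the paper. A task is assigned to the first live node at or after its hash position, so a failed node is bypassed without any global coordination.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string key to a position on a 2**32 hash ring (illustrative choice)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class HashRing:
    """Hypothetical consistent-hash ring standing in for a DHT lookup."""

    def __init__(self, node_ids):
        # Sort nodes by ring position so successor lookup is a binary search.
        self._ring = sorted((ring_position(n), n) for n in node_ids)
        self._positions = [pos for pos, _ in self._ring]
        self._failed = set()

    def mark_failed(self, node_id: str) -> None:
        """Record that a node is unreachable (assumed to be detected elsewhere)."""
        self._failed.add(node_id)

    def select_node(self, task_id: str):
        """Return the first live node at or after the task's position, or None."""
        if not self._ring:
            return None
        start = bisect.bisect_left(self._positions, ring_position(task_id))
        # Walk the ring clockwise, wrapping around, skipping failed nodes.
        for offset in range(len(self._ring)):
            _, node = self._ring[(start + offset) % len(self._ring)]
            if node not in self._failed:
                return node
        return None


if __name__ == "__main__":
    ring = HashRing(["node-a", "node-b", "node-c"])
    primary = ring.select_node("task-42")
    ring.mark_failed(primary)
    fallback = ring.select_node("task-42")
    print(primary, "->", fallback)
```

In this sketch, reassignment after a failure requires only a local lookup, which illustrates why a hash-table-based design can avoid the latency of global consensus. How Mutual Cloud agents actually detect failures and reassign tasks is not specified in the abstract.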