June 17, 2007

Proactive fault tolerance for HPC with Xen virtualization

Abstract

Large-scale parallel computing increasingly relies on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current fault-tolerance techniques focus on reactive schemes that recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today's systems, node failures can often be anticipated by detecting a deteriorating health status.
