Key points are not available for this paper at this time.
We present a new technique for enabling computations to survive errors and faults while providing a bound on any resulting output distortion. A developer using the technique first partitions the computation into tasks. The execution platform then simply discards any task that encounters an error or a fault and completes the computation by executing any remaining tasks. This technique can substantially improve the robustness of the computation in the face of errors and faults. A potential concern is that discarding tasks may change the result that the computation produces.Our technique randomly samples executions of the program at varying task failure rates to obtain a quantitative, probabilistic model that characterizes the distortion of the output as a function of the task failure rates. By providing probabilistic bounds on the distortion, the model allows users to confidently accept results produced by executions with failures as long as the distortion falls within acceptable bounds. This approach may prove to be especially useful for enabling computations to successfully survive hardware failures in distributed computing environments.Our technique also produces a timing model that characterizes the execution time as a function of the task failure rates. The combination of the distortion and timing models quantifies an accuracy/execution time tradeoff. It therefore enables the development of techniques that purposefully fail tasks to reduce the execution time while keeping the distortion within acceptable bounds.
Building similarity graph...
Analyzing shared references across papers
Loading...
Massachusetts Institute of Technology
Add This Paper to Your Research Feed
Any time a new paper drops it will be there.
Martin Rinard (Wed,) studied this question.
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: