Abstract—In this paper, we describe the design and implementation of two mechanisms for fault tolerance and recovery of complex scientific workflows on computational grids. We present our algorithms for over-provisioning and migration, our two primary fault-tolerance strategies. To select the appropriate strategy, we consider application performance models, resource reliability models, network latency and bandwidth, and batch-queue wait times on compute resources. Our goal is to balance reliability and performance in the presence of soft real-time constraints, such as deadlines and expected success probabilities, in a way that is transparent to scientists. We have evaluated our strategies by developing a Fault-Tolerance and Recovery (FTR) service and deploying it as part of the Linked Environments for Atmospheric Discovery (LEAD) production infrastructure. Results from real usage scenarios in LEAD show that our fault-tolerance strategies reduce the failure rate of individual workflow steps from about 30% to 5%.

Index Terms—Computational grids, fault tolerance and recovery, resilient scientific workflows, scheduling.

I. INTRODUCTION

Large and complex scientific workflows rely on computational grids to satisfy their massive computational and data requirements. With the increasing heterogeneity and complexity of computational grids, executing large scientific workflows reliably becomes a challenge. Although the mean time to failure of any single entity in a computational grid is high, the large number of entities in a grid (hardware, networks, software, grid middleware, core services, etc.) means that the grid as a whole will fail frequently. For example, in [1], the authors studied failure data collected over nine years from several high-performance computing systems operated by Los Alamos National Laboratory (LANL).
Although failure rates varied from 0.1 to 3 failures per processor per year, systems with 4096 processors averaged as many as 3 failures per day. Thus, although the number of failures per processor is relatively low, the aggregate reliability of a system clearly deteriorates as the number of processors increases. Since failure rates are roughly proportional to the number of processors in the system, a computational grid with over 112,000 processors, such as the TeraGrid [2], [3], would experience a failure every two minutes. Moreover, unlike the LANL infrastructure, which is high priority, very expensive, and tightly controlled, with substantial resources for its maintenance, a computational grid is also susceptible to failures at the grid middleware level: the software and services that tie together the heterogeneous computing systems on a grid. Failure will therefore be the norm rather than the exception, and workflow execution systems must be designed to execute workflows in a fault-tolerant manner.

Current fault-tolerance and recovery strategies such as FT-MPI [4] can regularly checkpoint a workflow step, which is usually a parallel scientific application, and, in case of a failure, restart the application from the last good checkpoint. However, many scientific workflows, such as those used in LEAD [5], [6], are deadline driven and must complete with a minimum success probability. Moreover, complex scientific workflows may assimilate large input data sets from a distributed set of data sources, so moving the computation closer to the data sources could minimize the total time to completion of the workflow. Finally, dynamic scientific workflows change their configuration rapidly and automatically in response to external events or decision-driven inputs from scientists, and can initiate other workflows automatically.
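The scaling argument above can be checked with a back-of-the-envelope calculation. The sketch below assumes failures arrive independently at each processor, so the system-wide failure rate scales linearly with processor count; the specific per-processor rates (0.27 and 2.3 failures per year) are illustrative values chosen from within the range the LANL study reports, not figures from this paper.

```python
# Back-of-the-envelope check: aggregate failure rate grows linearly
# with processor count if per-processor failures are independent.

MINUTES_PER_YEAR = 365 * 24 * 60

def mean_minutes_between_failures(n_processors, failures_per_proc_per_year):
    """Mean time between failures (in minutes) for the whole system."""
    system_failures_per_year = n_processors * failures_per_proc_per_year
    return MINUTES_PER_YEAR / system_failures_per_year

# A 4096-processor system at ~0.27 failures/processor/year:
print(4096 * 0.27 / 365)                            # roughly 3 failures/day

# A TeraGrid-scale system: 112,000 processors at ~2.3 failures/processor/year:
print(mean_minutes_between_failures(112_000, 2.3))  # roughly 2 minutes
```

Both figures quoted in the text (about 3 failures per day at 4096 processors, and a failure roughly every two minutes at TeraGrid scale) are consistent with per-processor rates inside the reported 0.1–3 range.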
All these factors clearly call for new approaches to scheduling and fault tolerance for large and complex scientific workflows on computational grids, beyond what current technologies offer. The main contributions of this paper are:

- an implementation of fault-tolerance and recovery strategies for workflows on computational grids;
- an evaluation of these strategies using the LEAD infrastructure for running dynamic and adaptive weather-forecasting workflows;
- algorithms to find the degree of over-provisioning and the length of the migration path under the constraints of deadline and success probability; and
- results from real usage examples in LEAD.

The rest of this paper is organized as follows. In Section II we give an overview of our fault-tolerance and recovery service and discuss our two primary fault-tolerance strategies, namely over-provisioning and migration. In Section III we present the results of our evaluation of the FTR service. In Section IV we discuss related work, and in Section V we present our conclusions and future work.

Fault Tolerance and Recovery of Scientific Workflows on Computational Grids
Gopi Kandaswamy, Anirban Mandal, and Daniel A. Reed
Renaissance Computing Institute, University of North Carolina, Chapel Hill, NC 27517, USA
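To give intuition for the degree of over-provisioning discussed in the contributions above: if a workflow step is replicated across independent resources, the number of replicas needed follows from elementary probability. The sketch below is an illustrative calculation under an independence assumption, not the paper's algorithm, which additionally considers performance models, resource reliability, and queue wait times; the function name is hypothetical.

```python
import math

def overprovisioning_degree(p_success, target_probability):
    """Smallest number of independent replicas k such that the probability
    that at least one succeeds, 1 - (1 - p)^k, meets the target."""
    if p_success >= 1.0:
        return 1
    k = math.log(1.0 - target_probability) / math.log(1.0 - p_success)
    return max(1, math.ceil(k))

# A step that succeeds 70% of the time, with a 99% success-probability target:
print(overprovisioning_degree(0.70, 0.99))  # 4 replicas: 1 - 0.3**4 ~ 0.992
```

In practice the required degree also depends on the deadline, since each additional replica may consume resources that other steps need; balancing these two constraints is precisely what the FTR service's algorithms address.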