Fault-Tolerant Scheduling for Bag-of-Tasks Grid Applications

Cosimo Anglano and Massimo Canonico

Dipartimento di Informatica, Università del Piemonte Orientale, Alessandria, Italy
{cosimo.anglano, massimo.canonico}@unipmn.it

Abstract. In this paper we propose a fault-tolerant scheduler for Bag-of-Tasks Grid applications, called WorkQueue with Replication Fault Tolerant (WQR-FT), obtained by adding checkpointing and replication to the WorkQueue with Replication (WQR) scheduling algorithm. By using discrete-event simulation, we show that WQR-FT not only ensures the successful completion of all the tasks in a bag, but also achieves performance better than WQR and other fault-tolerant schedulers obtained by coupling WQR with replication only, or with checkpointing only.

1 Introduction

Grid Computing technology provides resource sharing and resource virtualization to end-users, allowing computational resources to be accessed as a utility. By dynamically coupling computing, networking, storage, and software resources, Grid technology enables the construction of virtual computing platforms capable of delivering unprecedented levels of performance. However, in order to take advantage of Grid environments, suitable application-specific scheduling strategies, able to select, for a given application, the set of resources that maximize its performance, must be devised [2]. The inherent wide distribution, heterogeneity, and dynamism of Grid environments make them better suited to the execution of loosely-coupled parallel applications, such as Bag-of-Tasks [11] (BoT) applications, than to that of tightly-coupled ones.
Bag-of-Tasks applications (parallel applications whose tasks are completely independent from one another) are particularly able to exploit the computing power provided by Grids [6] and, despite their simplicity, are used in a variety of domains, such as parameter sweep, simulations, fractal calculations, computational biology, and computer imaging. Therefore, scheduling algorithms tailored to this class of applications have recently received the attention of the Grid community [3, 4, 6]. Although these algorithms enable BoT applications to achieve very good performance, they suffer from a common drawback, namely their reliance on the assumption that the resources in a Grid are perfectly reliable, i.e. that they will

This work has been supported by the Italian MIUR under the project Società dell’Informazione, Sottoprogetto 3 - Grid Computing: Tecnologie abilitanti ed applicazioni per eScience, L. 449/97, anno 1999.

P.M.A. Sloot et al. (Eds.): EGC 2005, LNCS 3470, pp. 630-639, 2005. © Springer-Verlag Berlin Heidelberg 2005