CCK: An Improved Coordinated Checkpoint/Rollback Protocol for Dataflow Applications in KAAPI

Xavier Besseron, Samir Jafar*, Thierry Gautier and Jean-Louis Roch
Projet MOAIS (CNRS/INPG/INRIA/UJF) - Laboratoire ID-IMAG (UMR 5132)
Montbonnot ZIRST 51 avenue Jean Kuntzmann - 38330 Montbonnot - FRANCE
Email: (Xavier.Besseron, Samir.Jafar, Thierry.Gautier, Jean-Louis.Roch)@imag.fr

Keywords. Parallel Application, Dataflow Graph, Checkpoint/Recovery.

Abstract

Fault-tolerance protocols play an important role in today's long-running scientific parallel applications, because the probability of failure can be high given the number of unreliable components involved in a simulation. In this paper we present our approach and preliminary results for a new checkpoint/recovery protocol based on a coordinated scheme. This protocol is tightly coupled to the availability of an abstract representation of the execution.

1 Introduction

For several years, fault tolerance has been studied in the context of scalable parallel applications that simulate complex phenomena on large-scale clusters [10, 3]. Because of the number of unreliable components involved in the computation, the occurrence of faults is not an exceptional event: the system or the middleware should provide fault-tolerance protocols to mask failures. The subject has been well studied in the context of distributed systems and distributed middleware [4, 6]. What renews interest in it is that optimizing performance has become a major objective. Recent proposals study the runtime behavior of applications in order to specialize or extend published protocols [9, 3].

The idea behind this research direction is to automatically adapt a fault-tolerance protocol to the minimal dependability requirements of an application.
This paper fits in this context: the fault-tolerance protocol is specialized at the level of an abstract representation of the execution, which permits important optimizations at runtime. Our work is based on the KAAPI framework [10, 9], where the abstract representation of the execution was first designed so that scheduling algorithms could be plugged in independently of applications. In [10] it was shown that this abstract representation is well suited to defining the checkpoint of a local process. In this paper, this abstract representation is used to specialize a fault-tolerance protocol for long-running iterative simulations.

*This author is supported by a grant of the Syrian Government.

Coordinated checkpoint/rollback protocols are promising for large-scale parallel applications because they add no extra overhead to communication, and current experiments demonstrate their ability to scale up to thousands of processors [6, 3], global synchronization included. When a fault occurs, all processors restart from their most recent checkpoints, even those that did not fail. The two challenging performance problems for coordinated checkpoint/rollback protocols are:

1. How to speed up the restart of processors after a fault occurs?
2. How to reduce the amount of computation time lost when a fault occurs?

In [6, 3], the solution to (1) is that each processor keeps a local copy of its checkpoint and sends another copy either to stable storage [3] or to a fixed number of neighbor processors [6]. With this approach, all processors except the failed one restart from their local copy of the most recent checkpoint.

Our contribution is mainly to propose a solution to (2). Thanks to the abstract representation of the execution of any KAAPI application, it is possible to compute the set of computations strictly required to resend messages to the failed processor.
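
As an illustration of the kind of computation that a dataflow representation makes possible, the following minimal Python sketch selects, among the tasks of the surviving processors, those whose results feed a task of the failed processor either directly or transitively. It is not the CCK algorithm itself. The names `preds`, `owner` and `tasks_to_reexecute` are hypothetical, and the sketch assumes that the graph contains only tasks created since the last checkpoint and that the failed processor replays all of its own tasks.

```python
from collections import deque


def tasks_to_reexecute(preds, owner, failed):
    """Return the tasks of surviving processors that must be re-executed.

    preds  -- dict mapping a task to the tasks that produce its inputs
    owner  -- dict mapping a task to the processor that executed it
    failed -- identifier of the failed processor

    Assumption (not taken from the paper): every task in `owner` was
    created after the last checkpoint, so its result is lost on rollback.
    """
    # Start from every task that the failed processor has to replay.
    frontier = deque(t for t, p in owner.items() if p == failed)
    seen = set(frontier)
    # Walk the dataflow edges backwards to collect all producers.
    while frontier:
        task = frontier.popleft()
        for pred in preds.get(task, ()):
            if pred not in seen:
                seen.add(pred)
                frontier.append(pred)
    # Keep only producers located on processors that did not fail.
    return {t for t in seen if owner[t] != failed}


# Hypothetical example: five tasks spread over three processors.
owner = {"a": "P0", "b": "P0", "c": "P1", "d": "P1", "e": "P2"}
preds = {"b": ["a"], "c": ["a"], "d": ["c"], "e": ["b"]}

required = tasks_to_reexecute(preds, owner, "P1")  # {'a'}
```

In this example only task `a` has to be re-executed on the surviving processors when `P1` fails: tasks `b` and `e` produce nothing that `P1` consumes, so under the stated assumptions they could be skipped.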
Moreover, by adapting the local scheduling of tasks, we present an optimization that may further reduce this required computation without impacting the parallel performance of the execution.

0-7803-9521-2/06/$20.00 ©2006 IEEE.

The outline of the paper is as follows. The next section deals with related work. Section three presents our improved coordinated checkpoint/rollback protocol for KAAPI applications. It begins with an overview of