DDG Task Recovery for Cluster Computing ⋆

G. T. Nguyen 1, L. Hluchy 1, V. D. Tran 1, and M. Kotocova 2

1 Institute of Informatics, SAS, Dubravska cesta 9, 84237 Bratislava, Slovakia
giang.ui@savba.sk
2 Department of Computer Science, STU, Ilkovicova 3, 81219 Bratislava, Slovakia

Abstract. This paper presents a solution to the problem of transparent recovery of asynchronous distributed computation on clusters of workstations when a fault occurs on a node. If the system has fault-tolerant features, it can survive the fault and continue its computation. Performance degradation is unavoidable when hardware redundancy is not available. It is a large advantage if a long-running application can restart from a checkpoint instead of restarting the whole computation. This paper presents the fault-tolerant feature of the DDG environment, which is oriented to cluster systems without hardware spares.

1. Introduction

Advances in information technologies have led to increased interest in using clusters of workstations for large, long-running applications. The main advantages of cluster systems are scalability and a good price/performance ratio [6]. As the size of the system increases, the probability that a fault occurs in some node of the system also increases. Therefore, it is very important to ensure that applications can continue despite the occurrence of faults. Redundancy in the system is an important assumption for achieving this.

While it is still difficult to write an efficient parallel program for cluster computing, it is even more complicated to provide fault-tolerant features. High performance can be reached by using parallelism as much as possible [7], but little attention has been paid to sustaining it after a fault occurs, in combination with the scheduling environment. Most often, spare processors are not attractive and are not used.
In this paper, we propose a method that provides a fault-tolerant feature combined with task reallocation for parallel and distributed programs. The checkpointing and recovery process is designed so that the application incurs as little overhead as possible. The target platform of the method is distributed-memory systems with message-passing communication, such as cluster systems.

The rest of the paper is organized as follows. Section 2 describes the basic ideas of the DDG parallel programming environment and gives a general overview of the fault-tolerance problem in multiprocessor systems, particularly in the DDG environment. Section 3 describes the saving-checkpointing model. Section 4 describes task reallocation and system reconfiguration. Section 5 contains some experimental results, and Section 6 concludes the paper.

⋆ PPAM 2001, pp. 369-376, Springer-Verlag, LNCS 2328, ISBN 3-540-43792-4, ISSN 0302-9743. September 2001, Naleczow, Poland