Strategies for Storage of Checkpointing Data using Non-dedicated Repositories on Grid Systems

Raphael Y. de Camargo, Dept. of Computer Science, University of São Paulo, Brazil, rcamargo@ime.usp.br
Renato Cerqueira, Dept. of Computer Science, PUC-Rio, Brazil, rcerq@inf.puc-rio.br
Fabio Kon, Dept. of Computer Science, University of São Paulo, Brazil, kon@ime.usp.br

ABSTRACT

Dealing with the large amounts of data generated by long-running parallel applications is one of the most challenging aspects of Grid Computing. Periodic checkpoints might be taken to guarantee application progression, producing even more data. The classical approach is to employ high-throughput checkpoint servers connected to the computational nodes by high-speed networks. In the case of Opportunistic Grid Computing, we do not want to be forced to rely on such dedicated hardware. Instead, we want to use the shared Grid nodes to store application data in a distributed fashion.

In this work, we evaluate several strategies to store checkpoints on distributed non-dedicated repositories. We consider the tradeoff among computational overhead, storage overhead, and degree of fault-tolerance of these strategies. We compare the use of replication, parity information, and information dispersal (IDA). We used InteGrade, an object-oriented Grid middleware, to implement the storage strategies and perform evaluation experiments.

Categories and Subject Descriptors
C.2.4 [Computer-Communication Networks]: Distributed Systems—distributed applications; C.4 [Performance of Systems]: fault tolerance; E.4 [Coding and Information Theory]: error control codes

General Terms
Performance, Reliability

Keywords
Fault-tolerance, Distributed storage, Data coding, Checkpointing, Grid Computing

This work is supported by a grant from CNPq, Brazil, process #55.2028/02-9.
MGC'05, November 28-December 2, 2005, Grenoble, France.
Copyright 2005 ACM 1-59593-269-0/05/11 ...$5.00.

1. INTRODUCTION

Executing computationally intensive parallel applications in dynamic, heterogeneous environments, such as Computational Grids [3, 8, 4], is a daunting task. This is particularly true when using non-dedicated resources, as in the case of opportunistic computing [11]. Machines may fail, become unavailable, or change from idle to occupied unexpectedly, compromising the execution of applications.

Differently from dedicated resources, whose MTBF (mean time between failures) is typically on the order of weeks or even months [12], non-dedicated resources can become unavailable several times during a single day. Moreover, some machines can remain unavailable for longer than they remain available. A fault-tolerance mechanism, such as checkpoint-based rollback recovery [7], can be used to guarantee application execution progress in the presence of frequent failures. The checkpointing mechanism can also be used for process migration, allowing the implementation of efficient preemptive scheduling algorithms for parallel applications on the Grid.

The generated checkpoints need to be saved on a stable storage medium. The machine where the application is running cannot be considered stable storage, because it can become unavailable. The usual solution is to install checkpoint servers connected to the nodes by a high-speed network.
But since our focus is on an opportunistic computing environment, we do not want to be forced to rely on such dedicated hardware. The natural choice would be to use the Grid nodes themselves as the storage medium for checkpoints.

We use InteGrade (http://integrade.incubadora.fapesp.br/) [10], a multi-university effort to build a Grid middleware that leverages the computing power of idle shared workstations, as the platform for the implementation of the distributed storage system and experiments. The current InteGrade version has support for portable checkpointing of sequential, parameter sweeping, and BSP parallel applications [5].

A distributed storage system must ensure scalability and fault-tolerance for the storage, management, and recovery of application data. We expect to fulfill the scalability requirement by developing algorithms to distribute the data over non-dedicated repositories. We explore several techniques to provide fault-tolerance, such as data replication, Information Dispersal Algorithms (IDA) [17], and the addition of parity information.

In this paper, we describe the implementation of a dis-
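To illustrate the tradeoff between storage overhead and fault-tolerance behind the techniques above, the following is a minimal sketch of information dispersal in the style of Rabin's IDA, working over the prime field GF(257). It is not InteGrade's actual implementation; the function names and the parameters `m` (fragments needed for recovery) and `n` (fragments stored) are our own illustrative choices. Each group of `m` checkpoint bytes is treated as the coefficients of a polynomial, and fragment `i` stores its evaluation at `x = i+1`; any `m` fragments then determine the coefficients again.

```python
PRIME = 257  # smallest prime above 255, so every byte is a field element

def ida_encode(data, m, n):
    """Split `data` into n fragments; any m of them suffice to rebuild it."""
    padded = data + bytes((-len(data)) % m)  # zero-pad to a multiple of m
    fragments = [[] for _ in range(n)]
    for g in range(0, len(padded), m):
        group = padded[g:g + m]
        for i in range(n):
            x = i + 1
            # Horner evaluation of group[0] + group[1]*x + ... mod PRIME
            val = 0
            for coeff in reversed(group):
                val = (val * x + coeff) % PRIME
            fragments[i].append(val)
    return fragments  # fragment i corresponds to evaluation point x = i+1

def _solve(xs, ys):
    """Solve the Vandermonde system V(xs) * c = ys over GF(PRIME)
    by Gauss-Jordan elimination, returning the coefficients c."""
    m = len(xs)
    A = [[pow(x, j, PRIME) for j in range(m)] + [y % PRIME]
         for x, y in zip(xs, ys)]
    for col in range(m):
        piv = next(r for r in range(col, m) if A[r][col])
        A[col], A[piv] = A[piv], A[col]
        inv = pow(A[col][col], PRIME - 2, PRIME)  # Fermat modular inverse
        A[col] = [a * inv % PRIME for a in A[col]]
        for r in range(m):
            if r != col and A[r][col]:
                f = A[r][col]
                A[r] = [(a - f * b) % PRIME for a, b in zip(A[r], A[col])]
    return [row[m] for row in A]

def ida_decode(pairs, m, length):
    """Rebuild the original bytes from any m (evaluation point, fragment) pairs."""
    xs = [x for x, _ in pairs[:m]]
    columns = zip(*(frag for _, frag in pairs[:m]))
    out = bytearray()
    for ys in columns:
        out.extend(_solve(xs, list(ys)))  # coefficients are the original bytes
    return bytes(out[:length])
```

With `m = 3` and `n = 5`, each fragment holds one third of the checkpoint, so the total stored is 5/3 of the original size while tolerating the loss of any `n - m = 2` repositories; plain 5-way replication tolerates 4 losses but costs 5 times the original size, which is the overhead/fault-tolerance tradeoff this paper evaluates.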