C³: A System for Automating Application-level Checkpointing of MPI Programs

Greg Bronevetsky, Daniel Marques, Keshav Pingali, Paul Stodghill ⋆

Department of Computer Science, Cornell University, Ithaca, NY 14853

Abstract. Fault tolerance is becoming necessary on high-performance platforms. Checkpointing techniques make programs fault-tolerant by saving their state periodically and restoring this state after failure. System-level checkpointing saves the state of the entire machine on stable storage, but this usually has too much overhead. In practice, programmers do manual application-level checkpointing by writing code to (i) save the values of key program variables at critical points in the program, and (ii) restore the entire computational state from these values during recovery. However, this can be difficult to do in general MPI programs. In [2], [3] we presented a distributed checkpoint coordination protocol that handles MPI's point-to-point and collective constructs while dealing with the unique challenges of application-level checkpointing. We have implemented our protocols as part of a thin software layer that sits between the application program and the MPI library, so no modifications to the MPI library are required. This thin layer is used by C³ (the Cornell Checkpoint (pre-)Compiler), a tool that automatically converts an MPI application into an equivalent fault-tolerant version. In this paper, we summarize our work on this system to date. We also present experimental results showing that the overhead introduced by the protocols is small, and we discuss a number of future areas of research.

1 Introduction

The problem of implementing software systems that can tolerate hardware failures has been studied extensively by the distributed systems community [6].
In contrast, the parallel computing community has largely ignored this problem because until recently, most parallel computing was done on relatively reliable big-iron machines whose mean-time-between-failures (MTBF) was much longer than the execution time of most programs. However, trends in high-performance computing, such as the popularity of custom-assembled clusters, the increasing complexity of parallel machines, and the dawn of Grid computing, are increasing the probability of hardware failures, making it imperative that parallel programs tolerate such failures.

⋆ This work was supported by NSF grants ACI-9870687, EIA-9972853, ACI-0085969, ACI-0090217, ACI-0103723, and ACI-0121401.

One solution that has been employed successfully for parallel programs is application-level checkpointing. In this approach, the programmer is responsible for saving computational state periodically, and for restoring this state after failure. In many programs,