Distributed Implementation of OpenMP Based on Checkpointing Aided Parallel Execution

Éric Renault
GET / INT, Samovar UMR INT-CNRS 5157
91011 Évry, France
eric.renault@int-evry.fr

Abstract. Checkpointers are used to secure the execution of sequential and parallel programs. This article shows that they can also be used to automatically generate a parallel program from a sequential one, and that the resulting program can be executed on any kind of distributed parallel system. The article also presents how this new technique can be included in the usual compilation chain to provide a distributed implementation of OpenMP. Finally, some performance measurements are discussed.

1 Introduction

The way parallel computing is approached has changed radically over the past years, with the introduction of cluster computing [1], grid computing [2], peer-to-peer computing [3], and so on. However, while platforms have evolved, development tools have remained the same. As an example, HPF [4], PVM [5], MPI [6] and more recently OpenMP [7] have been the main tools to specify parallelism in programs (especially when supercomputers were the main platform for parallel computing), and they are still used in programs for cluster and grid architectures. This suggests that these tools are generic enough to follow the evolution of parallel computing. However, developers are still required to specify almost every detail of when and how parallel primitives (for example sending and receiving messages) shall be executed.

Much work [8,9] has been done to automatically extract parallel opportunities from sequential programs, so that developers do not have to deal with a specific parallel library, but most methods have difficulty identifying these parallel opportunities outside nested loops. Recent research in this field [10,11], based on pattern-matching techniques, makes it possible to substitute part of a sequential program with an equivalent parallel subprogram.
However, this promising technique must be paired with as large a database as possible of sequential algorithm models, together with a parallel implementation of each model for every target architecture.

B. Chapman et al. (Eds.): IWOMP 2007, LNCS 4935, pp. 195–206, 2008. © Springer-Verlag Berlin Heidelberg 2008

At the same time, the number of problems that can be solved using parallel machines grows every day, and applications that require weeks (or months, or even more) of computation time are increasingly common. Thus, checkpointing techniques [12,13,14] have been developed to generate snapshots of applications, so that if a problem occurs the execution can be resumed from a snapshot instead of being restarted from the beginning. Solutions have been developed to resume the execution from a checkpoint on the same machine or on a remote machine,