Reparallelization and Migration of OpenMP Programs

Michael Klemm, Matthias Bezold, Stefan Gabriel, Ronald Veldema, and Michael Philippsen
Computer Science Department 2, University of Erlangen-Nuremberg
Martensstr. 3, 91058 Erlangen, Germany
{klemm, veldema, philippsen}@cs.fau.de, bezold@msbezold.de, stefan-gabriel@gmx.net

Abstract

Typical computational grid users target only a single cluster and have to estimate the runtime of their jobs. Job schedulers prefer short-running jobs to maintain a high system utilization. If the user underestimates the runtime, premature termination causes computation loss; overestimation is penalized by long queue times. As a solution, we present an automatic reparallelization and migration of OpenMP applications. A reparallelization is dynamically computed for an OpenMP work distribution when the number of CPUs changes. The application can be migrated between clusters when an allocated time slice is exceeded. Migration is based on a coordinated, heterogeneous checkpointing algorithm. Both reparallelization and migration enable the user to freely use computing time at more than a single point of the grid. Our demo applications successfully adapt to the changed CPU setting and smoothly migrate between, for example, clusters in Erlangen, Germany, and Amsterdam, the Netherlands, that use different processors. Benchmarks show that reparallelization and migration impose average overheads of about 4% and 2%, respectively.

1. Introduction

While computational grids offer novel computing opportunities, the boundaries between their individual clusters are still visible to users. In addition to heterogeneity, the user is faced with a cluster's job-scheduling mechanism that assigns computing resources to jobs. Usually, the scheduler prefers short-running over long-running jobs, and it prefers jobs that need only a small number of CPUs over more demanding ones.
Short jobs with only a few CPUs increase the cluster's utilization, while long-running jobs or jobs that require many CPUs often cause unproductive reservation holes [26]. To be fair to waiting users, the job manager terminates jobs that exceed their claimed resource limit. A terminated application loses all computed work. However, it is difficult to provide an exact estimate of a job's runtime, as the runtime often depends on the input and is influenced by unpredictable environmental factors (e.g., the load of the network, which in turn depends on a cluster's overall load).

Generally, two "solutions" exist for a user. First, a user can request an overly long time slice and accept the penalty of a longer wait until the job runs. As an educated guess, a user might double the estimated time to avoid losing results upon termination of the program. Second, the program can be rewritten into a number of smaller phases, which can then run within more predictable time boundaries.

Reparallelization and migration are not only a third solution to this problem; in addition, they blur the boundaries between clusters. The user can start the application with a certain number of CPUs and a small time estimate at any cluster of the grid. If the application is about to exceed the time slice or if a more powerful cluster becomes available, the application checkpoints and migrates to another cluster, transparently adapting to the potentially different architecture and to a changed degree of parallelism.

Our solution consists of two parts: (1) OpenMP [15] programs can transparently alter the number of threads during the execution of a parallel region, and (2) checkpoint-based migration between clusters is supported, which enables an application to halt on one cluster and to resume on another one. The origin and the target clusters can have different architectures. Reparallelization is crucial for migration, since the number of available CPUs is likely to change.
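To illustrate the idea of recomputing a work distribution, the following sketch (plain Java, not actual Jackal/JaMP code; class and method names are hypothetical) shows how the iteration bounds of a block-distributed parallel loop can be recomputed when the number of threads changes, e.g., after a migration:

```java
// Illustrative sketch only: recomputing a block distribution of
// loop iterations for a changed thread count. Names are hypothetical
// and do not reflect the Jackal/JaMP implementation.
public class BlockDistribution {

    /**
     * Returns {lower, upper} iteration bounds (upper exclusive) for
     * thread tid out of numThreads over the iteration space [0, total).
     * The first (total % numThreads) threads receive one extra iteration.
     */
    static int[] bounds(int tid, int numThreads, int total) {
        int chunk = total / numThreads;
        int rem = total % numThreads;
        int lower = tid * chunk + Math.min(tid, rem);
        int upper = lower + chunk + (tid < rem ? 1 : 0);
        return new int[] { lower, upper };
    }

    public static void main(String[] args) {
        int total = 10;
        // Before migration: the iteration space split over 4 threads.
        for (int t = 0; t < 4; t++) {
            int[] b = bounds(t, 4, total);
            System.out.println("4 threads, t" + t + ": [" + b[0] + ", " + b[1] + ")");
        }
        // After migration: the same iteration space redistributed over 3 threads.
        for (int t = 0; t < 3; t++) {
            int[] b = bounds(t, 3, total);
            System.out.println("3 threads, t" + t + ": [" + b[0] + ", " + b[1] + ")");
        }
    }
}
```

Since the bounds are pure functions of the thread count, no iteration is lost or duplicated when the count changes between parallel regions; the actual mechanism inside a running region is described in Section 3.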
Even without migration, reparallelization allows the next local resource reservation to request fewer or more CPUs, depending on the overall system load and queuing times. Our prototype is implemented on top of the Software Distributed Shared Memory (S-DSM) system Jackal [23], a shared-memory emulation for Java on clusters. Jackal's compiler supports JaMP [12], an OpenMP 2.5 port to Java.

The paper is organized as follows. Section 2 covers related work. Section 3 describes the OpenMP reparallelization. Section 4 discusses the migration and the distributed checkpointing algorithm. Section 5 presents the performance of both OpenMP reparallelization and migration.