Nomadic Migration: A New Tool for Dynamic Grid Computing

Gerd Lanfermann, Gabrielle Allen, Thomas Radke, Edward Seidel

Abstract- We describe the design and implementation of a technology which provides an application with the ability to seek out and exploit remote computing resources by migrating tasks from site to site, dynamically adapting the application to a changing Grid environment. The motivation for this migration framework, dubbed “The Worm”, originated from the experience of having an abundance of computing time for simulations, which is distributed over multiple sites and split into time chunks by queuing systems. We describe the architecture of the Worm, including how new or more suitable resources are located and how the payload simulation is migrated to these resources following a trigger event. The migration technology presented here is designed to be used for any application, including large-scale HPC simulations.

I. INTRODUCTION

Grid computing involves utilizing computational resources, connected by networks, as needed to solve problems. Recent advances in Grid computing are such that applications are now in a position to begin to exploit a wide range of available computer resources, simultaneously, sequentially, or both, enabling many different new and innovative Grid usage scenarios.

Adding up the theoretically available computing time across a pool of standard computers, such as idle workstations, or summing the total computing time granted to a research group by several independent supercomputing sites will typically yield an impressive capacity of processing power. But these resources neither share a homogeneous architecture base nor are they all continuously accessible over a long period of time.

Here we have focused on a new type of Grid computing appropriate for its dynamic character: self-determined migration of a simulation from one site or collection of sites [4] to any other.
We present a migration technology, dubbed “the Worm”, designed for parallel applications with high IO and memory requirements, driven by the need for performing large-scale simulations on HPC machines [5]. The technology is also applicable to the efficient use of idle cycles from small machine pools.

G. Lanfermann, G. Allen, T. Radke and E. Seidel are with the Max-Planck-Institut für Gravitationsphysik, Albert-Einstein-Institut, Golm (AEI). E. Seidel is also with the National Center for Supercomputing Applications, Champaign, IL (NCSA).

0-7695-1296-8/01 $10.00 © 2001 IEEE

II. PROTOTYPE EXPERIENCES

A prototype implementation of the Worm was demonstrated at Supercomputing 2000, running across the machines of the EGrid Testbed [9]. The Worm was implemented in the Cactus programming framework [1], [2]. The Worm’s payload, which describes the simulation code provided by scientists, was a simulation of a wave equation, although any real application written in the Cactus framework can trivially be incorporated as a payload.

Fig. 1. We show the main components of a resource-aware, self-replicating Worm. The user payload is encapsulated by the Worm Kernel, acting as a contact point to the resource detector and various application information servers (AIS). The transfer units stage executables to machines and provide storage for checkpoint files during hibernation.

The participating EGrid sites [7] published characteristic system profiles and load information to a central Resource Manager. While the prototype Worm simulation was running on machines in the Grid, it was able to query this central resource service, seeking available machines of a certain configuration. Using this information, it could then migrate to another site according to some predefined criteria.

III. WORM TECHNOLOGY IN A DYNAMIC GRID ENVIRONMENT

Based on the early prototype experiences [9], a more sophisticated Worm framework was designed to provide the user with a range of migration policies and to overcome numerous technical challenges when dealing with heterogeneous Grid environments. With the new Worm technology, application migration can be initiated in a variety of ways, which range from the user’s manual triggering of a migration event to fully automatic application relocation. By monitoring the simulation performance and profiling the current hardware with small benchmarking programs, a detailed profile of the current execution environment is generated. Periodic lookups are performed to see if “better” resources for this individual profile have become available, in which case a migration to the better-suited resource may be initiated. Note that in this context “better” does not necessarily mean “faster” but also “cheaper”, “more