Managing Mixed-Use Clusters with Cluster-on-Demand

Justin Moore, David Irwin, Laura Grit, Sara Sprenkle, and Jeff Chase
Department of Computer Science, Duke University
{justin,irwin,grit,sprenkle,chase}@cs.duke.edu

Abstract

Although clusters offer inexpensive computing power, they are difficult and expensive to manage, particularly for user communities with diverse software needs. This paper presents Cluster-on-Demand (COD), a cluster operating system framework for mixed-use clusters. COD interposes on standard network management services (DHCP, NIS, and DNS) to partition a cluster into dynamic virtual clusters (vclusters) with independent installed software, name spaces, access controls, and network storage volumes. COD allocates nodes to vclusters on the fly, reconfiguring them as needed with PXE network boots. A key element of COD is a protocol to resize vclusters dynamically in cooperation with pluggable middleware components such as batch schedulers. The COD framework is a key building block for automated management of computing utilities and grids.

1 Introduction

Clustering inexpensive computers is an effective way to obtain reliable, scalable computing power for network services and compute-intensive applications. Since clusters have a high initial cost of ownership, including space, power conditioning, and cooling equipment, leasing or sharing access to a common cluster is an attractive solution when demands vary over time. Shared clusters enable more effective use of resources by multiplexing, and they offer economies of scale in administration as personnel costs grow even as hardware costs decline.

There has been a great deal of research and progress in managing clusters since the early days of the NOW project [5]. The most successful systems today maintain a homogeneous software environment for a specific class of applications.
These systems, including Beowulf [1], load-leveling batch schedulers [2, 3], Millennium [10], Rocks [21], and other elements of the NPACI grid toolset, target batch computations written for common OS or middleware APIs. These are powerful tools, but one size does not fit all: users of a shared cluster should be free to select the software environments that best support their needs, which may involve multiple operating systems, multiple batch classes, Web applications, and multiple Grid points-of-presence, each serving a different segment of the user community. Tools to manage mixed-use clusters are still lacking.

This paper describes the architecture and implementation of Cluster-on-Demand (COD), a system to enable rapid, automated, on-the-fly partitioning of a physical cluster into multiple independent virtual clusters. A virtual cluster (vcluster) is a subset of cluster nodes configured for a common purpose, with associated user accounts and storage resources, a user-specified software environment, and a private IP address block and DNS naming domain. COD vclusters are dynamic; their node allotments may change according to demand or resource availability.

COD was inspired by Oceano [7], an IBM Labs project to automate a Web server farm. Like Oceano, COD leverages remote-boot technology to reconfigure cluster nodes using database-driven network installs from a set of user-specified configuration templates, under the direction of a policy-based resource manager. Emulab [26] uses a similar approach to configure groups of nodes for network emulation experiments on a shared testbed. Section 2.2 sets COD in context with these and other related systems.

* This work is supported in part by the U.S. National Science Foundation (EIA-9972879 and EIA-9870728), by HP Labs, and by IBM through a SUR equipment grant and an IBM faculty research award.
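To make the vcluster concept concrete, the following is a minimal sketch of a user-specified configuration template and whole-node allocation in the spirit described above. All names here (`VClusterTemplate`, `carve_vcluster`, the field layout) are hypothetical illustrations, not COD's actual schema or API; the point is only that a template bundles the attributes a vcluster carries (software image, IP block, storage), and that the resource manager hands out complete machines rather than slivers of them.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class VClusterTemplate:
    """A user-specified configuration template (hypothetical schema)."""
    name: str            # also used as the vcluster's DNS naming domain
    os_image: str        # software environment installed via PXE network boot
    ip_block: str        # private IP address block, e.g. "10.1.0.0/24"
    storage_volumes: List[str] = field(default_factory=list)
    min_nodes: int = 1   # smallest allotment the vcluster will accept

def carve_vcluster(free_pool: List[str], template: VClusterTemplate,
                   requested: int) -> List[str]:
    """Toy stand-in for a policy-based allocator.

    Grants the request only if the free pool can cover it; granted
    nodes leave the pool, mirroring the paper's model of assigning
    complete machines (not per-node slivers) to a vcluster.
    """
    if requested < template.min_nodes or requested > len(free_pool):
        return []                      # request denied; pool unchanged
    granted = free_pool[:requested]
    del free_pool[:requested]          # granted nodes are no longer free
    return granted

pool = [f"node{i:02d}" for i in range(8)]
batch = VClusterTemplate(name="batch", os_image="linux-batch.img",
                         ip_block="10.1.0.0/24")
granted = carve_vcluster(pool, batch, 3)
print(granted)     # ['node00', 'node01', 'node02']
print(len(pool))   # 5 nodes remain free for other vclusters
```

Because node allotments are dynamic, a resize in this sketch is simply another call to the allocator (to grow) or a return of nodes to the pool (to shrink), subject to the template's minimum.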
The primary contribution of COD is to extend these techniques to a general framework for a cluster operating system. Like a conventional OS, COD allocates resources to its users, isolates user environments from one another, mediates interactions with the external environment, and manages shared resources dynamically as demands change. Rather than allocating slivers of each node's memory and CPU to user processes, COD allocates complete machines to vclusters shared by a group of users. Users assume full control of their machines down to the bare metal: COD installs user-specified software in each vcluster, analogously to a conventional OS instantiating a user program in a process. Most impor-