Impact of Workload and System Parameters on Next Generation Cluster Scheduling Mechanisms

Yanyong Zhang, Anand Sivasubramaniam
Department of Computer Science & Engineering
The Pennsylvania State University
University Park, PA 16802
{yyzhang,anand}@cse.psu.edu

José Moreira, Hubertus Franke
IBM T. J. Watson Research Center
P. O. Box 218
Yorktown Heights, NY 10598
{jmoreira,frankeh}@us.ibm.com

Abstract

Scheduling of processes onto processors of a parallel machine has always been an important and challenging area of research. The issue becomes even more crucial and difficult as we gradually progress to the use of off-the-shelf workstations, operating systems, and high bandwidth networks to build cost-effective clusters for demanding applications. Clusters are gaining acceptance not just in scientific applications that need supercomputing power, but also in domains such as databases, web service and multimedia, which place diverse Quality-of-Service (QoS) demands on the underlying system. Further, these applications have diverse characteristics in terms of their computation, communication and I/O requirements, making conventional parallel scheduling solutions, such as space sharing or coscheduling, unattractive. At the same time, leaving it to the native operating system of each node to make decisions independently can lead to ineffective use of system resources whenever there is communication. Instead, an emerging class of dynamic coscheduling mechanisms, that attempt to take remedial actions to guide the system towards coscheduled execution without requiring explicit synchronization, offers a lot of promise for cluster scheduling.

Using a detailed simulator, this paper evaluates the pros and cons of different dynamic coscheduling alternatives, while comparing their advantages over traditional coscheduling (and not performing any coordinated scheduling at all).
The impact of dynamic job arrivals, job characteristics and different system parameters on these alternatives is evaluated in terms of several performance criteria. In addition, heuristics to enhance one of the alternatives even further are identified, classified and evaluated. It is shown that these heuristics can significantly outperform the other alternatives over a spectrum of workload and system parameters, and are thus a much better option for clusters than conventional coscheduling.

Keywords: Parallel Scheduling, Coscheduling, Dynamic Coscheduling, Clusters, Simulation.

This research has been supported in part by NSF Career Award MIP-9701475, grant CCR-9900701, and equipment grants CDA-9617315 and EIA-9818327.

This is a substantially revised and extended version of a preliminary paper [50] that appeared in Proceedings of the ACM International Conference on Supercomputing (ICS), pages 100-109, May 2000. The revisions include a more detailed evaluation, consideration of more workload and system parameters, and the design and evaluation of a novel set of heuristics to be used in conjunction with the proposed Periodic Boost mechanism.