GLB: A Low-Cost Scheduling Algorithm for Distributed-Memory Architectures

Andrei Rădulescu, Arjan J.C. van Gemund
Department of Information Technology and Systems
Delft University of Technology
P.O. Box 5031, 2600 GA Delft, The Netherlands

Abstract

This paper proposes a new compile-time scheduling algorithm for distributed-memory systems, called Global Load Balancing (GLB). GLB is intended as the second step in the multi-step class of scheduling algorithms. Experimental results show that, compared with known scheduling algorithms of the same low complexity, the proposed algorithm improves schedule lengths by up to . Compared to algorithms of higher-order complexity, the typical schedule lengths obtained with the proposed algorithm are at most twice as long.

1. Introduction

One of the main problems in the field of scheduling algorithms for distributed-memory systems is finding time-efficient heuristics that produce good schedules. The goal of scheduling is to minimize the parallel execution time of the scheduled program. Except for very restricted cases, the scheduling problem has been shown to be NP-complete [3]. Consequently, much research effort has been spent on finding good heuristics. For shared-memory architectures, it has been proven that even a low-cost scheduling algorithm is guaranteed to produce acceptable performance [4]. For distributed-memory systems, however, no such guarantee exists.

The heuristic algorithms used for task scheduling on distributed-memory machines can be divided into (a) scheduling algorithms for an unbounded number of processors and (b) scheduling algorithms for a bounded number of processors. Scheduling for an unbounded number of processors is easier, because the constraint on the number of processors need not be considered. Within this class, a distinction can be made between clustering and duplication-based algorithms.
Clustering algorithms, such as Dominant Sequence Clustering (DSC) [14] and Edge Zeroing (EZ) [10], group tasks together to reduce communication. Duplication-based algorithms, such as Scalable Task Duplication based Scheduling (STDS) [2] and Duplication First Reduction Next (DFRN) [8], further reduce communication delays by duplicating tasks.

An important class of scheduling algorithms for a bounded number of processors is the class of list scheduling algorithms, such as Modified Critical Path (MCP) [12] and Earliest Task First (ETF) [5], which sequentially schedule "ready" tasks, each to that task's "best" processor. A ready task is a task whose dependencies are all satisfied, and the best processor is determined by the criterion used to select processors (e.g., the processor on which the task can start the earliest). Secondly, duplication can also be performed for a bounded number of processors (e.g., Duplication Scheduling Heuristic (DSH) [6] and Critical Path Fast Duplication (CPFD) [1]). A third approach is to use a multi-step method, in which three steps can be defined: (1) clustering, (2) cluster mapping, and (3) task ordering. The clustering step groups tasks into clusters, the cluster mapping step maps the clusters onto the available number of processors, and the task ordering step orders the execution of the tasks within each processor, according to task dependencies.

In practical situations, where the number of tasks may be extremely large, the time complexity of a scheduling algorithm is very important. Scheduling for an unbounded number of processors can be performed with low complexity. However, the necessary number of processors is rarely available. Within the class of scheduling algorithms for a bounded number of processors, duplication-based algorithms have high complexities, because they perform a backward search in order to duplicate tasks.
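The list scheduling approach described above can be illustrated with a minimal sketch: ready tasks (all predecessors scheduled) are repeatedly assigned to the "best" processor, here taken to be the one yielding the earliest start time, in the spirit of ETF's criterion. The task-graph representation, cost model, and FIFO ready-queue order are illustrative assumptions, not the specific heuristics of MCP or ETF.

```python
def list_schedule(tasks, succ, cost, comm, num_procs):
    """Sketch of list scheduling on a task graph (DAG).

    tasks: iterable of task ids; succ: task -> list of successor tasks;
    cost: task -> compute cost; comm: (u, v) -> communication cost,
    paid only when u and v run on different processors.
    Returns a mapping task -> (processor, start_time).
    """
    # Build predecessor lists and the count of unscheduled predecessors.
    preds = {t: [] for t in tasks}
    for u in tasks:
        for v in succ.get(u, []):
            preds[v].append(u)
    remaining = {t: len(preds[t]) for t in tasks}

    proc_free = [0] * num_procs   # time at which each processor becomes idle
    placement = {}                # task -> (processor, start_time)
    ready = [t for t in tasks if remaining[t] == 0]

    while ready:
        t = ready.pop(0)          # FIFO pick; real heuristics use priorities
        best = None
        for p in range(num_procs):
            # Data from a predecessor on another processor arrives only
            # after its finish time plus the communication delay.
            data_ready = max(
                (placement[u][1] + cost[u]
                 + (comm.get((u, t), 0) if placement[u][0] != p else 0)
                 for u in preds[t]),
                default=0)
            start = max(proc_free[p], data_ready)
            if best is None or start < best[1]:
                best = (p, start)
        placement[t] = best
        proc_free[best[0]] = best[1] + cost[t]
        # Newly satisfied successors become ready.
        for v in succ.get(t, []):
            remaining[v] -= 1
            if remaining[v] == 0:
                ready.append(v)
    return placement
```

Note that selecting the earliest-start processor for every ready task gives the O(V·P + E) inner loop its cost; the heuristics cited above differ mainly in how the next ready task is prioritized.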
Compared to clustering algorithms, list scheduling algorithms have higher complexities, because they must additionally satisfy the constraint of a limited number of processors. Multi-step scheduling methods achieve the same low complexity as clustering, provided the other two steps, cluster mapping and task ordering, have the same or lower complexity than the clustering step. Because task ordering is basically a topological sort, it can be performed fast (e.g., Ready Critical Path (RCP) [13], Free Critical Path (FCP) [13]). Cluster mapping can also be performed at a low cost, using a