Multi-Dimensional Dynamic Loop Scheduling Algorithms

Anthony T. Chronopoulos #1, Lionel M. Ni *2, Satish Penmatsa #3

# Department of Computer Science, University of Texas at San Antonio
One UTSA Circle, San Antonio, TX 78249, USA
1 atc@cs.utsa.edu
3 spenmats@cs.utsa.edu

* Department of Computer Science and Engineering, The Hong Kong University of Science and Technology
Clear Water Bay, Kowloon, Hong Kong
2 ni@cse.ust.hk

Abstract: Distributed Computing Systems are a viable and less expensive alternative to parallel computers. However, a serious difficulty in concurrent programming of a distributed system is how to deal with scheduling and load balancing of such a system which may consist of heterogeneous computers. Loop scheduling schemes for parallel computers and computer clusters have been proposed in the past. All these schemes are one-dimensional because they partition only the outermost loop of a nested loop construct. In this work, we consider scheduling nested loops with many dimensions. We propose a new methodology which partitions many levels (or dimensions) of nested loops. These new schemes show superior performance over the existing schemes. We implement our new schemes on a network of computers and make performance comparisons with other existing schemes. We expect the new schemes to be particularly useful for multi-core systems because of the fine granularity of the generated tasks.

I. INTRODUCTION

Loops are one of the largest sources of parallelism in scientific programs. If the iterations of a loop have no interdependencies, each iteration can be considered as a task and can be scheduled independently. Such parallel loops are often called DOALL loops. The loops that have interdependencies are often called DOACROSS loops. Loop scheduling schemes for parallel and distributed systems have been proposed and studied in the past.
For example, see [1]–[22] and references therein.

Heterogeneous systems are characterized by the heterogeneity of their components and a large number of processors. Some distributed schemes that take into account the characteristics of the different components of the heterogeneous system have been devised, for example 1) Tree Scheduling and 2) Weighted Factoring ([7], [3]). Distributed loop scheduling schemes, which can be applied to DOALL and DOACROSS loops and take into account the available computing powers of the computers, have also been devised ([4], [23], [24], [13]).

This work was supported in part by the National Science Foundation under grant CCR-0312323. A. T. Chronopoulos is an IEEE Senior Member, L. M. Ni is an IEEE Fellow, and S. Penmatsa is an IEEE Student Member.

In this article, we review some well-known scheduling schemes for DOALL loops. The 'simple' versions of these schemes are suitable for homogeneous systems with single-user-job (dedicated) execution mode. The 'distributed' versions are suitable for heterogeneous systems.

A key issue in achieving high (delivered) performance in concurrent processing lies in scheduling nested program loops to execute as efficiently as possible. All previously proposed dynamic scheduling schemes partition only the outermost loop of a program's loop structure and assign the resulting tasks (chunks) to the processors. This is not efficient for multi-dimensional nested loops. All the previous 'multi-dimensional' loop scheduling schemes for nested loops (e.g. [25]) are static. Thus, these methods are inefficient when the loop task sizes are unequal. This calls for devising new 'multi-dimensional' dynamic loop scheduling methods. To our knowledge, this has not been attempted before. Here, we propose new dynamic loop scheduling schemes for computing nested DOALL loops on parallel and distributed systems.
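To make the difference between one-dimensional and multi-dimensional partitioning concrete, the following is a minimal sketch for a doubly nested DOALL loop. It is an illustration only and not one of the schemes proposed in this paper: the names `Tile` and `partition_2d` are hypothetical, and fixed tile extents are assumed purely to keep the example short.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A rectangular block of the iteration space of a doubly nested DOALL loop:
//   for (i = i_begin; i < i_end; ++i)
//     for (j = j_begin; j < j_end; ++j) body(i, j);
// (Hypothetical type, introduced only for this illustration.)
struct Tile {
    long i_begin, i_end;  // half-open range in the outer loop
    long j_begin, j_end;  // half-open range in the inner loop
    long size() const { return (i_end - i_begin) * (j_end - j_begin); }
};

// Split an n1 x n2 iteration space into tiles of at most ci x cj iterations.
// Assumption: fixed tile extents; a dynamic scheme would vary them over time.
// Passing cj == n2 reproduces partitioning of the outermost loop only.
std::vector<Tile> partition_2d(long n1, long n2, long ci, long cj) {
    assert(n1 >= 0 && n2 >= 0 && ci > 0 && cj > 0);
    std::vector<Tile> tiles;
    for (long i = 0; i < n1; i += ci) {
        for (long j = 0; j < n2; j += cj) {
            // Clamp the tile at the boundary of the iteration space.
            tiles.push_back({i, std::min(i + ci, n1), j, std::min(j + cj, n2)});
        }
    }
    return tiles;
}
```

For an 8 x 100 iteration space, `partition_2d(8, 100, 2, 100)` yields 4 tasks of 200 iterations each (outermost loop only), whereas `partition_2d(8, 100, 2, 25)` yields 16 tasks of 50 iterations each. The finer tasks give a dynamic scheduler more freedom to balance load when iteration costs are irregular.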
We implemented the new schemes (in C++ and MPI) on a network of computers. We show, through simulation results on nested loops with irregular iteration task sizes, that the new schemes are superior to the previous schemes.

The following notation is used throughout the paper:

PE is a processor in the parallel or heterogeneous system.
I is the total number of iterations of a parallel loop.
p is the number of worker PEs in the parallel or heterogeneous system which execute the computational tasks.
P_1, P_2, ..., P_p represent the p worker PEs in the system.
N is the number of scheduling steps (= the total number of chunks).
A chunk is a group of a few consecutive iterations. C_i is the size of the i-th chunk (where i = 1, 2, ..., N).
k_i (where k_i ∈ {1, 2, ..., p}) is the index of the worker PE that makes the i-th request; the i-th chunk is assigned to this PE.
R_i is the remaining number of tasks after scheduling the i-th chunk.

1-4244-1388-5/07/$25.00 © 2007 IEEE. 2007 IEEE International Conference on Cluster Computing.
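The notation above (I, p, N, C_i, k_i, R_i) can be illustrated with a small sketch that records one entry per scheduling step. Two assumptions are made here that do not come from this section: every chunk has the same fixed size except possibly the last, and workers request chunks in round-robin order, which stands in for the arrival order of requests in a real run. The names `Step` and `schedule_fixed_chunks` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One scheduling step, using the paper's notation.
struct Step {
    int i;           // step index, 1..N
    long chunk;      // C_i: size of the i-th chunk
    long remaining;  // R_i: iterations left after the i-th chunk is scheduled
    int worker;      // k_i: index (1..p) of the worker PE receiving the chunk
};

// Schedule I iterations among p workers.
// Assumptions (illustrative only): fixed chunk size, round-robin request order.
std::vector<Step> schedule_fixed_chunks(long I, int p, long chunk_size) {
    assert(I >= 0 && p > 0 && chunk_size > 0);
    std::vector<Step> steps;
    long remaining = I;  // R_0 = I
    int i = 0;
    while (remaining > 0) {
        long c = std::min(chunk_size, remaining);  // last chunk may be smaller
        remaining -= c;                            // R_i = R_{i-1} - C_i
        ++i;
        int k = (i - 1) % p + 1;                   // stand-in for the requester
        steps.push_back({i, c, remaining, k});
    }
    return steps;  // steps.size() is N
}
```

With I = 10, p = 3 and a chunk size of 4, this produces N = 3 steps with C = (4, 4, 2) and R = (6, 2, 0). The chunk sizes always sum to I, and R_N = 0.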