Scheduling Loops on Parallel Processors: A Simple Algorithm with Close to Optimum Performance

Franco Gasperoni 1 and Uwe Schwiegelshohn 2

1 gasperon@cs.nyu.edu, CIMS-NYU, 251 Mercer Street, New York, NY 10012
2 uwe@watson.ibm.com, IBM Research Center, P.O. Box 218, Yorktown Heights, NY 10598

Abstract. This paper addresses the NP-hard problem of scheduling a cyclic set of interdependent operations, representing a program loop, when a finite number of identical processors is available. We give a simple and efficient algorithm producing close to optimum results. As a side result, our algorithm guarantees optimality when the number of processors is infinite.

1 Introduction

With advances in hardware technology, most of today's high performance microprocessors and computers offer some degree of parallelism. To take advantage of this concurrency, a common approach has been to employ parallelizing compilers which automatically extract the parallelism present in sequential programs. Most of the concurrency present in these applications is expressed in the form of loops, and considerable effort has been devoted to loop parallelization ([6, 7, 8, 9] to name a few). Because this problem is NP-hard when the target machine has finite resources (see section 2), loop parallelization algorithms have assumed infinitely many processors or have validated performance results only by means of benchmarking. Our goal is to provide a simple and efficient loop parallelization algorithm with close to optimum performance when the number of available processors is finite.

In our machine model time is viewed as a discrete rather than continuous entity. A time instant is called a "cycle" or "execution cycle". The machine itself consists of p identical processors. There is no preemption: once started, an operation has to be executed without interruption. Two types of processors will be considered: pipelined and unpipelined.
In the pipelined case p new operations can be initiated every cycle even though the operations initiated in the previous cycle may not have completed yet. In the unpipelined case a new operation can be initiated on a given processor only if the operation previously executing on that processor has completed. Note that if all operations take one cycle to execute the two machine models are equivalent.

A program loop is modeled as a doubly weighted directed graph G = (O, E, δ, d),