498 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 5, NO. 5, MAY 1994

Scheduling DAGs for Asynchronous Multiprocessor Execution

Brian A. Malloy, Errol L. Lloyd, and Mary Lou Soffa

Abstract— A new approach is given for scheduling a sequential instruction stream for execution "in parallel" on asynchronous multiprocessors. The key idea in our approach is to exploit the fine grained parallelism present in the instruction stream. In this context, schedules are constructed by a careful balancing of execution and communication costs at the level of individual instructions and their data dependencies. Three methods are used to evaluate our approach. First, several existing methods are extended to the fine grained situation considered here. Our approach is then compared to these methods using both static schedule length analyses and simulated executions of the scheduled code. In each instance, our method is found to provide significantly shorter schedules. Second, by varying parameters such as the speed of the instruction set and the speed/parallelism in the interconnection structure, simulation techniques are used to examine the effects of various architectural considerations on the executions of the schedules. These results show that our approach provides significant speedups in a wide range of situations. Third, schedules produced by our approach are executed on a two-processor Data General shared memory multiprocessor system. These experiments show that there is a strong correlation between our simulation results (those parameterized to "model" the Data General system) and these actual executions, and thereby serve to validate the simulation studies. Together, our results establish that fine grained parallelism can be exploited in a substantial manner when scheduling a sequential instruction stream for execution "in parallel" on asynchronous multiprocessors.
Index Terms— Concurrency, parallelism, multiprocessor, fine grained parallelism, schedule, asynchronous.

I. INTRODUCTION

OVER the past decade or so, changes in technology have provided the possibility for vast increases in computational speed and power through the exploitation of parallelism in program execution. Indeed, within certain computational domains, these technological changes have permitted solutions to computation intensive problems such as weather modeling, image processing, Monte Carlo simulations, and sparse matrix problems. An important part of this technology has focused on two approaches to parallelizing a sequential instruction stream: 1) exploiting fine grained parallelism, such as single statements, for VLIW machines [8]; and 2) exploiting coarse grained parallelism, such as loops and procedures, on vectorizable machines and on asynchronous multiprocessors.

Manuscript received May 26, 1992; revised May 13, 1993. B. A. Malloy is with the Department of Computer Science, Clemson University, Clemson, SC 29634 USA. E-mail: malloy@cs.clemson.edu. E. L. Lloyd is with the Department of Information and Computer Sciences, University of Delaware, Newark, DE 19716 USA. E-mail: elloyd@dewey.udel.edu. M. L. Soffa is with the Department of Computer Science, University of Pittsburgh, Pittsburgh, PA 15260 USA. E-mail: soffa@cs.pitt.edu. IEEE Log Number 9216779.

In the first approach, VLIW machines support the concurrent execution of multiple instruction streams and perform many operations per cycle. VLIW machines, however, also employ a single control unit, thereby permitting only one branch to be executed per cycle. Furthermore, while VLIW architectures perform well on programs dealing with scientific applications, their performance can degrade rapidly when faced with factors that decrease run-time predictability.
In particular [27], although general purpose programs typically have an abundance of fine grained parallelism, it is difficult to exploit that parallelism on a VLIW machine because general purpose programs are much less predictable than scientific applications.

In the second approach, existing techniques for asynchronous multiprocessors produce schedules at the coarse grained level. Due to their multiple control units, asynchronous multiprocessors have greater flexibility than VLIW machines. Unfortunately, it is frequently the case that a program segment may be unable to support coarse grained parallelism because it does not contain any loops, or because the data dependencies in its loops preclude such concurrentization. Thus, asynchronous multiprocessors, currently present in many installations, are frequently underutilized due to the absence of techniques to exploit fine grained parallelism in an asynchronous manner.

In this paper we offer an alternative approach to the exploitation of parallelism in programs by combining the fine grained approach of the VLIW with the flexibility of the asynchronous machine. In so doing, we provide a mechanism by which parallelism may be exploited in programs where factors are predictable (such as scientific applications), as well as in programs with unpredictable factors (such as general purpose applications). Thus, we focus on exploiting fine grained parallelism to schedule a sequential instruction stream for execution on an asynchronous multiprocessor system. Recall that the processors in an asynchronous multiprocessor execute independently and that communication is performed explicitly through asynchronous communication primitives. It follows that scheduling for such systems will necessarily involve packing together fine grained operations, including synchronization commands, for execution on the individual processors.
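The packing described above can be illustrated with a small sketch (this is not the paper's algorithm; the DAG, the node names, and the fixed processor assignment are all hypothetical): each data dependence whose producer and consumer land on different processors becomes an explicit signal/wait pair in the emitted instruction streams.

```python
# Illustrative sketch only: pack the nodes of a tiny instruction-level DAG
# onto two asynchronous processors, inserting explicit signal/wait
# synchronization primitives for every cross-processor data dependence.

dag = {              # edges: producer -> list of consumers
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
}
assign = {"a": 0, "b": 0, "c": 1, "d": 0}   # hypothetical node -> processor map

def pack(dag, assign, nprocs=2):
    """Emit one instruction stream per processor; a dependence that crosses
    processors is realized as a (signal, wait) pair on a named channel."""
    order = ["a", "b", "c", "d"]            # any topological order of dag
    streams = [[] for _ in range(nprocs)]
    for u in order:
        p = assign[u]
        # wait for every producer of u scheduled on a different processor
        for v, succs in dag.items():
            if u in succs and assign[v] != p:
                streams[p].append(f"wait({v}->{u})")
        streams[p].append(f"exec {u}")
        # signal every consumer of u scheduled on a different processor
        for w in dag[u]:
            if assign[w] != p:
                streams[p].append(f"signal({u}->{w})")
    return streams

for i, s in enumerate(pack(dag, assign)):
    print(f"P{i}: " + "; ".join(s))
```

Note that only the dependences a→c and c→d cross processors here, so only those two edges pay for synchronization; the a→b and b→d dependences are satisfied for free by program order on processor 0.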
The difficulty in such scheduling lies in balancing the desire to utilize all of the processors with the desire to minimize the amount of synchronization that is introduced by utilizing different processors for operations having data dependencies. We conclude this section by noting that although our work is directed toward the parallelization of entire programs, the focus of this paper is on the parallelization of straight line
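This tension can be made concrete with a minimal greedy list scheduler (an assumption-laden sketch, not the paper's method): every node costs `exe` cycles, and a dependence whose endpoints land on different processors adds `comm` cycles of delay. When `comm` is large relative to `exe`, the earliest-start rule keeps dependent operations on one processor even though a second processor sits idle.

```python
# Sketch of the load/synchronization trade-off via greedy list scheduling.
# Parameters exe, comm, and the example DAG below are illustrative choices.

def list_schedule(succ, exe=1, comm=3, nprocs=2):
    pred = {u: [] for u in succ}
    for u, vs in succ.items():
        for v in vs:
            pred[v].append(u)
    finish, proc = {}, {}
    free = [0] * nprocs                     # time each processor is next idle
    done = set()
    while len(done) < len(succ):
        # pick any ready node (all of its predecessors already scheduled)
        u = next(n for n in succ if n not in done
                 and all(q in done for q in pred[n]))
        # place it on the processor giving the earliest start time,
        # charging comm for each predecessor on a different processor
        best = None
        for p in range(nprocs):
            start = max([finish[q] + (comm if proc[q] != p else 0)
                         for q in pred[u]] + [free[p]])
            if best is None or start < best[0]:
                best = (start, p)
        start, p = best
        proc[u], finish[u] = p, start + exe
        free[p] = finish[u]
        done.add(u)
    return finish, proc

succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
finish, proc = list_schedule(succ)
print(max(finish.values()))                 # schedule length (makespan)
```

With `comm=3` the scheduler places all four operations on processor 0 and the makespan is 4, i.e., sequential execution wins; lowering `comm` makes spreading the independent nodes b and c across processors profitable instead.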