J. Parallel Distrib. Comput. 64 (2004) 578–590

Data-dependent loop scheduling based on genetic algorithms for distributed and shared memory systems

Jose L. Aguilar (a,*) and Ernst L. Leiss (b)

(a) CEMISID, Dpto. de Computación, Facultad de Ingeniería, Universidad de los Andes, Mérida 5101, Venezuela
(b) Department of Computer Science, University of Houston, Houston, TX 77204-3010, USA

Received 18 December 2000; revised 10 June 2003

Abstract

Many approaches have been described for the parallel loop scheduling problem on shared-memory systems, but little work has been done on the data-dependent loop scheduling problem (nested loops with loop-carried dependencies). In this paper, we propose a general model for the data-dependent loop scheduling problem on distributed as well as shared memory systems. In order to achieve load balancing and low runtime scheduling and communication overhead, our model is based on a loop task graph and the notion of critical path. In addition, we develop a heuristic algorithm, based on our model and on genetic algorithms, to test the reliability of the model. We test our approach on different scenarios and benchmarks. The results are very encouraging and suggest a future parallel compiler implementation based on our model.

© 2004 Elsevier Inc. All rights reserved.

Keywords: Loop scheduling; Loop-carried dependence; Parallel algorithms; Genetic algorithms; Performance optimization

1. Introduction

Generally, loops are the richest source of parallelism in parallel applications. One way to exploit this parallelism is to execute loop iterations in parallel on different processors, thereby reducing the running time. Consider the case in which a loop of m iterations is executed on a multiprocessor system with P processors. The goal of scheduling is to distribute these m iterations to the P processors in the most equitable manner and with the least amount of overhead.
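The scheduling goal just stated, distributing m iterations to P processors as evenly as possible, can be illustrated with a minimal sketch. The function name `block_schedule` and the block (contiguous-chunk) policy are illustrative choices, not the paper's algorithm:

```python
# Illustrative sketch: static block scheduling of m loop iterations onto
# P processors, so that the chunk sizes differ by at most one iteration.
# This shows the load-balancing goal only; it ignores dependencies and
# communication overhead, which the paper's model addresses.

def block_schedule(m, P):
    """Assign iterations 0..m-1 to P processors in contiguous blocks."""
    base, extra = divmod(m, P)          # every processor gets `base`,
    schedule, start = [], 0             # the first `extra` get one more
    for p in range(P):
        size = base + (1 if p < extra else 0)
        schedule.append(list(range(start, start + size)))
        start += size
    return schedule

print(block_schedule(10, 3))  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

With m = 10 and P = 3, no processor receives more than one iteration beyond any other, which is the "most equitable" distribution in the absence of dependencies.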
In the case where there are no data dependencies between tasks in different iterations (i.e., parallel loops), there is no need for synchronization. If there are data dependencies between tasks in different iterations, the communication delay may slow down the execution. Previous approaches have attempted to achieve the minimum completion time for the parallel loop scheduling problem only by distributing the workload as evenly as possible while minimizing the number of synchronization operations required and the communication overhead caused by access to non-local data on shared-memory systems [2,3,5,6,11,18]. Other authors have studied parallelism across iterations in order to handle loop-carried dependencies, proposing different techniques to improve this parallelism [4,8,12–14,16,17]: cyclo-compaction scheduling, loop pipelining, etc.

In this paper, we study the problem of scheduling a set of n nested loops, with data dependencies among the loops/iterations (that is, with loop-carried dependencies), on distributed or shared memory machines. Our technique falls into the static scheduling and software pipelining category. In the presence of data dependencies between tasks in different iterations, we need a better representation than the traditional task graph model to be able to express the data dependencies. We represent the data dependencies among tasks in loops using a more specific type of task graph, the loop task graph, whose nodes represent the tasks in different iterations and whose arcs represent the dependence relationships between tasks. We also need a scheduling approach that can exploit parallelism both within each iteration and among different loop iterations. We solve this problem using the loop unrolling technique and the critical path concept [1,5]. The basic idea is to unroll the loop to allow several iterations and tasks in the same

* Corresponding author. Fax: +587-440-2872. E-mail addresses: aguilar@cemisid.ing.ula.ve (J.L. Aguilar), coscel@cs.uh.edu (E.L. Leiss).

doi:10.1016/j.jpdc.2004.03.004
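The loop task graph and critical path described above can be sketched concretely. In this hedged example, the two tasks A and B per iteration, their costs, and the dependence distance of 1 are made-up values for illustration; the critical path is computed as the longest weighted path in the unrolled dependence DAG, which is the standard notion the paper builds on, not its specific algorithm:

```python
# Illustrative sketch: unroll a loop with a loop-carried dependence into a
# loop task graph (nodes = tasks per iteration, arcs = dependencies) and
# compute the critical path length by longest-path dynamic programming.
from collections import defaultdict

def critical_path_length(weight, edges):
    """Longest weighted path in a DAG.
    weight: dict mapping node -> execution cost; edges: list of (u, v) arcs."""
    succ, indeg = defaultdict(list), defaultdict(int)
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    ready = [n for n in weight if indeg[n] == 0]   # nodes with no predecessors
    finish = {n: weight[n] for n in weight}        # earliest finish times
    while ready:                                   # Kahn-style topological sweep
        u = ready.pop()
        for v in succ[u]:
            finish[v] = max(finish[v], finish[u] + weight[v])
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return max(finish.values())

# Unroll 3 iterations of a loop with tasks A and B per iteration:
# B_i depends on A_i (intra-iteration), and A_i depends on B_{i-1}
# (loop-carried dependence, distance 1). Costs: A = 2, B = 3.
weight, edges = {}, []
for i in range(3):
    weight[("A", i)] = 2
    weight[("B", i)] = 3
    edges.append((("A", i), ("B", i)))
    if i > 0:
        edges.append((("B", i - 1), ("A", i)))

print(critical_path_length(weight, edges))  # 15
```

Here the distance-1 loop-carried dependence serializes the unrolled graph into the chain A0, B0, A1, B1, A2, B2, so the critical path length is 3 × (2 + 3) = 15; a larger dependence distance would leave more iterations free to run in parallel.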