STAGGERED SCHEME: A LOOP ALLOCATION POLICY

A. R. Hurson*, Joford T. Lira*, B. Shirazi**, and K. Kavi**

*The Pennsylvania State University, Comp. Science and Engineering Dept., University Park, PA 16802
**The University of Texas at Arlington, Comp. Science and Engineering Dept., Arlington, TX 76019

Abstract. The run-time overhead of detecting and allocating dynamic parallelism in a program can easily offset the performance gain. To improve performance and reduce run-time overhead, it is necessary to develop an allocation scheme that detects dynamic parallelism at compile-time. However, the difficulty of accurately estimating run-time parallelism is a stumbling block in this direction. As a compromise, we propose an allocation policy that: (i) detects dynamic parallelism for a selected group of program constructs at compile-time, and (ii) allocates them to the estimated hardware resources in a staggered fashion using a set of heuristic rules.

I. INTRODUCTION

The dataflow model of computation was proposed as an alternative to the conventional control-flow model of computation. It explicitly addresses the issues of programmability, memory latency, and synchronization. This paper is mainly interested in the problem of detecting and allocating dynamic parallelism in a multithreaded dataflow architecture. As in control-flow multiprocessors, the issue of task partitioning and allocation is of major interest in multithreaded dataflow architectures. The goal is to exploit the maximum concurrency of a program graph by minimizing contention and communication for processing resources. Since loops are the largest source of dynamic parallelism, and recursion can be converted to loops, this paper is devoted to investigating dynamic parallelism within the scope of loops. A scheme for allocating doacross loops, called Staggered distribution, is introduced and its effectiveness is simulated and analyzed.

II.
DYNAMIC PARALLELISM

In our model, a dataflow graph is used to represent the body of a loop construct -- nodes represent the instructions and arcs represent the data dependences among the nodes. If node i of one iteration (l) is data dependent on the result of node j of the preceding iteration (l - 1) and node i precedes node j, then there is a lexically backward dependency (LBD) between iterations l and l - 1. Since there may be several data dependences between iterations, the value of t(i, j) (the sum of the execution times of all the nodes from i to j that contribute to the final result) is taken along the path with the longest length. If the distance of the LBD (L) is greater than one, the loop is partitioned into n/L independent loops. These partitions can then be allocated to the available processors P, with each partition assigned an equal number of processors, P/(n/L).

III. STAGGERED DISTRIBUTION SCHEME

The number of iterations in a loop can be determined either at compile-time or at run-time. In the first case, the number of iterations is fixed and known in advance; hence, the loop can be unrolled and allocated using the Staggered distribution. In the latter case, unrolling at compile-time could result in inefficient resource utilization. Instead, a partial allocation/distribution is performed at compile-time, and the Staggered distribution is applied later, at run-time.

III.A. Distribution/Allocation of the Unrolled Graph

In this case, we assume that the delay t(i, j) is known or can be determined at compile-time. If k is the fraction of the delay t(i, j) relative to the execution time of an iteration T, k = t(i, j)/T, then the fraction of T that can be executed concurrently in all iterations is (1 - k). In a dataflow processor, each instruction constitutes an independent thread, and only non-suspended threads are scheduled for execution -- a processor can switch between different iterations as long as some nodes are enabled.
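The quantities t(i, j) and k defined above can be computed directly from the per-node execution times of one iteration. The following is a minimal sketch, not the paper's implementation: the adjacency-map graph representation, the node names, and the execution times are illustrative assumptions.

```python
# Hedged sketch: compute t(i, j) as the longest execution-time path
# from node i to node j in the dataflow graph of one iteration,
# then the serial fraction k = t(i, j) / T.

def longest_path(graph, times, src, dst):
    """Longest total execution time along any src -> dst path in a DAG."""
    if src == dst:
        return times[dst]
    best = max((longest_path(graph, times, v, dst)
                for v in graph.get(src, [])),
               default=float("-inf"))
    return times[src] + best

# Illustrative iteration body: i feeds j through a; b also feeds j.
graph = {"i": ["a"], "a": ["j"], "b": ["j"]}
times = {"i": 2, "a": 3, "j": 1, "b": 4}

T = sum(times.values())                      # execution time of one iteration
t_ij = longest_path(graph, times, "i", "j")  # t(i, j) along the longest path
k = t_ij / T                                 # serial (LBD) fraction
print(t_ij, k, 1 - k)                        # (1 - k) overlaps across iterations
```

Here the path i -> a -> j contributes 2 + 3 + 1 = 6 time units out of T = 10, so k = 0.6 and only 40% of each iteration can be overlapped with other iterations.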
Therefore, if loop iterations are evenly distributed among the processors, each processor takes the same amount of time to finish executing the (1 - k) fractions of its assigned loop iterations. However, each processor PE_i (1 < i ≤ n) has to wait for processor PE_{i-1} to finish executing the k fraction and send the partial results before PE_i can continue.
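The staggered wait described above can be sketched as a simple timing model: PE_i may begin only once PE_{i-1} has completed its serial k·T prefix, after which the remaining (1 - k)·T overlaps freely with the other processors. This is a sketch under the assumption of one iteration per processor; the parameter values are illustrative.

```python
# Hedged sketch of the staggered timing: n iterations, one per PE.
# PE_i starts after PE_{i-1} finishes its serial k*T prefix.

def staggered_finish_times(n, T, k):
    """Finish time of each PE; the last entry works out to (n-1)*k*T + T."""
    finish = []
    serial_done = 0.0                # time PE_{i-1} finishes its k fraction
    for i in range(n):
        start = serial_done          # PE_i may begin once the LBD value arrives
        serial_done = start + k * T  # its own k fraction becomes available
        finish.append(start + T)     # the (1 - k)*T remainder overlaps freely
    return finish

T, k, n = 10.0, 0.3, 4
print(staggered_finish_times(n, T, k))
```

Each successive processor finishes k·T later than its predecessor, so the serial fraction k directly bounds how much of the loop's parallelism the staggered scheme can exploit.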