Exploiting Parallelism with Dependence-Aware Scheduling

Xiaotong Zhuang, Alexandre E. Eichenberger, Yangchun Luo*, Kevin O'Brien, Kathryn O'Brien
IBM T.J. Watson Research, Yorktown Heights, NY
{xzhuang, alexe, caomhin, kmob}@us.ibm.com
*Dept. of Computer Science, Univ. of Minnesota, MN
yluo@cs.umn.edu

Abstract

It is well known that a large fraction of applications cannot be parallelized at compile time because of unpredictable data dependences caused by indirect memory accesses and/or memory accesses guarded by data-dependent conditional statements. A significant body of prior work attempts to parallelize such applications using runtime data-dependence analysis and scheduling. Performance is highly dependent on the ratio of the dependence-analysis overheads to the actual amount of parallelism available in the code. We have found that, when evaluating applications on a modern multicore processor, the overheads are often high and the available parallelism is often low. We propose a novel software-based approach called dependence-aware scheduling to parallelize loops with unknown data dependences. Unlike prior work, our main goal is to reduce the negative impact of the dependence computation, so that when there is no opportunity for speedup, the code still runs without much slowdown; when there is such an opportunity, dependence-aware scheduling can yield substantial speedups. Our results indicate that dependence-aware scheduling can greatly improve performance, with speedups of up to 4x, for a number of computation-intensive applications. Furthermore, the results also show negligible slowdowns in a stress test where parallelism is continuously detected but not exploited.

Keywords: Partial Parallelism, Runtime Dependence Analysis, Inspector/Executor, Multicore, Thread Scheduling.

1. Introduction

Current parallelizing compilers aim to extract parallelism from programs with regular, statically analyzable memory access patterns.
However, a significant number of applications have memory access patterns that are not readily analyzable at compile time. For instance, pointer dereferences and indirect memory references are often difficult to capture through static analysis; array subscripts can involve computations that cannot be resolved statically; and conditional branches can selectively expose or hide memory accesses, leading to dynamic memory access sequences.

Thread Level Speculation (TLS) [2][3][4][5][23] aims to parallelize codes with potential memory access conflicts using dedicated hardware mechanisms to detect data-dependence conflicts and to roll back state when violations occur. TLS does not require intensive compiler analysis, either statically or at runtime, and can be applied to arbitrary code. However, it requires specialized hardware support that is not yet available on commodity multicores.

Software-based approaches to Thread Level Speculation have been proposed [16][17][18][19][20][21][22], where data-dependence conflict detection is implemented purely in software. When conflicts are detected, the state of one or more threads is rolled back, again in software. Compiler techniques can be used to optimize and reduce the conflict checks and rollbacks. While software TLS can be successful, its overheads make it generally ill-suited for applications that have frequent, unpredictable dependences among consecutive iterations.

A significant body of work [6][7][8][9][10][11][12][13][14][15] has proposed runtime techniques that detect data dependences and exploit the available parallelism by scheduling work accordingly. As with TLS, the performance of such techniques is highly dependent on the ratio between the cost of data-dependence analysis and the benefit achieved by exploiting parallelism.

1 This research is funded by US government contract number B554331.
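To make the problem concrete, consider a hypothetical kernel (not drawn from any particular benchmark) in which every iteration writes through an index array. Whether two iterations conflict depends entirely on the runtime contents of idx[], so a static compiler must conservatively serialize the loop:

```c
#include <stddef.h>

/* Hypothetical kernel: iteration i writes a[idx[i]]. Whether
 * iterations i and j conflict depends on the runtime values of
 * idx[i] and idx[j], which a static compiler cannot resolve, so
 * the loop must be conservatively treated as sequential. */
void scale_indirect(double *a, const int *idx, double c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        a[idx[i]] = a[idx[i]] * c + 1.0;  /* may or may not conflict */
    }
}
```

If idx[] happens to be a permutation, every iteration is independent and the loop is fully parallel; if all entries are equal, the loop is fully serial. Only runtime analysis can distinguish the two cases.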
Unfortunately, many applications do not currently benefit from these techniques on current multicores, because dependence-computation overheads are often too high compared to the sequential execution time of the original code, and/or the applications exhibit only limited amounts of parallelism. The causes of high overheads can be classified as follows. First, dependence analysis often uses large data structures to precisely track all (possibly cross-iteration) memory references and/or memory locations touched by a loop, resulting in memory footprints that can be much larger than that of the original code. Since today's multi-threaded multicores share large fractions of their memory hierarchy, they are particularly sensitive to larger memory footprints. Second, computing dependences often involves synchronization and/or expensive sorting algorithms. Third, the actual computation sometimes cannot start until the dependence computation has completed.

As a result of these high overheads, prior work is applicable only to loops whose data dependences are complicated enough that the compiler cannot analyze them at compile time, yet simple enough that the dependence overheads remain manageable. The loop must also run long enough, and be parallel enough, to amortize the initialization overhead. Consequently, these drawbacks can greatly offset the benefits, or even cause significant slowdowns, when dealing with code whose dependences are not known in advance.

In this paper, we present a novel software-based approach called dependence-aware scheduling to parallelize code with unknown dependences. Our main goal is to significantly reduce the negative impact of the dependence computation, so that (1) when there is no parallelism, the code still runs with nearly no slowdown; and (2) when there is parallelism, dependence-aware scheduling can yield significant speedups.
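The overhead sources above can be seen in a minimal inspector sketch in the inspector/executor style (a hypothetical illustration, not the scheme proposed in this paper; it assumes each iteration i touches only location idx[i]). The inspector assigns each iteration a wavefront number such that iterations in the same wavefront touch disjoint locations and could run in parallel; the shadow array last_wf[], with one entry per memory location, is exactly the kind of auxiliary data structure that inflates the memory footprint:

```c
#include <stdlib.h>

/* Hypothetical inspector: assigns each iteration a wavefront number.
 * Iterations with the same wavefront touch disjoint locations and
 * could be executed in parallel by an executor phase. The shadow
 * array last_wf[] (one entry per memory location) illustrates the
 * memory-footprint overhead of precise dependence tracking. */
int inspect(const int *idx, int n, int nloc, int *wf_out) {
    int *last_wf = calloc(nloc, sizeof(int));  /* shadow structure */
    int max_wf = 0;
    for (int i = 0; i < n; i++) {
        int w = last_wf[idx[i]];   /* must follow the last iteration
                                      that touched this location */
        wf_out[i] = w;
        last_wf[idx[i]] = w + 1;
        if (w + 1 > max_wf) max_wf = w + 1;
    }
    free(last_wf);
    return max_wf;  /* number of wavefronts (critical-path length) */
}
```

Note that the executor cannot start until this pass completes, which is the third overhead source listed above.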
In our approach, a main thread runs the original code sequentially, without regard to the results of the dependence computation. Meanwhile, a number of worker threads calculate dependences. A slice function is derived from the