What Can We Gain by Unfolding Loops?

Litong Song, Krishna Kavi
Department of Computer Science, University of North Texas, Denton, Texas, 76203, USA

ABSTRACT

Loops in programs are the source of many optimizations for improving program performance, particularly on modern high-performance architectures as well as vector and multithreaded systems. Techniques such as loop invariant code motion, loop unrolling and loop peeling have demonstrated their utility in compiler optimizations. However, many of these techniques can be used only in very limited cases, when the loops are “well-structured” and easy to analyze. For instance, loop invariant code motion works only when invariant code is inside loops; loop unrolling and loop peeling work effectively when the array references are either constants or affine functions of the index variable. It is our contention that many opportunities are overlooked by limiting these optimizations to “well-structured” loops. In many cases, even “badly-structured” loops may be transformed into “well-structured” loops. As a case in point, we show how some loop-dependent code can be transformed into loop-independent code by transforming the loops. The technique described in this paper relies on unfolding the loop for several initial iterations, so that more opportunities may be exposed for many other existing compiler optimization techniques such as loop invariant code motion, loop peeling, loop unrolling and so on.

Key words: loop quasi-invariant code motion, loop peeling, loop unrolling, quasi-invariant variable, quasi-index variable.

1 Introduction

Loops in programs are the source of many optimizations for improving program performance, particularly on modern high-performance architectures as well as vector and multithreaded systems. Techniques such as loop invariant code motion, loop peeling and loop unrolling have demonstrated their utility in compiler optimizations.
However, many of these techniques can be used only in very limited cases, when the loops are “well-structured” and easy to analyze. For instance, loop invariant code motion works only when invariant code is inside loops; loop unrolling and loop peeling work effectively when the loop indices and array references are either constants or affine functions. Let us first give a brief review of a few common loop optimization techniques, namely loop invariant code motion, loop unrolling and loop peeling, and discuss the limitations of these techniques.

1.1 Overview of Three Loop Optimization Techniques

Loop invariant code motion is a well-known loop transformation technique used by optimizing compilers. When a computation in a loop does not change during the dynamic execution of the loop, we can hoist this computation out of the loop to improve execution-time performance.

Modern computer systems exploit both instruction level parallelism (ILP) and thread (or task) level parallelism (TLP). Superscalar and VLIW systems rely on ILP, while multithreaded and multiprocessor systems rely on TLP. In order to fully benefit from ILP or TLP, compilers must perform complex analyses to identify and schedule code for these architectures. Typically, compilers focus on loops for finding parallelism in programs [26] [27] [28]. Sometimes it is necessary to rewrite (or reformat) loops so that loop iterations become independent of each other, permitting parallelism. Loop peeling is one such technique [3] [15] [21]. When a loop is peeled, a small number of early iterations are removed from the loop body and executed separately, before the start of the loop. The main purpose of this technique is to remove dependencies created by the early iterations on the remaining iterations, thereby enabling parallelization.
In cases where an array is used to simulate a cylindrical coordinate system, in which the left edge of the array must be adjacent to its right edge, a technique known as a wraparound variable is often used. The loop in Fig. 1(a) is not parallelizable because the variable wrap is neither a constant nor a linear function of the index variable i. Peeling off the first iteration allows the rest of the loop to be vectorized, as shown in Fig. 1(b). The loop in Fig. 1(b) can be rewritten as: a(2:n) = b(2:n) + b(1:n-1).

    for (i = 1; i <= n; i++) {
      a[i] = b[i] + b[wrap];
      wrap = i;
    }

    (a) A source loop

    if (1 <= n) {
      a[1] = b[1] + b[wrap];
      wrap = 1;
    }
    for (i = 2; i <= n; i++) {
      a[i] = b[i] + b[i-1];
    }

    (b) The resulting code after peeling the first iteration

Fig. 1. The second example for loop peeling

Loop unrolling is a technique that replicates the body of a loop a number of times, called the unrolling factor u, and iterates by step u instead of step 1. It is a fundamental technique for generating the efficient instruction sequences required to exploit ILP on architectures such as VLIW and superscalar processors. Loop unrolling can improve performance by (i) reducing loop overhead; (ii) increasing instruction level parallelism; and (iii) improving register, data cache, or TLB locality. Fig. 2 shows an example of loop unrolling. Loop overhead is cut in half because an additional iteration is performed before the test and branch at the end of the loop. Instruction parallelism is increased because the first and second assignments can be executed in parallel.

    for (i = 2; i <= n; i++) {
      a[i] = a[i-2] + b[i];
    }

    (a) A source loop

    for (i = 2; i <= n-1; i = i+2) {
      a[i] = a[i-2] + b[i];
      a[i+1] = a[i-1] + b[i+1];
    }
    if (mod(n-1, 2) == 1) {
      a[n] = a[n-2] + b[n];
    }

    (b) The resulting code after loop unrolling

Fig. 2. An example of loop unrolling

ACM SIGPLAN Notices 26 vol 39(2) Feb 2004