Measuring the Parallelism Available for Very Long Instruction Word Architectures

ALEXANDRU NICOLAU AND JOSEPH A. FISHER

Abstract - Long instruction word architectures, such as attached scientific processors and horizontally microcoded CPU's, are a popular means of obtaining code speedup via fine-grained parallelism. The falling cost of hardware holds out the hope of using these architectures for much more parallelism. But this hope has been diminished by experiments measuring how much parallelism is available in the code to start with. These experiments implied that even if we had infinite hardware, long instruction word architectures could not provide a speedup of more than a factor of 2 or 3 on real programs.

These experiments measured only the parallelism within basic blocks. Given the machines that prompted them, it made no sense to measure anything else. Now it does. A recently developed code compaction technique, called trace scheduling [9], could exploit parallelism in operations even hundreds of blocks apart. Does such parallelism exist?

In this paper we show that it does. We did analogous experiments, but we disregarded basic block boundaries. We found huge amounts of parallelism available. Our measurements were made on standard Fortran programs in common use. The actual programs tested averaged about a factor of 90 parallelism. It ranged from about a factor of 4 to virtually unlimited amounts, restricted only by the size of the data.

An important question is how much of this parallelism can actually be found and used by a real code generator. In the experiments, an oracle is used to resolve dynamic questions at compile time. It tells us which way jumps went and whether indirect references are to the same or different locations. Trace scheduling attempts to get the effect of the oracle at compile time with static index analysis and dynamic estimates of jump probabilities. We argue that most scientific code is so static that the oracle is fairly realistic. A real trace-scheduling code generator [7] might very well be able to find and use much of this parallelism.

Index Terms - Memory antialiasing, microcode, multiprocessors, parallelism, trace scheduling, VLIW (very long instruction word) architectures.

Manuscript received February 17, 1984; revised July 13, 1984. This work was supported in part by the National Science Foundation under Grants MCS-81-06181 and MCS-81-07646, in part by the Office of Naval Research under Grant N00014-82-K-0184, and in part by the Army Research Office under Grant DAAG29-81-K-0171.

A. Nicolau is with the Department of Computer Science, Cornell University, Ithaca, NY 14853.

J. A. Fisher is with the Department of Computer Science, Yale University, New Haven, CT 06520.

I. INTRODUCTION

IN this paper we describe experiments we have done to empirically measure the maximum parallelism available to very long instruction word (VLIW) architectures. The most familiar examples of VLIW architectures are horizontally microcoded CPU's and some very popular specialized scientific processors, such as the Floating Point Systems AP-120b and FPS-164. Very long instruction word architectures take advantage of fine-grained parallelism to speed up execution time. However, in contrast to vector machines and traditional multiprocessors, no currently available machines use this architecture for great amounts of parallelism.
A user in any practical environment is doing well if he obtains a factor of 2 or 3 speedup over sequential execution. Why are these machines not dramatically more parallel? One probable reason is that the popular wisdom has it that a factor of 2 or 3 is all the fine-grained parallelism that is there to exploit. A chief contributor to this belief was a set of experiments done in the early 1970's. They measured the fine-grained parallelism available under the hypothesis that there was infinite hardware available to execute whatever parallelism was found. Unfortunately, all they could find was a factor of 2 or 3.

We believe these experiments, done for a somewhat different domain than VLIW architectures, were far too pessimistic. We will explain why shortly. In this paper we report on experiments we have done which we think more directly address this question for VLIW architectures.

A. Very Long Instruction Word Architectures

The defining properties of VLIW architectures are:
1) there is one central control unit issuing a single wide instruction per cycle;
2) each wide instruction consists of many independent operations;
3) each operation requires a small, statically predictable number of cycles to execute. Operations may be pipelined.

Restrictions 1) and 3) distinguish these from typical multiprocessor organizations.¹

Since it is nearly impossible to tightly couple very many highly complex operations, the underlying sequential architecture of a VLIW will invariably be a reduced instruction set computer, or RISC [16]. Thus, the instruction set will typically consist of register-to-register operations, with memory references being simple loads/stores without complex addressing modes. This will greatly simplify the scheduling part of the compiler.

VLIW machines might have large numbers of identical functional units. When they do, we do not require that they be connected by some regular and concise scheme such as shuffles or cube connections. A tabular description of the somewhat ad hoc interconnections suffices for our purposes.² This makes the use of VLIW machines very different from machines with regular interconnection structures and/or complex hardware data structures.

¹VLIW architectures do not fit neatly into many of the taxonomies of parallel processor organization.

²We rely heavily on the compiler to schedule data movements and (when feasible) parallel memory fetches as well as operation execution. Thus, the interconnections between the various processing units need not be very regular, as we expect only the compiler (not humans) to have to deal with them.
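Because each VLIW operation has a statically predictable latency, the parallelism available to such a machine can be estimated directly from a program's dynamic operation trace: with the oracle resolving every branch and memory address, each operation is placed in the earliest cycle permitted by the operations whose results it uses, and the parallelism is the ratio of trace length to schedule length. The C fragment below is a minimal sketch of that style of measurement, not the authors' tool; the one-cycle operations, the renaming assumption, and the Op trace format are illustrative simplifications.

/*
 * Minimal sketch of an oracle-based parallelism measurement.
 * Assumptions (not from the paper): the dynamic trace is already known,
 * so every branch outcome and memory address is resolved; each operation
 * takes one cycle; anti- and output dependences are removed by renaming,
 * so only true (flow) dependences constrain the schedule; hardware is
 * unbounded.
 */
#include <stdio.h>

#define MAX_VALUES 4096              /* distinct registers/memory words   */

typedef struct {
    int nsrc;                        /* number of values read             */
    int src[2];                      /* values read                       */
    int dst;                         /* value written, or -1 if none      */
} Op;

/* Place each operation of the trace in the earliest cycle after the
 * producers of the values it reads; parallelism = trace length / cycles. */
double available_parallelism(const Op *trace, int n)
{
    static int ready[MAX_VALUES];    /* cycle in which each value is made */
    int last_cycle = 0;

    for (int v = 0; v < MAX_VALUES; v++)
        ready[v] = 0;

    for (int i = 0; i < n; i++) {
        int cycle = 1;                            /* earliest legal cycle */
        for (int s = 0; s < trace[i].nsrc; s++)
            if (ready[trace[i].src[s]] + 1 > cycle)
                cycle = ready[trace[i].src[s]] + 1;
        if (trace[i].dst >= 0)
            ready[trace[i].dst] = cycle;
        if (cycle > last_cycle)
            last_cycle = cycle;
    }
    return last_cycle ? (double)n / last_cycle : 0.0;
}

int main(void)
{
    /* Toy trace: two independent adds feed a third, so three operations
     * fit in two cycles and the measured parallelism is 1.5.             */
    Op trace[] = {
        { 2, {0, 1}, 2 },            /* v2 = v0 + v1 */
        { 2, {3, 4}, 5 },            /* v5 = v3 + v4 */
        { 2, {2, 5}, 6 },            /* v6 = v2 + v5 */
    };
    printf("available parallelism = %.2f\n",
           available_parallelism(trace, 3));
    return 0;
}

The same greedy placement works whether the operations come from one basic block or from the whole dynamic trace; restricting the ready-time bookkeeping to block boundaries reproduces the basic-block-only measurements the paper argues against.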