EXPLORING PERFORMANCE TRADEOFFS FOR CLUSTERED VLIW ASIPS

Margarida F. Jacome, Gustavo de Veciana and Viktor Lapinskii
Department of Electrical and Computer Engineering
University of Texas, Austin, TX 78712
{jacome,gustavo,lapinski}@ece.utexas.edu

ABSTRACT

VLIW ASIPs provide an attractive solution for increasingly pervasive real-time multimedia and signal processing embedded applications. In this paper we propose an algorithm to support trade-off exploration during the early phases of the design/specialization of VLIW ASIPs with clustered datapaths. For the purposes of an early exploration step, we define a parameterized family of clustered datapaths D(m, n), where m and n denote interconnect capacity and cluster capacity constraints on the family. Given a kernel, the proposed algorithm explores the space of feasible clustered datapaths and returns: a datapath configuration; a binding and scheduling for the operations; and a corresponding estimate of the best achievable latency over the specified family. Moreover, we show how the parameters m and n, as well as a target latency optionally specified by the designer, can be used to effectively explore trade-offs among delay, power/energy, and latency. Extensive empirical evidence is provided showing that the proposed approach is strikingly effective at attacking this complex optimization problem.

1. INTRODUCTION

Real-time multimedia and signal processing embedded applications often spend most of their cycles executing a few time-critical code segments (kernels) with well-defined characteristics, making them amenable to processor specialization. Moreover, these computation-intensive kernels often exhibit a high degree of inherent instruction-level parallelism (ILP). Thus, Very Long Instruction Word (VLIW) Application Specific Instruction Set Processors (ASIPs) provide an attractive solution for such embedded applications.
Traditionally, the datapaths of VLIW machines have been based on a single register file shared by all functional units (FUs). The central register file provides internal storage as well as switching, i.e., interconnection among the FUs and to/from the memory system. Unfortunately, this simple organization does not scale well with the large number of functional units typically required to take advantage of the ILP present in the embedded applications of interest. Indeed, it has been shown in [14] that, for N FUs connected to a register file, the area of the register file grows as N^3, the delay as N^{3/2}, and the power dissipation as N^3. In short, as the number of FUs increases, internal storage and communication between FUs quickly become the dominant, if not prohibitive, factor in terms of delay, power dissipation, and area.

A key observation is that the delay, power dissipation, and area associated with the storage organization can be dramatically reduced by restricting the connectivity between FUs and registers, so that each FU can only read and write from/to a limited subset of registers [14]. Thus a key dimension of VLIW ASIP specialization is clustering, i.e., the development of datapaths comprised of clusters of FUs connected to local storage (the cluster's register file). Although by moving from a centralized to a distributed register file organization one can reap significant delay, power, and area savings, this type of specialization may come at a cost. One may have to transfer data among these register files (i.e., among datapath clusters), possibly resulting in increased latency.

[This work is supported in part by NSF CAREER Award MIP-9624321, by NSF Grant CCR-9901255, and by Grant ATP-003658-0649 of the Texas Higher Education Coordinating Board.]

More concretely, consider a family of clustered datapaths wherein each cluster has no more than a given number of FUs, irrespective of type.
We shall refer to this constraint as a cluster capacity constraint. Intuitively, as the cluster capacity decreases (and thus the number of ports and the size of the associated register file decrease), one expects combinational delay as well as power dissipation to decrease, while the number of clock cycles (latency) required to execute a given kernel increases. In the limit, when comparing a clustered machine to a hypothetical centralized machine with the same number of FUs, one expects to be able to sustain higher clock rates in the clustered machine, but at the cost of increased latency, due to the need to move data among register files.

Moreover, as cluster capacity decreases, one also expects power dissipation to decrease with respect to the centralized machine. Indeed, clustered machines would have local register files that have fewer ports and are smaller than the single register file of the centralized machine, thus achieving less costly (local) switching inside each cluster. Unfortunately, switching may also be needed among clusters, i.e., there may be a need to perform move (or copy) operations across the register files of different clusters, with a corresponding undesirable effect on energy consumption.

Note that while performance and power/energy are major concerns in embedded applications, silicon area (per se) is not necessarily a concern, since with today's levels of integration one can cost-effectively place large numbers of transistors on a single chip [1, 13]. Thus, in exploring the design space with respect to the impact of different cluster capacities on performance and power/energy, one can allow for an unbounded number of clusters, at least during the early phases of the exploration. In fact, as observed in [5], in signal processing applications with high ILP, in order to achieve high throughput one should expect datapaths with a large number of functional resources and low resource sharing.
Thus, placing an upper bound on the total number of functional resources was considered inadequate. One should, however, consider a constraint on the interconnect capacity, since congestion (during data transfers across clusters) may lead to major performance penalties, and the interconnect structure has a significant impact on the relevant figures of merit discussed above (i.e., delay and power/energy).

In summary, when considering the specialization of a datapath to a given kernel, one should seek solutions with a (possibly large) number of clusters working (quasi-)independently. Such configurations are the "ideal" ones, in that they decrease power and delay by taking advantage of locality in the computations, while incurring no (significant) latency/energy penalties due to switching.
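The register-file scaling argument above (area and power growing as N^3, delay as N^{3/2}, per [14]) can be made concrete with a back-of-the-envelope comparison. The sketch below is illustrative only: the exponents come from the text, but the assumption that a k-cluster machine behaves like k independent register files serving N/k FUs each, with inter-cluster moves ignored, is ours.

```python
# Toy comparison of centralized vs. clustered register-file costs,
# using the scaling exponents cited from [14]: for N FUs sharing one
# register file, area ~ N^3, delay ~ N^(3/2), power ~ N^3.
# Assumption (ours): a machine with k clusters of N/k FUs each pays
# the per-cluster area/power cost k times, and its combinational
# delay is set by a single (smaller) cluster.  Inter-cluster move
# operations and their energy cost are deliberately ignored here.

def centralized_cost(n_fus):
    """Relative cost of one register file serving n_fus FUs."""
    return {
        "area": n_fus ** 3,
        "delay": n_fus ** 1.5,
        "power": n_fus ** 3,
    }

def clustered_cost(n_fus, n_clusters):
    """Relative cost of n_clusters register files, each serving
    n_fus / n_clusters FUs (idealized: no inter-cluster traffic)."""
    per_cluster = centralized_cost(n_fus // n_clusters)
    return {
        "area": n_clusters * per_cluster["area"],
        "delay": per_cluster["delay"],   # critical path in one cluster
        "power": n_clusters * per_cluster["power"],
    }

if __name__ == "__main__":
    central = centralized_cost(8)
    clustered = clustered_cost(8, 4)   # four 2-FU clusters
    for metric in ("area", "delay", "power"):
        ratio = central[metric] / clustered[metric]
        print(f"{metric}: centralized={central[metric]:.1f} "
              f"clustered={clustered[metric]:.1f} ratio={ratio:.1f}x")
```

For eight FUs split into four 2-FU clusters, the model predicts a 16x reduction in register-file area and power and an 8x reduction in combinational delay, which is the intuition behind sustaining higher clock rates in the clustered machine; the latency and energy of the cross-cluster moves that this sketch ignores are exactly the penalties the paper's exploration algorithm must weigh.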
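The early exploration step described above sweeps the family D(m, n) over interconnect capacities m and cluster capacities n, optionally pruning by a designer-specified target latency. The outer loop of such a sweep can be sketched as follows; note that the latency estimator here is a deliberately naive stand-in (operations divided among clusters of capacity n, moves divided among m interconnect resources), not the paper's binding and scheduling algorithm, and the kernel summary it consumes is a hypothetical simplification.

```python
# Hypothetical outer loop for early design-space exploration over the
# family D(m, n).  estimate_latency() is a placeholder: the paper's
# actual algorithm binds and schedules the kernel's operations on a
# feasible clustered datapath; here a kernel is crudely summarized as
# (total operations, cross-cluster moves), an assumption of this sketch.

def estimate_latency(kernel, m, n):
    """Toy latency model: ops limited by cluster capacity n, data
    transfers limited by interconnect capacity m."""
    ops, moves = kernel
    return ops / n + moves / m

def explore(kernel, m_range, n_range, target_latency=None):
    """Return (latency, m, n) for the best feasible configuration,
    or None if no configuration meets the optional target latency."""
    best = None
    for m in m_range:
        for n in n_range:
            latency = estimate_latency(kernel, m, n)
            if target_latency is not None and latency > target_latency:
                continue  # infeasible for the designer's target
            if best is None or latency < best[0]:
                best = (latency, m, n)
    return best

if __name__ == "__main__":
    kernel = (64, 16)  # 64 operations, 16 cross-cluster moves (made up)
    print(explore(kernel, m_range=range(1, 5), n_range=range(1, 5)))
```

Even with this toy estimator, the loop exhibits the trade-off the paper targets: tightening m or n (smaller, cheaper register files and interconnect) raises the estimated latency, and a target latency carves out the feasible region of the (m, n) space.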