EXPLORING PERFORMANCE TRADEOFFS FOR CLUSTERED VLIW ASIPS

Margarida F. Jacome, Gustavo de Veciana and Viktor Lapinskii
Department of Electrical and Computer Engineering
University of Texas, Austin, TX 78712
{jacome,gustavo,lapinski}@ece.utexas.edu

ABSTRACT

VLIW ASIPs provide an attractive solution for increasingly pervasive real-time multimedia and signal processing embedded applications. In this paper we propose an algorithm to support trade-off exploration during the early phases of the design/specialization of VLIW ASIPs with clustered datapaths. For the purposes of an early exploration step, we define a parameterized family of clustered datapaths D(m, n), where m and n denote interconnect capacity and cluster capacity constraints on the family. Given a kernel, the proposed algorithm explores the space of feasible clustered datapaths and returns: a datapath configuration; a binding and scheduling for the operations; and a corresponding estimate of the best achievable latency over the specified family. Moreover, we show how the parameters m and n, as well as a target latency optionally specified by the designer, can be used to effectively explore trade-offs among delay, power/energy, and latency. Extensive empirical evidence is provided showing that the proposed approach is strikingly effective at attacking this complex optimization problem.

1. INTRODUCTION

Real-time multimedia and signal processing embedded applications often spend most of their cycles executing a few time-critical code segments (kernels) with well-defined characteristics, making them amenable to processor specialization. Moreover, these computation-intensive kernels often exhibit a high degree of inherent instruction-level parallelism (ILP). Thus, Very Long Instruction Word (VLIW) Application Specific Instruction Set Processors (ASIPs) provide an attractive solution for such embedded applications.
Traditionally, the datapaths of VLIW machines have been based on a single register file shared by all functional units (FUs). The central register file provides internal storage as well as switching, i.e., interconnection among the FUs and to/from the memory system. Unfortunately, this simple organization does not scale well with the large number of functional units typically required to take advantage of the ILP present in the embedded applications of interest. Indeed, it has been shown in [14] that, for N FUs connected to a register file, the area of the register file grows as N^3, the delay as N^{3/2}, and the power dissipation as N^3. In short, as the number of FUs increases, internal storage and communication between FUs quickly become the dominant, if not prohibitive, factor in terms of delay, power dissipation, and area.

A key observation is that the delay, power dissipation, and area associated with the storage organization can be dramatically reduced by restricting the connectivity between FUs and registers, so that each FU can only read and write from/to a limited subset of registers [14]. Thus a key dimension of VLIW ASIP specialization is clustering, i.e., the development of datapaths comprised of clusters of FUs connected to local storage (the cluster's register file). Although by moving from a centralized to a distributed register file organization one can reap significant delay, power, and area savings, this type of specialization may come at a cost. One may have to transfer data among these register files (i.e., among datapath clusters), possibly resulting in increased latency.

[This work is supported in part by NSF CAREER Award MIP-9624321, by NSF Grant CCR-9901255, and by Grant ATP-003658-0649 of the Texas Higher Education Coordinating Board.]

More concretely, consider a family of clustered datapaths wherein each cluster has no more than a given number of FUs, irrespective of type.
We shall refer to this constraint as a cluster capacity constraint. Intuitively, as the cluster capacity decreases (and thus the number of ports and the size of the associated register file decrease), one expects combinational delay as well as power dissipation to decrease, while the number of clock cycles (latency) required to execute a given kernel increases. In the limit, when comparing a clustered machine to a hypothetical centralized machine with the same number of FUs, one expects to be able to sustain higher clock rates in the clustered machine, but at the cost of increased latency, due to the need to move data among register files.

Moreover, as cluster capacity decreases, one also expects power dissipation to decrease with respect to the centralized machine. Indeed, clustered machines would have local register files that have fewer ports and are smaller than the single register file of the centralized machine, thus achieving less costly (local) switching inside each cluster. Unfortunately, switching may also be needed among clusters, i.e., there may be a need to perform move (or copy) operations across the register files of different clusters, with a corresponding undesirable effect on energy consumption.

Note that while performance and power/energy are major concerns in embedded applications, silicon area (per se) is not necessarily a concern, since with today's levels of integration one can cost-effectively place large numbers of transistors on a single chip [1, 13]. Thus, in exploring the design space with respect to the impact of different cluster capacities on performance and power/energy, one can allow for an unbounded number of clusters, at least during the early phases of the exploration. In fact, as observed in [5], in signal processing applications with high ILP, in order to achieve high throughput one should expect datapaths with a large number of functional resources and low resource sharing.
Thus, placing an upper bound on the total number of functional resources was considered inadequate. One should, however, consider a constraint on the interconnect capacity, since congestion (during data transfers across clusters) may lead to major performance penalties, and the interconnect structure has a significant impact on the relevant figures of merit discussed above (i.e., delay and power/energy).

In summary, when considering the specialization of a datapath to a given kernel, one should seek solutions with a (possibly large) number of clusters working (quasi-)independently. Such configurations are the "ideal" ones, in that they decrease power and delay by taking advantage of locality in the computations, while incurring no (significant) latency/energy penalties due to switching.
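The register-file scaling argument above (area and power growing as N^3, delay as N^{3/2}, per [14]) can be made concrete with a back-of-the-envelope comparison. The sketch below is illustrative only: the exponents come from the text, but the assumption that a k-cluster machine behaves like k independent register files serving N/k FUs each, with inter-cluster moves ignored, is ours.

```python
# Toy comparison of centralized vs. clustered register-file costs,
# using the scaling exponents cited from [14]: for N FUs sharing one
# register file, area ~ N^3, delay ~ N^(3/2), power ~ N^3.
# Assumption (ours): a machine with k clusters of N/k FUs each pays
# the per-cluster area/power cost k times, and its combinational
# delay is set by a single (smaller) cluster.  Inter-cluster move
# operations and their energy cost are deliberately ignored here.

def centralized_cost(n_fus):
    """Relative cost of one register file serving n_fus FUs."""
    return {
        "area": n_fus ** 3,
        "delay": n_fus ** 1.5,
        "power": n_fus ** 3,
    }

def clustered_cost(n_fus, n_clusters):
    """Relative cost of n_clusters register files, each serving
    n_fus / n_clusters FUs (idealized: no inter-cluster traffic)."""
    per_cluster = centralized_cost(n_fus // n_clusters)
    return {
        "area": n_clusters * per_cluster["area"],
        "delay": per_cluster["delay"],   # critical path in one cluster
        "power": n_clusters * per_cluster["power"],
    }

if __name__ == "__main__":
    central = centralized_cost(8)
    clustered = clustered_cost(8, 4)   # four 2-FU clusters
    for metric in ("area", "delay", "power"):
        ratio = central[metric] / clustered[metric]
        print(f"{metric}: centralized={central[metric]:.1f} "
              f"clustered={clustered[metric]:.1f} ratio={ratio:.1f}x")
```

For eight FUs split into four 2-FU clusters, the model predicts a 16x reduction in register-file area and power and an 8x reduction in combinational delay, which is the intuition behind sustaining higher clock rates in the clustered machine; the latency and energy of the cross-cluster moves that this sketch ignores are exactly the penalties the paper's exploration algorithm must weigh.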
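The early exploration step described above sweeps the family D(m, n) over interconnect capacities m and cluster capacities n, optionally pruning by a designer-specified target latency. The outer loop of such a sweep can be sketched as follows; note that the latency estimator here is a deliberately naive stand-in (operations divided among clusters of capacity n, moves divided among m interconnect resources), not the paper's binding and scheduling algorithm, and the kernel summary it consumes is a hypothetical simplification.

```python
# Hypothetical outer loop for early design-space exploration over the
# family D(m, n).  estimate_latency() is a placeholder: the paper's
# actual algorithm binds and schedules the kernel's operations on a
# feasible clustered datapath; here a kernel is crudely summarized as
# (total operations, cross-cluster moves), an assumption of this sketch.

def estimate_latency(kernel, m, n):
    """Toy latency model: ops limited by cluster capacity n, data
    transfers limited by interconnect capacity m."""
    ops, moves = kernel
    return ops / n + moves / m

def explore(kernel, m_range, n_range, target_latency=None):
    """Return (latency, m, n) for the best feasible configuration,
    or None if no configuration meets the optional target latency."""
    best = None
    for m in m_range:
        for n in n_range:
            latency = estimate_latency(kernel, m, n)
            if target_latency is not None and latency > target_latency:
                continue  # infeasible for the designer's target
            if best is None or latency < best[0]:
                best = (latency, m, n)
    return best

if __name__ == "__main__":
    kernel = (64, 16)  # 64 operations, 16 cross-cluster moves (made up)
    print(explore(kernel, m_range=range(1, 5), n_range=range(1, 5)))
```

Even with this toy estimator, the loop exhibits the trade-off the paper targets: tightening m or n (smaller, cheaper register files and interconnect) raises the estimated latency, and a target latency carves out the feasible region of the (m, n) space.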