A METHODOLOGY FOR PRECISE COMPARISONS OF PROCESSOR CORE ARCHITECTURES FOR HOMOGENEOUS MANY-CORE DSP PLATFORMS

B. Rousseau, Ph. Manet, I. Loiselle, J.-D. Legat
Université catholique de Louvain (UCL)
Laboratoire de microélectronique (DICE)
Place du Levant, 3
B-1348, Louvain-la-Neuve, Belgium

H. Vandierendonck
Ghent University
Dept. ELIS/HiPEAC
St.-Pietersnieuwstraat, 41
B-9000 Gent, Belgium

ABSTRACT

The power efficiency of a homogeneous many-core platform (HMCP) depends heavily on the architecture of its processor cores, so this architecture must be chosen carefully. When comparing processor architectures for use in a many-core platform, one must evaluate not only their instructions per cycle (IPC) but also their power and area. Precise power and area evaluations can only be done with real implementations. However, comparing processor implementations is difficult, because implementation specificities interfere with the measured performance. This paper proposes a methodology that enables precise performance comparisons between different processor architectures. Using this methodology, it is possible to choose the best architecture for an HMCP targeting DSP applications. The methodology is based on the use of a common architectural template to build the cores, and on the application of specific optimizations when relevant. To validate the methodology, three RISC cores are implemented: a single-issue core and two VLIW processors, one 3-issue and one 5-issue. The implemented cores are precisely compared on a set of DSP kernels.

Index Terms: homogeneous many-core, signal processing, processor architecture, power efficiency

1. INTRODUCTION

Homogeneous many-core platforms (HMCP) are used for DSP applications. At present, those platforms use up to several hundred processor cores [1, 2]. Those cores are typically RISC architectures, either single-issue or multiple-issue like VLIW processors. Thanks to their very high parallelism, they can reach very high throughputs.
They are also highly programmable and have good compiler support [3] compared to heterogeneous platforms such as SIMD accelerators [4]. HMCPs targeting DSP applications must be very power efficient, since DSP applications have a very limited power budget. The architecture of the cores composing the platform has a strong influence on the platform efficiency, so it should be chosen carefully.

On many-core platforms, performance can be increased by using more cores. However, adding cores increases the platform power and area, and the size of the increase depends on the power and area of each core. Different cores will lead to different platform configurations and performance. For instance, simple cores provide a low IPC, but their low area and power consumption allow many of them to fit on an HMCP with a given power and area budget. Conversely, more complex cores provide a better IPC but also require more area and power [5]. In this case, fewer cores can be used within the same budget. As those examples illustrate, there is a strong interaction between the performance of an HMCP and the IPC, power and area of its cores.

To choose the best architecture for the cores of an HMCP, one must therefore compare the power and area of the candidates in addition to their IPC. The IPC of a core can be evaluated with a simulator, but precise power and area figures require a real implementation, such as an IP or even a chip. However, when different processor architectures are compared through specific implementations, those implementations differ in many respects: ISA, technology, process flavor, hardware optimizations or compilation optimizations. Each of those aspects influences the core performance. To isolate the impact of the core architecture on the platform performance, it is necessary to reduce the interference introduced by a specific implementation of the architecture.

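The trade-off described above between core complexity and core count can be illustrated with a small calculation. The following Python sketch is not taken from the paper: the `CoreEstimate` type, the helper functions, the budgets and all numeric values are hypothetical placeholders, and the model assumes (optimistically) that aggregate IPC scales linearly with the number of cores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreEstimate:
    """Per-core figures; every value used below is a made-up placeholder."""
    name: str
    ipc: float      # average instructions per cycle on the target kernels
    area_um2: int   # area of one core, in square micrometres
    power_uw: int   # power of one core, in microwatts


def max_cores(core: CoreEstimate, area_budget_um2: int, power_budget_uw: int) -> int:
    """Largest core count that respects both the area and the power budget."""
    return min(area_budget_um2 // core.area_um2, power_budget_uw // core.power_uw)


def aggregate_ipc(core: CoreEstimate, area_budget_um2: int, power_budget_uw: int) -> float:
    """Platform-level IPC, assuming perfect scaling across cores (an assumption)."""
    return max_cores(core, area_budget_um2, power_budget_uw) * core.ipc


# Hypothetical candidates, NOT measurements from the paper.
CANDIDATES = [
    CoreEstimate("single-issue", ipc=0.8, area_um2=100_000, power_uw=10_000),
    CoreEstimate("3-issue VLIW", ipc=1.8, area_um2=250_000, power_uw=28_000),
    CoreEstimate("5-issue VLIW", ipc=2.4, area_um2=400_000, power_uw=50_000),
]

if __name__ == "__main__":
    # Assumed platform budget: 10 mm^2 of area and 1 W of power.
    area_budget, power_budget = 10_000_000, 1_000_000
    for core in CANDIDATES:
        n = max_cores(core, area_budget, power_budget)
        total = aggregate_ipc(core, area_budget, power_budget)
        print(f"{core.name}: {n} cores, aggregate IPC {total:.1f}")
```

The ranking produced by such a model depends entirely on the per-core IPC, area and power figures supplied to it, which is why those figures need to come from comparable implementations.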
To enable precise and fair performance comparisons at the architectural level, this work proposes a methodology that strongly reduces the variations introduced by the specific implementations of the cores. The methodology is based on the use of a common architectural template to build implementations of the compared cores, and on the application of specific optimizations to them when relevant. Using a template guarantees uniform implementations across the different architectures and provides shared generic implementations of the functionalities of a core. However, those generic implementations could be a disadvantage for some specific architectures