Energy-Efficient Multiprocessor Systems-on-Chip for Embedded Computing: Exploring Programming Models and Their Architectural Support

Francesco Poletti, Antonio Poggiali, Davide Bertozzi, Luca Benini, Pol Marchal, Mirko Loghi, and Massimo Poncino

Abstract—In today’s multiprocessor SoCs (MPSoCs), parallel programming models are needed to fully exploit hardware capabilities and to achieve the 100 Gops/W energy efficiency target required for Ambient Intelligence Applications. However, mapping abstract programming models onto tightly power-constrained hardware architectures imposes overheads which might seriously compromise performance and energy efficiency. The objective of this work is to perform a comparative analysis of message passing versus shared memory as programming models for single-chip multiprocessor platforms. Our analysis is carried out from a hardware-software viewpoint: We carefully tune hardware architectures and software libraries for each programming model. We analyze representative application kernels from the multimedia domain, and identify application-level parameters that heavily influence performance and energy efficiency. Then, we formulate guidelines for the selection of the most appropriate programming model and its architectural support.

Index Terms—MPSoCs, embedded multimedia, programming models, task-level parallelism, energy efficiency, low power.

1 INTRODUCTION

The traditional dichotomy between shared memory and message passing as programming models for multiprocessor systems has consolidated into a well-accepted partitioning. For small-to-medium scale multiprocessor systems, there is an undisputed consensus on cache-coherent architectures based on shared memory. In contrast, large-scale high-performance multiprocessor systems have converged toward nonuniform memory access (NUMA) architectures based on message passing (MP) [3], [4].
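As a purely illustrative aside, the difference between the two programming models can be sketched in a few lines of Python. This sketch is not taken from the hardware architectures or software libraries studied in this work: the function names `sum_shared_memory` and `sum_message_passing`, the use of threads, and the reduction workload are all assumptions chosen only to make the contrast concrete.

```python
import queue
import threading


def sum_shared_memory(data, n_workers=4):
    """Shared memory style: all workers update one shared accumulator.

    Hypothetical example; a lock stands in for the synchronization that a
    shared-memory platform must provide around shared data.
    """
    total = [0]                      # shared state visible to every worker
    lock = threading.Lock()

    def worker(chunk):
        partial = sum(chunk)         # private computation
        with lock:                   # synchronized update of shared data
            total[0] += partial

    threads = [threading.Thread(target=worker, args=(data[i::n_workers],))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


def sum_message_passing(data, n_workers=4):
    """Message passing style: workers share nothing and send results.

    Hypothetical example; a queue stands in for the communication channel.
    """
    mailbox = queue.Queue()          # explicit channel between tasks

    def worker(chunk):
        mailbox.put(sum(chunk))      # send the partial result as a message

    threads = [threading.Thread(target=worker, args=(data[i::n_workers],))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The consumer receives exactly one message per worker.
    return sum(mailbox.get() for _ in range(n_workers))
```

In the first function, communication is implicit (a write to shared data guarded by synchronization); in the second, it is explicit (one message per worker). Both compute the same result, so the choice between them is a matter of cost and programmability rather than functionality.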
The appearance of Multi-Processor Systems-on-Chip (MPSoCs) in the multiprocessing scenario, however, has called this picture into question. In fact, several peculiarities differentiate these architectures from classical multiprocessing platforms.

First, their “on-chip” nature reduces the cost of interprocessor communication. The cost of sending a message on an on-chip bus is, in fact, at least one order of magnitude lower (power and performance-wise) than that of an off-chip bus, thus pushing toward message passing-based programming models. On the other hand, the cost of on-chip memory accesses is also lower than that of off-chip memory accesses; this makes cache-coherent architectures based on shared memory competitive.

Second, MPSoCs are resource-constrained systems. This implies that, while performance is still critical, other cost metrics, such as power consumption, must be considered. Unfortunately, it is not usually possible to optimize power and performance concurrently, and one quantity must typically be traded off against the other.

Third, unlike traditional message passing systems, some MPSoC architectures are highly heterogeneous. For instance, some platforms are a mix of standard processor cores and application-specific processors such as DSPs or microcontrollers [19], [8]. Conversely, other platforms are highly modular and reminiscent of traditional multiprocessor architectures [22], [24]. While, in the former case, message passing is the only viable alternative (some of the processing engines may even be cacheless), in the latter case, a cache-coherence model seems to be the most intuitive choice.

All of these issues indicate that the choice between the two programming models is not so well-defined for MPSoCs. The objective of this work is precisely that of exploring which factors may affect this choice, from a novel and more exhaustive perspective.
606 IEEE TRANSACTIONS ON COMPUTERS, VOL. 56, NO. 5, MAY 2007

F. Poletti and L. Benini are with DEIS, University of Bologna, Viale Risorgimento 2/2, 40100 Bologna (BO), Italy. E-mail: {fpoletti, lbenini}@deis.unibo.it.
A. Poggiali is with STMicroelectronics, Centro Direzionale Colleoni, via Cardano 2-palazzo Dialettica, 20041 Agrate Brianza (MI), Italy. E-mail: antonio.poggiali@st.com.
D. Bertozzi is with the Engineering Department, University of Ferrara, Via Saragat, 1, 44100 Ferrara (FE), Italy. E-mail: dbertozzi@ing.unife.it.
P. Marchal is with ESAT KULeuven-IMEC vzw, Kapeldreef 75, 3001 Heverlee, Belgium. E-mail: marchal@imec.be.
M. Loghi and M. Poncino are with the Dipartimento di Automatica e Informatica, Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129 Torino, Italy. E-mail: {mirko.loghi, massimo.poncino}@polito.it.

Manuscript received 25 Aug. 2005; revised 26 May 2006; accepted 10 Sept. 2006; published online 6 Mar. 2007. For information on obtaining reprints of this article, please send e-mail to: tc@computer.org, and reference IEEECS Log Number TC-0283-0805. Digital Object Identifier no. 10.1109/TC.2007.1040. 0018-9340/07/$25.00 © 2007 IEEE. Published by the IEEE Computer Society.

Although our analysis considers the two traditional dimensions of the problem, namely, the architecture and the software, they are both considered from the software perspective. In particular, we assume that the