Achieving High Memory Performance from Heterogeneous Architectures with the SARC Programming Model

Roger Ferrer
Barcelona Supercomputing Center, Universitat Politècnica de Catalunya, Barcelona, Spain
roger.ferrer@bsc.es

Vicenç Beltran
Barcelona Supercomputing Center, Universitat Politècnica de Catalunya, Barcelona, Spain
vbeltran@bsc.es

Marc González
Barcelona Supercomputing Center, Universitat Politècnica de Catalunya, Barcelona, Spain
marc.gonzalez@bsc.es

Xavier Martorell
Barcelona Supercomputing Center, Universitat Politècnica de Catalunya, Barcelona, Spain
xavier.martorell@bsc.es

Eduard Ayguadé
Barcelona Supercomputing Center, Universitat Politècnica de Catalunya, Barcelona, Spain
eduard.ayguade@bsc.es

ABSTRACT
Current heterogeneous multicore architectures, including the Cell/B.E., GPUs, and future developments like Larrabee, require enormous programming effort to run current parallel applications efficiently and achieve high performance. In this paper, we present the results obtained from coding two benchmarks, matrix multiply and conjugate gradient (NAS CG), with the SARC Programming Model, with respect to memory bandwidth. We show some annotated sample loops and describe our experience in getting our system to execute them efficiently. Results indicate that the programming model is able to achieve up to 85% of the peak memory bandwidth on the Cell/B.E. processor.

1. INTRODUCTION
Heterogeneous multicore architectures available today [7, 26, 18] make it more difficult to write programs that also achieve close to peak performance. Algorithms have to be adapted, both in their computation and communication schemes, to fully exploit the underlying architecture.

Programmers have been struggling to achieve high performance in heterogeneous multicore architectures.
To do so, they have to write code differently from what they are used to. It is now common to program on top of a vendor-provided SDK on various architectures, such as the Cell BE processor [7] and NVIDIA GPU cards [17, 18].

In this work, we present the results of the evaluation of the SARC Programming Model with respect to memory performance. We use extensions to OpenMP [2] to program heterogeneous architectures, like the Cell/B.E. processor, exploiting their local memories in an efficient and easy way.

We have developed the transformations in our Mercurium C/C++ compilation infrastructure [12], and we have applied the techniques to matrix multiply and conjugate gradient (NAS CG). We have coded them in the SARC programming model, showing that we can achieve far higher productivity than with SDK programming. We analyze the memory performance that we obtain in each of them. Results show that the transformations are feasible and that our system achieves good communication performance. They also show that, in most cases, it is the computation needed in those applications that limits the final performance obtained. In some cases, compiler techniques like vectorization would help considerably in obtaining better overall performance.

The rest of the paper is organized as follows: Section 2 presents relevant related work. Section 3 presents how some parts of the algorithms are written using the SARC programming model. Section 4 presents the evaluation of the benchmarks. Section 5 concludes the paper and Section 6 outlines future work.

2. RELATED WORK
Several studies have examined the memory performance of heterogeneous architectures to determine their capability to sustain high memory bandwidth. Jimenez et al. [15] analyzed the Cell BE processor with respect to memory performance. They show that code running on the SPEs should exploit loop unrolling, double buffering, and DMA lists, and should delay synchronization with the DMA transfers as much as possible.
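As a rough illustration of the double-buffering idea mentioned above, the following C sketch streams an array through two alternating local buffers. It is a minimal sketch under stated assumptions, not code from the paper or from any SDK: `fetch_chunk`, `sum_double_buffered`, and `CHUNK` are hypothetical names, and a synchronous `memcpy` stands in for an asynchronous DMA transfer, so only the buffer-alternation structure is shown, not real overlap of transfer and computation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 128   /* elements per local buffer (illustrative size) */

/* Hypothetical stand-in for an asynchronous DMA "get".
   Assumption: on real hardware this would only start the transfer. */
static void fetch_chunk(float *local, const float *remote, size_t count)
{
    memcpy(local, remote, count * sizeof(float));
}

/* Sums src[0..n) by streaming it through two alternating local buffers.
   n must be a multiple of CHUNK. */
static double sum_double_buffered(const float *src, size_t n)
{
    static float buf[2][CHUNK];   /* the two "local store" buffers */
    double total = 0.0;
    size_t chunks = n / CHUNK;
    int cur = 0;

    assert(n % CHUNK == 0);
    if (chunks == 0)
        return 0.0;

    /* Prefetch the first chunk before entering the loop. */
    fetch_chunk(buf[cur], src, CHUNK);

    for (size_t c = 0; c < chunks; c++) {
        int next = 1 - cur;

        /* Request the following chunk into the other buffer. */
        if (c + 1 < chunks)
            fetch_chunk(buf[next], src + (c + 1) * CHUNK, CHUNK);

        /* With real DMA, the wait for buf[cur] would go here,
           as late as possible before the data is first used. */
        for (size_t i = 0; i < CHUNK; i++)
            total += buf[cur][i];

        cur = next;   /* swap the roles of the two buffers */
    }
    return total;
}
```

Calling `sum_double_buffered(data, 256)` on an array of 256 ones returns 256.0. With a truly asynchronous transfer, the computation on one buffer would proceed while the other buffer is being filled.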
In this paper, we use loop unrolling and blocking techniques, but our compiler is not yet able to exploit double buffering or DMA lists. The IBM compiler [11] targeting the Cell BE processor also exploits loop blocking, unrolling and double

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
MEDEA'09, Sept. 13th, 2009, Raleigh, North Carolina, USA.
(C) 2009 ACM 978-1-60558-830-8/09/09...$5.00.