Multi-Science Applications with Single Codebase - GAMER - for Massively Parallel Architectures

Hemant Shukla
Lawrence Berkeley National Laboratory
1 Cyclotron Road, Berkeley, USA
hshukla@lbl.gov

Hsi-Yu Schive
Department of Physics, National Taiwan University
106, Taipei, Taiwan
b88202011@ntu.edu.tw

Tak-Pong Woo
Department of Physics, Soochow University
106, Taipei, Taiwan

Tzihong Chiueh
Department of Physics, National Taiwan University
106, Taipei, Taiwan

ABSTRACT

The growing need for power-efficient extreme-scale high-performance computing (HPC), coupled with plateauing clock speeds, is driving the emergence of massively parallel compute architectures. Tens to many hundreds of cores are increasingly made available as compute units, either as an integral part of the main processor or as coprocessors designed to handle massively parallel workloads. In the case of many-core graphics processing units (GPUs), hundreds of SIMD cores primarily designed for image and video rendering are used for high-performance scientific computations. The new architectures typically offer programming models such as CUDA (NVIDIA) and the open standard OpenCL. However, wide-ranging adoption of these parallel architectures is hampered by a steep learning curve and requires reengineering of existing applications, which mostly leads to expensive and error-prone code rewrites without prior guarantee or knowledge of any speedup.

A broad range of complex scientific applications across many domains use common algorithms and techniques, such as adaptive mesh refinement (AMR), advanced hydrodynamics partial differential equation (PDE) solvers, Poisson-gravity solvers, etc., that have demonstrably performed highly efficiently on GPU-based systems. Taking advantage of these commonalities, we use the GPU-aware AMR code GAMER [1] to examine the unique approach of solving multi-science problems in astrophysics, hydrodynamics, and particle physics with a single codebase.
We demonstrate significant speedups in disparate classes of scientific applications on three separate clusters, viz., Dirac, Laohu, and Mole-8.5. By extensively reusing the extendable single codebase we mitigate the impediments of significant code rewrites. We also collect performance and energy-consumption benchmark metrics on the 50-node Dirac cluster at the National Energy Research Scientific Computing Center (NERSC), which pairs NVIDIA C2050 GPUs with 8-core Intel Nehalem CPUs. In addition, we propose a strategy and framework for legacy and new applications to successfully leverage the evolving GAMER codebase on massively parallel architectures. The framework and the benchmarks are aimed at helping quantify the adoption strategies for legacy and new scientific applications.

Keywords

GPU, AMR, hydrodynamics, Poisson-Gravity solvers, simulations, benchmarks

1. INTRODUCTION

Developing highly power-efficient computational architectures is one of the many big challenges for the future of high-performance computing (HPC) as it accelerates towards exascale. As of November 2010, the top Green 500 machine, an IBM Blue Gene/Q prototype, yields 1684.20 MFLOP/s/W with a total consumption of 38.8 kW. At this efficiency, exaFLOP/s performance would require about a gigawatt of power, comparable to the peak output of an average nuclear power plant. It is quite evident that at current energy-performance levels exascale computing will simply not be feasible.
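The gigawatt figure above follows directly from the quoted Green 500 efficiency. As a back-of-the-envelope check (a minimal sketch; the function name is our own, not from the paper):

```python
def power_at_exascale(mflops_per_watt):
    """Estimate the sustained power (watts) needed to deliver
    1 exaFLOP/s at a given efficiency in MFLOP/s per watt."""
    exaflops = 1e18                      # target: 10^18 FLOP/s
    flops_per_watt = mflops_per_watt * 1e6
    return exaflops / flops_per_watt

# Top Green 500 entry (Nov 2010): IBM Blue Gene/Q prototype
# at 1684.20 MFLOP/s/W.
watts = power_at_exascale(1684.20)
print(f"{watts / 1e9:.2f} GW")  # ~0.59 GW, i.e. of order a gigawatt
```

The result, roughly 0.6 GW sustained, is the basis for the "about a gigawatt" comparison to a nuclear power plant.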
However, based on the same Green 500 estimates, an interesting trend to note is that accelerator-based clusters are 3.5 times more efficient than traditional clusters. Taking heed of the looming energy impasse, vendors are proposing a variety of architecture roadmaps that place massively parallel compute engines on single chips. The multi-core trajectory adds more cores to central processing units (CPUs): current Intel i7 processors have 4 cores, and the next generation is slated to have 6-8. In contrast, the many-core trajectory offers a very large number of lightweight processing cores; the latest NVIDIA Fermi GPU has 512 cores. GPUs were originally designed for image and video rendering, primarily for the gaming indus-