Multi-Science Applications with Single Codebase - GAMER - for Massively Parallel Architectures

Hemant Shukla
Lawrence Berkeley National Laboratory
1 Cyclotron Road, Berkeley, USA
hshukla@lbl.gov

Hsi-Yu Schive
Department of Physics, National Taiwan University
106, Taipei, Taiwan
b88202011@ntu.edu.tw

Tak-Pong Woo
Department of Physics, Soochow University
106, Taipei, Taiwan

Tzihong Chiueh
Department of Physics, National Taiwan University
106, Taipei, Taiwan

ABSTRACT

The growing need for power-efficient extreme-scale high-performance computing (HPC), coupled with plateauing clock speeds, is driving the emergence of massively parallel compute architectures. Tens to many hundreds of cores are increasingly made available as compute units, either as an integral part of the main processor or as coprocessors designed to handle massively parallel workloads. In the case of many-core graphics processing units (GPUs), hundreds of SIMD cores primarily designed for image and video rendering are used for high-performance scientific computations. The new architectures typically offer programming models such as CUDA (NVIDIA) and the open standard OpenCL. However, wide-ranging adoption of these parallel architectures is hampered by a steep learning curve and requires reengineering of existing applications, which mostly leads to expensive and error-prone code rewrites without prior guarantee or knowledge of any speedup.

A broad range of complex scientific applications across many domains use common algorithms and techniques, such as adaptive mesh refinement (AMR), advanced hydrodynamics partial differential equation (PDE) solvers, Poisson-gravity solvers, etc., that have demonstrably performed highly efficiently on GPU-based systems. Taking advantage of these commonalities, we use the GPU-aware AMR code GAMER [1] to examine the unique approach of solving multi-science problems in astrophysics, hydrodynamics, and particle physics with a single codebase.
We demonstrate significant speedups in disparate classes of scientific applications on three separate clusters, viz., Dirac, Laohu, and Mole-8.5. By extensively reusing the extendable single codebase we mitigate the impediments of significant code rewrites. We also collect performance and energy-consumption benchmark metrics on the 50-node Dirac cluster at the National Energy Research Scientific Computing Center (NERSC), which pairs NVIDIA C2050 GPUs with 8-core Intel Nehalem CPUs. In addition, we propose a strategy and framework for legacy and new applications to successfully leverage the evolving GAMER codebase on massively parallel architectures. The framework and the benchmarks are aimed at helping quantify the adoption strategies for legacy and new scientific applications.

Keywords

GPU, AMR, hydrodynamics, Poisson-Gravity solvers, simulations, benchmarks

1. INTRODUCTION

Developing highly power-efficient computational architectures is one of the many big challenges for the future of high-performance computing (HPC) as it accelerates towards exascale. As of November 2010, the top Green 500 machine, an IBM Blue Gene/Q prototype, yields 1684.20 MFLOP/s/W with a total consumption of 38.8 kW. At this efficiency, exaFLOP/s performance would require about a gigawatt of power, comparable to the peak output of an average nuclear power plant. It is quite evident that at current energy-performance levels exascale computing will simply not be feasible.
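The gigawatt figure above follows directly from the quoted Green 500 efficiency. As a back-of-the-envelope check (a minimal sketch; the function name is our own, not from the paper):

```python
def power_at_exascale(mflops_per_watt):
    """Estimate the sustained power (watts) needed to deliver
    1 exaFLOP/s at a given efficiency in MFLOP/s per watt."""
    exaflops = 1e18                      # target: 10^18 FLOP/s
    flops_per_watt = mflops_per_watt * 1e6
    return exaflops / flops_per_watt

# Top Green 500 entry (Nov 2010): IBM Blue Gene/Q prototype
# at 1684.20 MFLOP/s/W.
watts = power_at_exascale(1684.20)
print(f"{watts / 1e9:.2f} GW")  # ~0.59 GW, i.e. of order a gigawatt
```

The result, roughly 0.6 GW sustained, is the basis for the "about a gigawatt" comparison to a nuclear power plant.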
However, based on the same Green 500 estimates, an interesting trend to note is that accelerator-based clusters are 3.5 times more efficient than traditional clusters. Taking heed of the looming energy impasse, vendors are proposing a variety of architecture roadmaps that place massively parallel compute engines on single chips. The multi-core trajectory adds more cores to central processing units (CPUs): current Intel i7 processors have 4 cores, and the next generation is slated to have 6-8. In contrast, the many-core trajectory offers a very large number of lightweight processing cores; the latest NVIDIA Fermi GPU has 512 cores. GPUs were originally designed for image and video rendering, primarily for the gaming indus-