Leading Computational Methods on Scalar and Vector HEC Platforms

Leonid Oliker, Jonathan Carter, Michael Wehner, Andrew Canning
CRD/NERSC, Lawrence Berkeley National Laboratory, Berkeley, CA 94720
Stephane Ethier
Princeton Plasma Physics Laboratory, Princeton University, Princeton, NJ 08543
Art Mirin, Govindasamy Bala
Lawrence Livermore National Laboratory, Livermore, CA 94551
David Parks
NEC Solutions America, Advanced Technical Computing Center, The Woodlands, TX 77381
Patrick Worley
Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831
Shigemune Kitawaki, Yoshinori Tsuda
Earth Simulator Center, Japan Agency for Marine-Earth Science and Technology, Yokohama, Japan

ABSTRACT

The last decade has witnessed a rapid proliferation of superscalar cache-based microprocessors to build high-end computing (HEC) platforms, primarily because of their generality, scalability, and cost effectiveness. However, the growing gap between sustained and peak performance for full-scale scientific applications on conventional supercomputers has become a major concern in high performance computing, requiring significantly larger systems and application scalability than implied by peak performance in order to achieve desired performance. The latest generation of custom-built parallel vector systems have the potential to address this issue for numerical algorithms with sufficient regularity in their computational structure. In this work we explore applications drawn from four areas: atmospheric modeling (CAM), magnetic fusion (GTC), plasma physics (LBMHD3D), and material science (PARATEC). We compare performance of the vector-based Cray X1, Earth Simulator, and newly-released NEC SX-8 and Cray X1E, with performance of three leading commodity-based superscalar platforms utilizing the IBM Power3, Intel Itanium2, and AMD Opteron processors.
Our work makes several significant contributions: the first reported vector performance results for CAM simulations utilizing a finite-volume dynamical core on a high-resolution atmospheric grid; a new data-decomposition scheme for GTC that (for the first time) enables a breakthrough of the Teraflop barrier; the introduction of a new three-dimensional Lattice Boltzmann magnetohydrodynamic implementation used to study the onset evolution of plasma turbulence that achieves over 26 Tflop/s on 4800 ES processors; and the largest PARATEC cell size atomistic simulation to date. Overall, results show that the vector architectures attain unprecedented aggregate performance across our application suite, demonstrating the tremendous potential of modern parallel vector systems.

(c) 2005 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by a contractor or affiliate of the U.S. Government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. SC|05 November 12-18, 2005, Seattle, Washington, USA. (c) 2005 ACM 1-59593-061-2/05/0011...$5.00

1. INTRODUCTION

Due to their cost effectiveness, an ever-growing fraction of today’s supercomputers employ commodity superscalar processors, arranged as systems of interconnected SMP nodes. However, the constant degradation of superscalar sustained performance has become a well-known problem in the scientific computing community [1]. This trend has been widely attributed to the use of superscalar-based commodity components whose architectural designs offer a balance between memory performance, network capability, and execution rate that is poorly matched to the requirements of large-scale numerical computations. The latest generation of custom-built parallel vector systems may address these challenges for numerical algorithms amenable to vectorization.
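The vectorization requirement noted above is made precise by Amdahl's Law, discussed in the next paragraph. As a back-of-the-envelope illustration (not taken from the paper; the fractions and speedup factor below are hypothetical), a short sketch shows how quickly the non-vectorizable remainder of a code caps the overall speedup:

```python
def amdahl_vector_speedup(f, v):
    """Overall speedup when a fraction f of the runtime is
    vectorizable with a per-operation speedup of v, and the
    remaining (1 - f) runs at scalar speed (Amdahl's Law)."""
    return 1.0 / ((1.0 - f) + f / v)

# Assuming a hypothetical 16x vector speedup: a code that is
# only 90% vectorizable gains far less than 16x overall,
# while near-complete vectorization approaches the limit.
print(round(amdahl_vector_speedup(0.90, 16), 2))  # 6.4
print(round(amdahl_vector_speedup(0.99, 16), 2))  # 13.91
```

This is why the paper emphasizes "sufficient regularity in computational structure": the achievable rate is governed less by peak vector throughput than by how much of the application vectorizes at all.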
Vector architectures exploit regularities in computational structures, issuing uniform operations on independent data elements, thus allowing memory latencies to be masked by overlapping pipelined vector operations with memory fetches. Vector instructions specify a large number of identical operations that may execute in parallel, thereby reducing control complexity and efficiently controlling a large amount of computational resources. However, as described by Amdahl's Law, the time taken by the portions of the code that are non-vectorizable can dominate the execution time, significantly reducing the achieved computational rate.

In order to quantify what modern vector capabilities imply for the scientific communities that rely on modeling and simulation, it is critical to evaluate vector systems in the context of demanding computational algorithms. This study examines the behavior of four diverse scientific applications with the potential to run at ultra-scale, in the areas of atmospheric modeling (CAM), magnetic fusion (GTC), plasma physics (LBMHD3D), and material science (PARATEC). We compare the performance of leading com-