Feng Shui of Supercomputer Memory: Positional Effects in DRAM and SRAM Faults

Vilas Sridharan, RAS Architecture, Advanced Micro Devices, Inc., Boxborough, MA (vilas.sridharan@amd.com)
Jon Stearley, Scalable Architectures, Sandia National Laboratories [1], Albuquerque, New Mexico (jrstear@sandia.gov)
Nathan DeBardeleben, Ultrascale Systems Research Center, Los Alamos National Laboratory [2], Los Alamos, New Mexico (ndebard@lanl.gov)
Sean Blanchard, Ultrascale Systems Research Center, Los Alamos National Laboratory [2], Los Alamos, New Mexico (seanb@lanl.gov)
Sudhanva Gurumurthi, AMD Research, Advanced Micro Devices, Inc., Boxborough, MA (sudhanva.gurumurthi@amd.com)

ABSTRACT

Several recent publications confirm that faults are common in high-performance computing systems. Therefore, further attention to the faults experienced by such computing systems is warranted. In this paper, we present a study of DRAM and SRAM faults in large high-performance computing systems. Our goal is to understand the factors that influence faults in production settings. We examine the impact of aging on DRAM, finding a marked shift from permanent to transient faults in the first two years of DRAM lifetime. We examine the impact of DRAM vendor, finding that fault rates vary by more than 4x among vendors. We examine the physical location of faults in a DRAM device and in a data center; contrary to prior studies, we find no correlation with either. Finally, we study the impact of altitude and rack placement on SRAM faults, finding that, as expected, altitude has a substantial impact on SRAM faults, and that top-of-rack placement correlates with a 20% higher fault rate.

[1] Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under Contract DE-AC04-94AL85000. This document's Sandia identifier is 2013-3402C.

[2] A portion of this work was performed at the Ultrascale Systems Research Center (USRC) at Los Alamos National Laboratory, supported by the U.S. Department of Energy contract DE-FC02-06ER25750. The publication has been assigned the LANL identifier LA-UR-13-22888.

1. INTRODUCTION

Recent studies have confirmed that faults are common in the memory systems of high-performance computing systems [23]. Moreover, the U.S. Department of Energy (DOE) currently predicts that an exascale supercomputer in the early 2020s will have between 32 and 100 petabytes of main memory, a 100x to 350x increase over 2012 levels [6]. Similar increases are likely in the amount of cache memory (SRAM) in an exascale system. These systems will require comparable increases in the reliability of both SRAM and DRAM memories to maintain or improve system reliability relative to current systems. Therefore, further attention to the faults experienced by memory sub-systems is warranted. A proper understanding of hardware faults allows hardware and system architects to provision appropriate reliability mechanisms, and can affect operational procedures such as DIMM replacement policies.
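Field studies of this kind typically quantify faults as a rate over the aggregate operating time of a device population, commonly expressed in FIT (Failures In Time: faults per billion device-hours). The following is a minimal sketch of that conversion, not a figure or method taken from this paper; the function name and the fault count are illustrative assumptions.

    # Minimal sketch: convert an observed fault count over a device
    # population's aggregate operating time into a FIT rate.
    def fit_rate(num_faults, device_hours):
        """FIT = faults per billion device-hours."""
        return num_faults / device_hours * 1e9

    # Illustrative numbers only (not this study's measurements):
    # a hypothetical 1,000 DRAM faults over 23 billion device-hours.
    print(round(fit_rate(1_000, 23e9), 1))  # -> 43.5 FIT per device

Normalizing by device-hours rather than node-hours or system-hours allows fault rates to be compared across systems with different DIMM counts and device populations.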
In this paper, we present a study of DRAM and SRAM faults on two large high-performance computing systems. Our primary data set comes from Cielo, an 8,500-node supercomputer located at Los Alamos National Laboratory (LANL). A secondary data set comes from Jaguar, an 18,688-node supercomputer that was located at Oak Ridge National Laboratory. For Cielo, our measurement interval is a 15-month period from mid-2011 through early 2013, comprising 23 billion DRAM device-hours of data. For Jaguar, our measurement interval is an 11-month period from late 2009 through late 2010, comprising 17.1 billion DRAM device-hours of data. Both systems were in production and heavily utilized during their respective measurement intervals.

This research makes several contributions:

- We study the impact of aging on the DRAM fault rate. In contrast to previous studies [21], we find that the composition of DRAM faults changes substantially during the first two years of DRAM lifetime, shifting from primarily permanent faults to primarily transient faults.

- We examine the impact of DRAM vendor and device