NSREC 2021

Experimental Findings on the Sources of Detected Unrecoverable Errors in GPUs

Fernando Fernandes dos Santos+, Sujit Malde*, Carlo Cazzaniga*, Christopher Frost*, Luigi Carro+, and Paolo Rech+
+Institute of Informatics of Universidade Federal do Rio Grande do Sul (UFRGS), Brazil
*Science and Technology Facilities Council (STFC), UKRI

Abstract—We investigate the sources of Detected Unrecoverable Errors (DUEs) in GPUs exposed to neutron beams. Illegal memory accesses and interface errors are among the most likely sources of DUEs. ECC increases the number of launch failure events. Our tests show that ECC can reduce the DUEs caused by illegal address accesses by up to 92% for Kepler and 98% for Volta.

I. INTRODUCTION

Graphics Processing Units (GPUs) have evolved from graphics rendering devices to general-purpose accelerators extensively employed in HPC and in safety-critical applications, such as autonomous vehicles for the automotive and aerospace markets. The highly parallel architecture of GPUs fits the computational characteristics of most HPC codes and of the Convolutional Neural Networks (CNNs) used to detect objects. The most recent GPU architecture advances, such as tensor cores and mixed-precision functional units, aim to improve performance and software flexibility for HPC and deep learning applications.

Today, the reliability of parallel processors is a significant concern for both safety-critical applications and HPC systems. Unexpected errors in parallel devices may lead to catastrophic accidents in self-driving vehicles and, in HPC systems, to lower scientific productivity, lower operational efficiency, and even significant monetary loss.

Most recent studies target Silent Data Corruption (SDC) in their evaluation. SDCs, being undetectable, are in fact considered the main threat to the reliability of modern computing devices [1].
Detected Unrecoverable Errors (DUEs), such as device hangs, application crashes, or functional interruptions, are considered less harmful because, being detectable by definition, they can be handled with solutions such as checkpoints and software/hardware watchdogs [2], [3]. Nevertheless, recovering from a DUE, or taking the action needed to reach a fail-safe state, requires a significant amount of time, which risks reducing supercomputer productivity. A small cluster with 32K cores would take almost an hour to restart after a crash [2], without considering the checkpointing overhead. In safety-critical real-time systems, such as autonomous vehicles, the DUE risk is even higher, as it may compromise the system's ability to complete the task before the deadline. For instance, a GPU for autonomous vehicles must process 40 frames per second. The recovery from a DUE must be efficient enough not to miss any frame, which is highly challenging. In this scenario, tracing the software and hardware sources of DUEs and quickly identifying the occurrence of a DUE are essential tools for creating applications that are more tolerant to crashes and hangs.

This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 886202 and from the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brazil (CAPES) - Finance Code 001.

In this paper, we investigate the sources of DUEs in two NVIDIA architectures: Kepler and Volta. We provide a novel and detailed analysis of DUE sources on GPUs, based on neutron experimental data and system log profiles. We create a framework that allows the tracing of the GPU crashes and hangs observed during radiation experiments. We select a set of eight algorithms and compare their DUE and SDC rates, with ECC both disabled and enabled.
Each code has peculiar characteristics regarding memory utilization, computing power, and control-flow operations, highlighting specific architecture behaviors that could be generalized to similar algorithms. We report findings from recently completed (remotely controlled) neutron beam testing that represents a total of more than 2 million years of operation in a natural environment. Finally, we discuss how the use of system log tracing can make DUE detection (and thus recovery) faster.

arXiv:2108.00554v1 [cs.DC] 1 Aug 2021

II. RADIATION-INDUCED SDCS AND DUES IN GPUS

A transient fault leads to one of the following outcomes: (1) no effect on the program output (i.e., the fault is masked, or the corrupted data is not used), (2) a Silent Data Corruption (SDC) (i.e., an incorrect program output), or (3) a Detected Unrecoverable Error (DUE) (i.e., a program crash or device reboot). Previous studies have stated that parallel architectures, particularly GPUs, have a high fault rate because of the large amount of available resources [4], [5]. Recent works have identified some peculiar reliability weaknesses of GPU architectures, suspecting that the corruption of the GPU hardware scheduler or of shared memories can severely impact the computation of several parallel threads [4], [6], [7]. As a result, multiple GPU output elements can potentially be corrupted, effectively undermining the reliability of several applications, including CNNs [8], [9].

Even if DUEs are detectable, they can lead to monetary loss or harmful events. For instance, a self-driving car that relies on a GPU to perform object detection, if rebooted, can delay a response to a critical situation, thus putting human lives in