Microelectronics Reliability 115 (2020) 113969
Available online 28 October 2020
0026-2714/© 2020 Elsevier Ltd. All rights reserved.
Review paper
Soft errors in DNN accelerators: A comprehensive review
Younis Ibrahim a, Haibin Wang a,f,*, Junyang Liu a, Jinghe Wei b, Li Chen c, Paolo Rech d, Khalid Adam e, Gang Guo f
a College of IoT Engineering, Hohai University, Changzhou, Jiangsu, China
b No.58 Research Institute, China Electronics Technology Group Co., Wuxi, Jiangsu, China
c University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, Canada
d Institute of Informatics, Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, Rio Grande do Sul, Brazil
e Faculty of Electrical and Electronics Engineering, Universiti Malaysia Pahang (UMP), Malaysia
f China Institute of Atomic Energy, Beijing, China
ARTICLE INFO
Keywords:
Deep learning
DNN accelerators
GPU
FPGA
ASIC
Soft errors
Reliability
ABSTRACT
Deep learning tasks cover a broad range of domains and an even wider range of applications, from entertainment to extremely safety-critical fields. Thus, Deep Neural Network (DNN) algorithms are implemented on many different systems, from small embedded devices to data centers. DNN accelerators have proven to be key to efficiency, as they execute these workloads far more efficiently than CPUs, and they have therefore become the major hardware platforms for running DNN algorithms. However, these accelerators are susceptible to several types of faults. Soft errors pose a particular threat because the high degree of parallelism in these accelerators can propagate a single fault into multiple errors at subsequent levels, ultimately corrupting the model's predictions. This article presents a comprehensive review of the reliability of DNN accelerators. The study begins by examining the widely assumed claim that DNNs are inherently tolerant to faults. Then, the available DNN accelerators are systematically classified into several categories; each is analyzed individually, and the commonly used accelerators are compared in an attempt to answer the question: which accelerator is more reliable against transient faults? The concluding part of this review highlights the gray areas of DNN reliability and predicts future research directions that will enhance the applicability of DNNs. This study is expected to benefit researchers in the areas of deep learning, DNN accelerators, and the reliability of this efficient paradigm.
1. Introduction
Deep Neural Networks (DNNs) have been progressing at a phenomenal pace and gaining unprecedented popularity in recent years. According to the latest report by the International Data Corporation (IDC), spending on Artificial Intelligence (AI) systems alone was $24.0 billion in 2019 and was expected to reach $97.9 billion in 2023, with the most considerable portion spent on AI hardware [1]. The report indicates the steady growth of this field. One reason is that DNNs have become the predominant approach for most machine learning tasks previously handled by other methods [2]. DNNs have been adopted in a variety of applications [3], such as computer vision [4], speech recognition [5], and natural language processing [6]. In vision tasks, this technology has even surpassed human abilities [7–9]. DNN technology has attracted intense research interest in almost every field that involves advanced data analytics and intelligent control. Consequently, DNNs are now widely used in complex mission-critical systems, such as autonomous vehicles [10] and healthcare [11]. Other fields, such as NASA's space applications, are also vigorously pursuing this efficient paradigm [12,13].
Artificial Neural Networks (ANNs) are widely considered tolerant to transient faults (i.e., soft errors). This belief rests on several observations commonly claimed to be facts. First, because ANNs contain a large number of neurons (thousands to millions), more than required to accomplish a specific task [14], they can still serve their overall purpose even if some of the neurons stop working. Second, since ANNs mimic the human brain, which can tolerate some level of neuron faults or even use noise as a source of computation [15], ANNs are assumed to inherit this intrinsic fault tolerance of biological systems [16]. Third, maximizing performance is the main issue that dominates research work in the field of AI [17]. Fourth, the learning capability of ANNs during and after the training process seems to indicate that ANNs have a
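The first of these observations, that redundant neurons mask individual faults, can be made concrete with a toy experiment. The sketch below is purely illustrative and not drawn from any of the reviewed works: the network shape, the random weights, and the flip_bit helper are our own assumptions. It flips a single bit in one float32 weight of a small perceptron, mimicking a single-event upset in weight memory, and checks whether the predicted class changes.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(x, w1, w2):
    # Tiny two-layer perceptron; the prediction is the argmax class.
    return int(np.argmax(w2 @ relu(w1 @ x)))

def flip_bit(value, bit):
    # Flip one bit of a float32 value (models a single-event upset).
    raw = int.from_bytes(np.float32(value).tobytes(), "little")
    raw ^= 1 << bit
    return np.frombuffer(raw.to_bytes(4, "little"), dtype=np.float32)[0]

rng = np.random.default_rng(0)
w1 = rng.standard_normal((8, 4)).astype(np.float32)   # hidden layer weights
w2 = rng.standard_normal((3, 8)).astype(np.float32)   # output layer weights
x = rng.standard_normal(4).astype(np.float32)         # one input sample

baseline = forward(x, w1, w2)

# Inject a low-order mantissa bit flip into a single weight and
# compare the predicted class with the fault-free baseline.
w1_faulty = w1.copy()
w1_faulty[0, 0] = flip_bit(w1[0, 0], bit=2)
same = forward(x, w1_faulty, w2) == baseline
print("prediction unchanged after low-bit flip:", same)
```

In this kind of experiment, a flip in a low mantissa bit perturbs the weight by a negligible amount and is usually masked by the argmax, whereas a flip in a high exponent bit can inflate the weight by orders of magnitude and corrupt the output, which is one reason the criticality of a soft error depends heavily on where the bit lands.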
* Corresponding author.
E-mail address: wanghaibin@hhuc.edu.cn (H. Wang).
https://doi.org/10.1016/j.microrel.2020.113969
Received 11 November 2019; Received in revised form 22 September 2020; Accepted 12 October 2020