Microelectronics Reliability 115 (2020) 113969
Available online 28 October 2020
0026-2714/© 2020 Elsevier Ltd. All rights reserved.
Review paper
Soft errors in DNN accelerators: A comprehensive review
Younis Ibrahim a, Haibin Wang a,f,*, Junyang Liu a, Jinghe Wei b, Li Chen c, Paolo Rech d, Khalid Adam e, Gang Guo f
a College of IoT Engineering, Hohai University, Changzhou, Jiangsu, China
b No.58 Research Institute, China Electronics Technology Group Co., Wuxi, Jiangsu, China
c University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, Canada
d Institute of Informatics, Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, Rio Grande do Sul, Brazil
e Faculty of Electrical and Electronics Engineering, Universiti Malaysia Pahang (UMP), Malaysia
f China Institute of Atomic Energy, Beijing, China
ARTICLE INFO
Keywords:
Deep learning
DNN accelerators
GPU
FPGA
ASIC
Soft errors
Reliability
ABSTRACT
Deep learning tasks cover a broad range of domains and an even wider range of applications, from entertainment to extremely safety-critical fields. Thus, Deep Neural Network (DNN) algorithms are implemented on many different systems, from small embedded devices to data centers. DNN accelerators have proven to be key to efficiency, as they execute these workloads far more efficiently than CPUs, and they have therefore become the major hardware platforms for running DNN algorithms. However, these accelerators are susceptible to several types of faults. Soft errors pose a particular threat because the high degree of parallelism in these accelerators can propagate a single fault into multiple errors at subsequent levels, ultimately corrupting the model's predictions. This article presents a comprehensive review of the reliability of DNN accelerators. The study begins by examining the widely assumed claim that DNNs are inherently tolerant to faults. Then, the available DNN accelerators are systematically classified into several categories; each is analyzed individually, and the commonly used accelerators are compared in an attempt to answer the question: which accelerator is more reliable against transient faults? The concluding part of this review highlights the gray areas of DNN reliability and predicts future research directions that will enhance the applicability of DNNs. This study is expected to benefit researchers in the areas of deep learning, DNN accelerators, and the reliability of this efficient paradigm.
1. Introduction
Deep Neural Networks (DNNs) have been progressing at a phenomenal pace and gaining unprecedented popularity in recent years. According to the latest report by the International Data Corporation (IDC), spending on Artificial Intelligence (AI) systems alone was $24.0 billion in 2019 and was expected to reach $97.9 billion in 2023, with the most considerable portion spent on AI hardware [1]. The report indicates the steady growth of this field. One reason is that DNNs have become the predominant approach for most machine learning tasks previously handled by other methods [2]. DNNs have been adopted in a variety of applications [3], such as computer vision [4], speech recognition [5], and natural language processing [6]. In vision tasks, this technology has even surpassed human abilities [7–9]. DNN technology has attracted intense research interest in almost every field that involves advanced data analytics and intelligent control. Consequently, DNNs are now widely used in complex mission-critical systems, such as autonomous vehicles [10] and healthcare [11]. Other fields, such as NASA's space applications, are also vigorously pursuing this efficient paradigm [12,13].
Artificial Neural Networks (ANNs) are widely considered tolerant to transient faults (i.e., soft errors). This belief rests on several observations commonly claimed to be facts. First, because ANNs contain a large number of neurons (thousands to millions), more than required to accomplish a specific task [14], they can still serve their overall purpose even if some of the neurons stop working. Second, since ANNs mimic the human brain, which can tolerate some level of neuron faults or even use noise as a source of computation [15], ANNs are assumed to inherit this intrinsic fault tolerance of biological systems [16]. Third, maximizing performance is the main issue that dominates research work in the field of AI [17]. Fourth, the learning capability of ANNs during and after the training process seems to indicate that ANNs have a
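The first of these observations, that redundant neurons mask individual faults, can be made concrete with a toy experiment. The sketch below is purely illustrative and not drawn from any of the reviewed works: the network shape, the random weights, and the flip_bit helper are our own assumptions. It flips a single bit in one float32 weight of a small perceptron, mimicking a single-event upset in weight memory, and checks whether the predicted class changes.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(x, w1, w2):
    # Tiny two-layer perceptron; the prediction is the argmax class.
    return int(np.argmax(w2 @ relu(w1 @ x)))

def flip_bit(value, bit):
    # Flip one bit of a float32 value (models a single-event upset).
    raw = int.from_bytes(np.float32(value).tobytes(), "little")
    raw ^= 1 << bit
    return np.frombuffer(raw.to_bytes(4, "little"), dtype=np.float32)[0]

rng = np.random.default_rng(0)
w1 = rng.standard_normal((8, 4)).astype(np.float32)   # hidden layer weights
w2 = rng.standard_normal((3, 8)).astype(np.float32)   # output layer weights
x = rng.standard_normal(4).astype(np.float32)         # one input sample

baseline = forward(x, w1, w2)

# Inject a low-order mantissa bit flip into a single weight and
# compare the predicted class with the fault-free baseline.
w1_faulty = w1.copy()
w1_faulty[0, 0] = flip_bit(w1[0, 0], bit=2)
same = forward(x, w1_faulty, w2) == baseline
print("prediction unchanged after low-bit flip:", same)
```

In this kind of experiment, a flip in a low mantissa bit perturbs the weight by a negligible amount and is usually masked by the argmax, whereas a flip in a high exponent bit can inflate the weight by orders of magnitude and corrupt the output, which is one reason the criticality of a soft error depends heavily on where the bit lands.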
* Corresponding author.
E-mail address: wanghaibin@hhuc.edu.cn (H. Wang).
https://doi.org/10.1016/j.microrel.2020.113969
Received 11 November 2019; Received in revised form 22 September 2020; Accepted 12 October 2020