GoSPA: An Energy-efficient High-performance Globally Optimized SParse Convolutional Neural Network Accelerator*
Chunhua Deng
Department of ECE
Rutgers University - New Brunswick
Piscataway, NJ, USA
chunhua.deng@rutgers.edu
Yang Sui
Department of ECE
Rutgers University - New Brunswick
Piscataway, NJ, USA
yang.sui@rutgers.edu
Siyu Liao+
Amazon Research
Seattle, WA, USA
liasiyu@amazon.com
Xuehai Qian
Department of Computer Science
University of Southern California
Los Angeles, CA, USA
xuehai.qian@usc.edu
Bo Yuan
Department of ECE
Rutgers University - New Brunswick
Piscataway, NJ, USA
bo.yuan@soe.rutgers.edu
Abstract—The co-existence of activation sparsity and model sparsity in convolutional neural network (CNN) models makes sparsity-aware CNN hardware designs very attractive. Existing sparse CNN accelerators use an intersection operation to identify the positions of matched entries between two sparse vectors and thereby avoid unnecessary computations. However, these state-of-the-art designs still suffer from three major architecture-level drawbacks: 1) the hardware cost of the intersection operation is high; 2) the computation phase stalls frequently because of the strong data dependency between the intersection and computation phases; and 3) the explicit intersection operation incurs unnecessary data transfers.
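For illustration, the following is a minimal Python sketch (an assumption of this write-up, not any specific accelerator's dataflow) of the explicit intersection step described above: the two sparse vectors are stored in an index-value compressed form, and only positions present in both index sets trigger multiplications.

```python
def intersect_and_multiply(weights, activations):
    """Explicit sparse-sparse intersection (illustrative only).

    Both operands are dicts mapping position -> non-zero value, an
    illustrative compressed format rather than a real accelerator's layout.
    Only positions present in BOTH operands produce a multiplication,
    so all zero products are skipped.
    """
    matched = weights.keys() & activations.keys()   # the explicit intersection step
    return {pos: weights[pos] * activations[pos] for pos in matched}

# Example: sparse vectors over positions 0..7.
w = {1: 0.5, 4: -2.0, 6: 1.5}   # non-zero weights
a = {1: 3.0, 3: 7.0, 6: -1.0}   # non-zero activations
print(intersect_and_multiply(w, a))   # {1: 1.5, 6: -1.5}
```

The separate search over the two index sets corresponds to the intersection hardware cost and the intersection-to-computation data dependency listed among the drawbacks above.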
By leveraging the knowledge of the complete sparse 2-D convolution, this paper proposes two key ideas that overcome all three drawbacks. First, an implicit on-the-fly intersection is proposed to realize the optimal solution for the intersection between one static stream and one dynamic stream, which is exactly the case in sparse neural network inference. Second, by leveraging the global computation structure of 2-D convolution, we propose a specialized computation reordering that ensures each activation is transferred only when necessary and only once.
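As a rough software analogy (a sketch under this write-up's own assumptions, not GoSPA's actual hardware logic), the first idea can be pictured as follows: the static operand (the pruned weights, fixed before inference) is kept in compressed index form, while the dynamic operand (the activations) arrives as a runtime stream and is checked against the static index set on the fly, so no separate intersection phase is required.

```python
def on_the_fly_intersection(static_weights, activation_stream):
    """Conceptual sketch: fold intersection into the compute loop.

    static_weights:    dict position -> weight value, fixed after pruning.
    activation_stream: iterable of (position, value) pairs produced at runtime.
    Each arriving activation is consumed immediately if its position hits the
    static index set; zero activations are simply dropped on arrival.
    """
    products = {}
    for pos, act in activation_stream:
        if act == 0:
            continue                      # dynamic sparsity: skip zeros as they arrive
        w = static_weights.get(pos)
        if w is not None:                 # implicit intersection with the static operand
            products[pos] = w * act
    return products
```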
Based on these two key ideas, we develop GoSPA, an energy-efficient high-performance Globally Optimized SParse CNN Accelerator. GoSPA is implemented in 28nm CMOS technology. Compared with the state-of-the-art sparse CNN architecture, GoSPA achieves average speedups of 1.38×, 1.28×, 1.23×, 1.17×, 1.21× and 1.28× on AlexNet, VGG, GoogLeNet, MobileNet, ResNet and ResNeXt workloads, respectively. GoSPA also achieves 5.38×, 4.96×, 4.79×, 5.02×, 4.86× and 2.06× energy efficiency improvements on AlexNet, VGG, GoogLeNet, MobileNet, ResNet and ResNeXt, respectively. In a more comprehensive comparison that includes DRAM accesses, GoSPA also shows significant performance improvement over existing designs.
Index Terms—CNN, Hardware Accelerator, ASIC, Sparse,
Convolution
*This work is supported by National Science Foundation (NSF) awards CCF-1937403, CCF-1750656, and CCF-1919289.
+The work was done while the author was at Rutgers University.
I. INTRODUCTION
Convolutional Neural Networks (CNNs) have achieved unprecedented success in many artificial intelligence (AI) tasks, such as image classification, object detection, and video analysis. Because their computation is based on 2-D convolutions over multiple large activation maps, CNNs are computation- and storage-intensive and incur a large amount of data movement. To accelerate the execution of CNNs, especially in the inference phase, designing domain-specific CNN hardware accelerators has become a promising solution, because customized design methodologies bring significant improvements in speed, throughput, and energy efficiency. To date, many CNN accelerator designs have been proposed and reported in academic papers [2]–[6], [9]–[18], [20]–[24], [26]–[31], [33], [34], [37]–[44], [46]–[54], [57]–[60]. Meanwhile, CNN hardware accelerators, especially inference-only ones, are also actively investigated by many startup companies because of the huge market for low-power embedded vision.
Among the existing types of CNN accelerators, sparsity-aware designs are particularly important and attractive because they deliver much higher energy efficiency and processing throughput than non-sparsity-aware ones. Motivated by these benefits, several sparse CNN hardware architectures [1], [3], [7], [18], [32], [36], [60] have been developed recently. Among them, SCNN [36] introduces the first dataflow that exploits both activation and weight sparsity, thereby achieving high hardware performance.
However, SCNN is not optimal since it incurs architecturally wasted multiplications and unnecessary data transfers. To realize the convolution between a sparse kernel and a sparse activation map, SCNN first performs multiplications among all the non-zero weights and non-zero activations