GoSPA: An Energy-efficient High-performance Globally Optimized SParse Convolutional Neural Network Accelerator*

Chunhua Deng, Department of ECE, Rutgers University - New Brunswick, Piscataway, NJ, USA, chunhua.deng@rutgers.edu
Yang Sui, Department of ECE, Rutgers University - New Brunswick, Piscataway, NJ, USA, yang.sui@rutgers.edu
Siyu Liao+, Amazon Research, Seattle, WA, USA, liasiyu@amazon.com
Xuehai Qian, Department of Computer Science, University of Southern California, Los Angeles, CA, USA, xuehai.qian@usc.edu
Bo Yuan, Department of ECE, Rutgers University - New Brunswick, Piscataway, NJ, USA, bo.yuan@soe.rutgers.edu

Abstract—The co-existence of activation sparsity and model sparsity in convolutional neural network (CNN) models makes sparsity-aware CNN hardware designs very attractive. The existing sparse CNN accelerators utilize an intersection operation to search for and identify the positions of the matched entries between two sparse vectors, and hence avoid unnecessary computations. However, these state-of-the-art designs still suffer from three major architecture-level drawbacks: 1) the hardware cost of the intersection operation is high; 2) the computation phase stalls frequently due to the strong data dependency between the intersection and computation phases; and 3) the explicit intersection operation incurs unnecessary data transfer. By leveraging knowledge of the complete sparse 2-D convolution, this paper proposes two key ideas that overcome all three drawbacks. First, an implicit on-the-fly intersection is proposed to realize the optimal solution for intersection between one static stream and one dynamic stream, which is the case for sparse neural network inference. Second, by leveraging the global computation structure of 2-D convolution, we propose a specialized computation reordering to ensure that each activation is transferred only if necessary and only once. Based on these two key ideas, we develop GoSPA, an energy-efficient high-performance Globally Optimized SParse CNN Accelerator. GoSPA is implemented with CMOS 28nm technology. Compared with the state-of-the-art sparse CNN architecture, GoSPA achieves average 1.38×, 1.28×, 1.23×, 1.17×, 1.21× and 1.28× speedup on AlexNet, VGG, GoogLeNet, MobileNet, ResNet and ResNeXt workloads, respectively. Also, GoSPA achieves 5.38×, 4.96×, 4.79×, 5.02×, 4.86× and 2.06× energy efficiency improvement on AlexNet, VGG, GoogLeNet, MobileNet, ResNet and ResNeXt, respectively. In a more comprehensive comparison including DRAM access, GoSPA also shows significant performance improvement over the existing designs.

Index Terms—CNN, Hardware Accelerator, ASIC, Sparse, Convolution

* This work is supported by National Science Foundation (NSF) awards CCF-1937403, CCF-1750656, and CCF-1919289.
+ The work was done while the author was at Rutgers University.

I. INTRODUCTION

Convolutional Neural Networks (CNNs) have achieved unprecedented success in many artificial intelligence (AI) tasks, such as image classification, object detection, and video analysis. Because their computation is based on 2-D convolution over multiple large-size activation maps, CNNs are very computation and storage intensive and incur a large amount of data movement. To accelerate the execution of CNNs, especially for the inference phase, designing domain-specific CNN hardware accelerators has become a promising solution because of the significant improvements in speed, throughput and energy efficiency enabled by customized design methodology.
To date, many CNN accelerator designs have been proposed and reported in academic papers [2]–[6], [9]–[18], [20]–[24], [26]–[31], [33], [34], [37]–[44], [46]–[54], [57]–[60]. Meanwhile, CNN hardware accelerators, especially inference-only ones, are also being actively investigated by many startup companies because of the huge market for low-power embedded vision.

Among the several types of existing CNN accelerators, the sparsity-aware designs are particularly important and attractive because of their much higher energy efficiency and processing throughput compared to the non-sparsity-aware ones. Motivated by these benefits, several sparse CNN hardware architectures [1], [3], [7], [18], [32], [36], [60] have been developed and proposed recently. Among them, SCNN [36] is the first design whose dataflow considers both activation and weight sparsity, thereby achieving high hardware performance. However, SCNN is not optimal since it incurs architecturally wasted multiplications and unnecessary data transfer. To realize the convolution between a sparse kernel and a sparse activation map, SCNN first performs multiplications among all the non-zero weights and non-zero activations.
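To make the intersection operation referenced above concrete, the following Python sketch is a simplified software analogue written for this discussion; it is not GoSPA's implicit on-the-fly scheme nor the exact dataflow of SCNN or any other accelerator, and the data layout (sorted lists of (index, value) pairs) is an assumption for illustration only. It contrasts a two-pointer intersection, which multiplies only at matched positions, with an all-to-all (Cartesian-product-style) approach that generates every cross product and defers coordinate matching to a later stage.

```python
# Illustrative sketch only: a software analogue of explicit sparse-vector
# intersection versus all-to-all multiplication. Sparse vectors are assumed
# to be stored as (index, value) pairs with indices in ascending order.

def intersect_multiply(weights, activations):
    """Multiply only the entries whose indices match (two-pointer intersection)."""
    products = []
    i, j = 0, 0
    while i < len(weights) and j < len(activations):
        wi, wv = weights[i]
        aj, av = activations[j]
        if wi == aj:
            products.append((wi, wv * av))   # matched position: useful multiplication
            i += 1
            j += 1
        elif wi < aj:
            i += 1                           # skip unmatched weight entry
        else:
            j += 1                           # skip unmatched activation entry
    return products

def cartesian_multiply(weights, activations):
    """All-to-all multiplication of every non-zero weight with every non-zero
    activation; coordinate matching/scattering is deferred to a later stage."""
    return [(wi, ai, wv * av) for wi, wv in weights for ai, av in activations]

# Example: a sparse weight vector and a sparse activation vector.
w = [(0, 2.0), (3, -1.0), (7, 0.5)]
a = [(3, 4.0), (5, 1.0), (7, 2.0)]
print(intersect_multiply(w, a))        # [(3, -4.0), (7, 1.0)] -- 2 multiplications
print(len(cartesian_multiply(w, a)))   # 9 multiplications before any matching
```

In this toy setting the intersection performs only the two useful multiplications, while the all-to-all approach performs nine; the drawbacks listed in the abstract stem from how such an explicit intersection is realized in hardware (its cost, its serialization with the compute phase, and the data transfer it triggers), which is what GoSPA's implicit on-the-fly intersection and computation reordering are designed to avoid.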