CRSPU: Exploit Commonality of Regular Sparsity
to Support Various Convolutions on Systolic Arrays
Jianchao Yang, Mei Wen, Junzhong Shen, Yasong Cao, Minjin Tang, Renyu Yang, Xin Ju and Chunyuan Zhang
College of Computer, National University of Defense Technology, Changsha, China
{yangjianchao16, meiwen, shenjunzhong, caoyasong, tangminjin14, yangrenyu, jx, cyzhang}@nudt.edu.cn
Abstract—Dilated convolution (DCONV) and transposed convolution (TCONV) are involved in the training of GANs and CNNs and introduce numerous regular zero-spaces into the feature maps or kernels. Existing accelerators typically pre-reorganize the zero-spaces and then perform sparse computation to accelerate them, resulting in huge hardware resource overhead
and control complexity. While the systolic array has proven
advantages when it comes to accelerating convolutions, counter-
measures for deploying DCONV and TCONV on systolic arrays
are rarely proposed. Therefore, we opt to improve the traditional
im2col algorithm to make full use of the regular sparsity and
avoid data reorganization, thereby facilitating the use of systolic
arrays in this context. Public Dimension Compression and Similar Sparsity Merging mechanisms are also designed to implement sparse computing, eliminating the unnecessary computation caused by zero-spaces. We propose a systolic array-based processing
unit, named CRSPU. Experiments show that CRSPU exhibits
more competitive performance than the state-of-the-art baseline
accelerator GANPU. Furthermore, CRSPU’s ability to avoid
zero-space data reorganization represents a huge advantage for
bandwidth-unfriendly accelerators.
Index Terms—convolutions, im2col, systolic array
I. INTRODUCTION
Convolutional neural networks (CNNs) and generative ad-
versarial networks (GANs) have been widely deployed in
the fields of image classification, image super-resolution, and
video prediction. The kernel operation during the procedure of
inference and training of CNNs and GANs will unavoidably
involve convolution (CONV), dilated convolution (DCONV)
and transposed convolution (TCONV), as detailed in TA-
BLE I. Different from the downsampling CONV, DCONV
and TCONV insert large numbers of zeros into the feature
map and the convolving kernel (see Fig. 1), thereby realizing
upsampling and increasing the receptive field size in cost-
efficient ways respectively. Notably, these three types of con-
volution require complicated computation and large amounts
of memory, resulting in significant resource overhead and
power consumption. However, the large numbers of inserted
zeros involved in DCONV and TCONV further aggravate this
problem. Previous accelerators have realized the acceleration
of DCONV and TCONV by supporting sparse computation.
The CNN accelerator SIGMA [1] and the GAN accelerators FlexiGAN [2] and Kn2row [3] all support sparse computation, realizing zero-skipping by inserting zeros into the input feature map or kernel in advance.
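The regular zero insertion behind these two operators can be sketched in a few lines of Python (a minimal 2-D, single-channel sketch over square inputs; the function names are illustrative, not part of any accelerator API):

```python
def dilate_kernel(kernel, rate):
    """DCONV: insert (rate - 1) zeros between adjacent kernel weights."""
    k = len(kernel)
    size = (k - 1) * rate + 1
    out = [[0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            out[i * rate][j * rate] = kernel[i][j]
    return out

def upsample_fmap(fmap, stride):
    """TCONV: insert (stride - 1) zeros between adjacent feature-map pixels."""
    h = len(fmap)
    size = (h - 1) * stride + 1
    out = [[0] * size for _ in range(size)]
    for i in range(h):
        for j in range(h):
            out[i * stride][j * stride] = fmap[i][j]
    return out
```

For example, `dilate_kernel([[1, 2], [3, 4]], 2)` yields `[[1, 0, 2], [0, 0, 0], [3, 0, 4]]`. The zero positions are fully determined by the dilation rate or stride, which is exactly the regular sparsity that this work exploits.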
* Supported by the National Natural Science Foundation of China under Grant Nos. 61802420 and 62002366.
Corresponding Author.
However, hardware that supports zero-skipping generally requires indexes, and the complexity of the associated data preprocessing inevitably introduces resource and power overheads [4]. In addition, some GAN accelerators [2], [3], [5], [6] do not fully exploit prior knowledge of the regular sparsity (RS; sparsity introduced by regularly inserted zeros) of DCONV and TCONV. Moreover, perceptual zero-skipping greatly increases computational delay. DT-CNN [6] and GANPU [7] perform imprecise computation by skipping multiply-accumulate operations (MACs) whose input or output feature map (IMP or OMP) values are predicted to be zero, which reduces inference accuracy. The
cold buffer of GNA [8] is used to handle the overlap of partial
sums without a zero-skipping mechanism, which not only
leads to higher hardware overhead during data preprocessing,
but also increases the complexity of control. TDC [9] attempts
to convert TCONV into CONV, but the insertion of zeros in
the weight blocks leads to unbalanced calculation load. F-
DNA [10] requires more complex overall logic than TDC to
eliminate the unbalanced calculation load. In addition, due to
the complexity of zero-skipping logic, most GAN accelerators
[3], [6], [11], [12] have no or only partial data multiplexing in
their PE (processing element) arrays, with some even adopting
broadcast mode [6], [11], [12], resulting in dramatically in-
creased bandwidth requirements and low PE utilization. Most
importantly, the zero pre-insertion computing mode requires
users to be familiar with the underlying algorithm, which
makes it difficult to build a complete accelerator ecosystem.
In essence, these GAN accelerators cannot make full use of data multiplexing or improve PE utilization, because they mostly adopt direct convolution to accelerate DCONV and TCONV.
CNN accelerators [1] have effectively shown that converting CONV into GEMM through im2col and mapping it onto systolic arrays can reduce the bandwidth and memory resources required. Moreover, systolic arrays increase the on-chip residence time of the data, thereby increasing PE utilization. However, because im2col+GEMM is tightly coupled to the data flow, zero-skipping computation is not suitable for direct mapping onto a systolic array; in addition, simply designing three sets of hardware to support CONV, DCONV and TCONV respectively results in seriously inefficient resource utilization.
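For reference, the im2col+GEMM lowering that such designs build on can be sketched as follows (a minimal single-channel, stride-1, no-padding Python sketch; the function names are illustrative):

```python
def im2col(fmap, k):
    """Unfold each k x k patch of a square 2-D feature map into one matrix row."""
    h = len(fmap)
    out_h = h - k + 1  # output size for stride 1, no padding
    rows = []
    for i in range(out_h):
        for j in range(out_h):
            rows.append([fmap[i + di][j + dj]
                         for di in range(k) for dj in range(k)])
    return rows  # shape: (out_h * out_h) x (k * k)

def conv_as_gemm(fmap, kernel):
    """CONV expressed as a matrix-vector product on the im2col matrix."""
    k = len(kernel)
    flat_k = [kernel[i][j] for i in range(k) for j in range(k)]
    return [sum(a * b for a, b in zip(row, flat_k))
            for row in im2col(fmap, k)]
```

Because every im2col row reuses most of the previous row's pixels, the lowered GEMM streams each operand through the array once while it is reused many times, which is the data-multiplexing property the systolic mapping relies on.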
Therefore, it is necessary and urgent to combine the RS
characteristics of the three convolutions, the high data mul-
tiplexing of the systolic array, and the optimization of implicit
im2col for collaborative design. Our main contributions can
2023 Design, Automation & Test in Europe Conference (DATE 2023)
978-3-9819263-7-8/DATE23/© 2023 EDAA