CRSPU: Exploit Commonality of Regular Sparsity
to Support Various Convolutions on Systolic Arrays
Jianchao Yang, Mei Wen, Junzhong Shen, Yasong Cao, Minjin Tang, Renyu Yang, Xin Ju and Chunyuan Zhang
College of Computer, National University of Defense Technology, Changsha, China
{yangjianchao16, meiwen, shenjunzhong, caoyasong, tangminjin14, yangrenyu, jx, cyzhang}@nudt.edu.cn
Abstract—Dilated convolution (DCONV) and transposed convolution (TCONV) are involved in the training of GANs and CNNs and introduce numerous regular zero-spaces into the feature maps or kernels. Existing accelerators typically pre-reorganize the zero-spaces and then perform sparse computation to accelerate them, resulting in huge hardware resource overhead
and control complexity. While the systolic array has proven
advantages when it comes to accelerating convolutions, counter-
measures for deploying DCONV and TCONV on systolic arrays
are rarely proposed. Therefore, we opt to improve the traditional
im2col algorithm to make full use of the regular sparsity and
avoid data reorganization, thereby facilitating the use of systolic
arrays in this context. Public Dimension Compression and Similar Sparsity Merging mechanisms are also designed to implement sparse computing, eliminating the unnecessary computation caused by zero-spaces. We propose a systolic array-based processing
unit, named CRSPU. Experiments show that CRSPU exhibits
more competitive performance than the state-of-the-art baseline
accelerator GANPU. Furthermore, CRSPU’s ability to avoid
zero-space data reorganization represents a huge advantage for
bandwidth-unfriendly accelerators.
Index Terms—convolutions, im2col, systolic array
I. INTRODUCTION
Convolutional neural networks (CNNs) and generative ad-
versarial networks (GANs) have been widely deployed in
the fields of image classification, image super-resolution, and
video prediction. The kernel operation during the procedure of
inference and training of CNNs and GANs will unavoidably
involve convolution (CONV), dilated convolution (DCONV)
and transposed convolution (TCONV), as detailed in TA-
BLE I. Different from the downsampling CONV, DCONV
and TCONV insert large numbers of zeros into the feature
map and the convolving kernel (see Fig. 1), thereby realizing
upsampling and increasing the receptive field size in cost-
efficient ways respectively. Notably, these three types of con-
volution require complicated computation and large amounts
of memory, resulting in significant resource overhead and
power consumption. However, the large numbers of inserted
zeros involved in DCONV and TCONV further aggravate this
problem. Previous accelerators have realized the acceleration
of DCONV and TCONV by supporting sparse computation.
The CNN accelerator SIGMA [1] and the GAN accelerators FlexiGAN [2] and Kn2row [3] all support sparse computation, realizing zero-skipping by inserting zeros into the input feature map or kernel in advance.
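The regular zero insertion behind these two operators can be sketched in a few lines of Python (a minimal 2-D, single-channel sketch over square inputs; the function names are illustrative, not part of any accelerator API):

```python
def dilate_kernel(kernel, rate):
    """DCONV: insert (rate - 1) zeros between adjacent kernel weights."""
    k = len(kernel)
    size = (k - 1) * rate + 1
    out = [[0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            out[i * rate][j * rate] = kernel[i][j]
    return out

def upsample_fmap(fmap, stride):
    """TCONV: insert (stride - 1) zeros between adjacent feature-map pixels."""
    h = len(fmap)
    size = (h - 1) * stride + 1
    out = [[0] * size for _ in range(size)]
    for i in range(h):
        for j in range(h):
            out[i * stride][j * stride] = fmap[i][j]
    return out
```

For example, `dilate_kernel([[1, 2], [3, 4]], 2)` yields `[[1, 0, 2], [0, 0, 0], [3, 0, 4]]`. The zero positions are fully determined by the dilation rate or stride, which is exactly the regular sparsity that this work exploits.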
* Supported by the National Natural Science Foundation of China under Grant Nos. 61802420 and 62002366.
Corresponding Author.
However, hardware that supports zero-skipping generally requires indexes, and the complexity of the associated data preprocessing inevitably introduces resource and power overheads [4]. In addition, some GAN accelerators [2], [3], [5], [6] do not fully exploit prior knowledge of the regular sparsity (RS; sparsity introduced by regularly inserted zeros) of DCONV and TCONV. Moreover, perceptual zero-skipping greatly increases computational delay. DT-CNN [6] and GANPU [7] perform imprecise computation by skipping multiply-accumulate operations (MACs) whose input or output feature map (IMP or OMP) values are predicted to be zero, which reduces inference accuracy. The
cold buffer of GNA [8] is used to handle the overlap of partial
sums without a zero-skipping mechanism, which not only
leads to higher hardware overhead during data preprocessing,
but also increases the complexity of control. TDC [9] attempts
to convert TCONV into CONV, but the insertion of zeros in
the weight blocks leads to unbalanced calculation load. F-
DNA [10] requires more complex overall logic than TDC to
eliminate the unbalanced calculation load. In addition, due to
the complexity of zero-skipping logic, most GAN accelerators
[3], [6], [11], [12] have no or only partial data multiplexing in
their PE (processing element) arrays, with some even adopting
broadcast mode [6], [11], [12], resulting in dramatically in-
creased bandwidth requirements and low PE utilization. Most
importantly, the zero pre-insertion computing mode requires
users to be familiar with the underlying algorithm, which
makes it difficult to build a complete accelerator ecosystem.
In essence, these GAN accelerators cannot make full use of data multiplexing or improve PE utilization, because they mostly adopt direct convolution to accelerate DCONV and TCONV.
CNN accelerators [1] have effectively shown that converting CONV into GEMM through im2col and mapping it onto systolic arrays can reduce the bandwidth and memory resources required. Moreover, systolic arrays increase the on-chip residence time of the data, thereby increasing PE utilization. However, because im2col+GEMM is tightly coupled to the data flow, zero-skipping computation is not suitable for direct mapping onto a systolic array; in addition, simply designing three sets of hardware to support CONV, DCONV and TCONV respectively results in seriously inefficient resource utilization.
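For reference, the im2col+GEMM lowering that such designs build on can be sketched as follows (a minimal single-channel, stride-1, no-padding Python sketch; the function names are illustrative):

```python
def im2col(fmap, k):
    """Unfold each k x k patch of a square 2-D feature map into one matrix row."""
    h = len(fmap)
    out_h = h - k + 1  # output size for stride 1, no padding
    rows = []
    for i in range(out_h):
        for j in range(out_h):
            rows.append([fmap[i + di][j + dj]
                         for di in range(k) for dj in range(k)])
    return rows  # shape: (out_h * out_h) x (k * k)

def conv_as_gemm(fmap, kernel):
    """CONV expressed as a matrix-vector product on the im2col matrix."""
    k = len(kernel)
    flat_k = [kernel[i][j] for i in range(k) for j in range(k)]
    return [sum(a * b for a, b in zip(row, flat_k))
            for row in im2col(fmap, k)]
```

Because every im2col row reuses most of the previous row's pixels, the lowered GEMM streams each operand through the array once while it is reused many times, which is the data-multiplexing property the systolic mapping relies on.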
Therefore, it is necessary and urgent to combine the RS
characteristics of the three convolutions, the high data mul-
tiplexing of the systolic array, and the optimization of implicit
im2col for collaborative design. Our main contributions can
2023 Design, Automation & Test in Europe Conference (DATE 2023)
978-3-9819263-7-8/DATE23/© 2023 EDAA