TensorDash: Exploiting Sparsity to Accelerate Deep Neural Network Training and Inference

Mostafa Mahmoud 1, Isak Edo 1, Ali Hadi Zadeh 1, Omar Mohamed Awad 1, Gennady Pekhimenko 1,3, Jorge Albericio 2 and Andreas Moshovos 1,3
1. University of Toronto, 2. Cerebras Systems, 3. Vector Institute
{mostafa.mahmoud, isak.edo, a.hadizadeh, omar.awad}@mail.utoronto.ca, pekhimenko@cs.toronto.edu, jorge@cerebras.net, moshovos@ece.utoronto.ca

Abstract: TensorDash is a hardware-level technique for enabling data-parallel MAC units to take advantage of sparsity in their input operand streams. When used to compose a hardware accelerator for deep learning, TensorDash can speed up the training process while also increasing energy efficiency. TensorDash combines a low-cost, sparse input operand interconnect, comprising an 8-input multiplexer per multiplier input, with an area-efficient hardware scheduler. While the interconnect allows a very limited set of movements per operand, the scheduler can effectively extract sparsity when it is present in the activations, weights, or gradients of neural networks. Over a wide set of models covering various applications, TensorDash accelerates the training process by 1.95× while being 1.89× more energy efficient, and 1.6× more energy efficient when on-chip and off-chip memory accesses are taken into account. While TensorDash works with any datatype, we demonstrate it with both single-precision floating-point and bfloat16 units.

1 INTRODUCTION

Neural networks are being used in an ever-increasing number of application domains, delivering state-of-the-art results. Given their high computation and memory demands and their increasing importance, considerable attention has been given to techniques for optimizing implementations at all system levels, all the way down to specialized hardware.
Whereas a decade ago the then state-of-the-art neural networks could be trained on a commodity server within a few hours, today training the best neural network models has become an exascale-class problem [1]. State-of-the-art neural networks now require many graphics processors or specialized accelerators such as the TPU [2] to be trained within practical time limits. Tuning neural networks for best inference performance further exacerbates the cost of training. Beyond the cost of acquiring or getting access to such expensive computing resources, worse still are the operating costs and the environmental impact of training. Strubell et al. report that the CO2 emissions of training even a mid-class neural network stand at about 36 metric tons, which is more than double the estimated 16.5 metric tons needed on average per person per year in the US [3]. Training neural networks at the "edge" is needed in certain applications, for example to refine an existing model with user-specific information and input. While the trade-offs for edge devices differ from those in data centers or desktop applications, the need remains the same: reduce execution time and improve energy efficiency, albeit under different constraints. It comes then as no surprise that efforts to reduce the execution time and the energy cost of training have been considerable. First and foremost, by exploiting model, data, and pipeline parallelism, distributed training partitions the training workload across several computing nodes to reduce overall latency [4], [5], [6]. Intra- and inter-node data blocking, reuse, and the overlapping of communication and computation orchestrate the use of the computing, memory hierarchy, and communication resources to improve performance and energy efficiency [7], [8], [9]. Lossless and lossy compression reduces the footprint of the vast amounts of data processed during training [10].
While training originally used double-precision floating-point data and arithmetic, more compact datatypes reduce overall data volumes and computation costs. These include single-precision floating-point, bfloat16 [11], [12], [13], dynamic floating-point [14], and flexpoint [15]. Mixed-datatype methods further reduce costs by performing most computations using fixed-point and only a few using some form of floating-point [14], [16], [17], [18]. Other methods use low-precision arithmetic [19]. Even with these techniques, training remains an exascale-class problem and further improvements are needed. Accordingly, in this work we propose TensorDash, a technique for further improving the execution time and energy efficiency of training. TensorDash exploits ineffectual operations that occur naturally for many models during training. The bulk of the energy during training is due to the transfers and computations needed to perform multiply-accumulate operations (MACs). We find that often one of the operands in these MACs is zero. These operations can be safely eliminated as they do not affect the values produced during training, and thus do not affect convergence or final accuracy. We find that for many networks a large number of zeros naturally occur in the activation values during the forward and backward passes, and in the gradients during the backward pass (see Section 2.1 for a primer on training). When sparsity exists, it represents an opportunity for improving performance and energy efficiency. Accordingly, we seek to develop a method that will do so when sparsity exists and that will not hurt performance and energy efficiency otherwise.
arXiv:2009.00748v1 [cs.AR] 1 Sep 2020
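To make the notion of ineffectual MACs concrete, the following is a minimal NumPy sketch (not code from the paper) illustrating why zero-operand MACs can be skipped without changing the result: a dot product over only the nonzero activations is numerically the same as the dense one, while the count of "effectual" MACs can be much smaller. The function names and the ReLU-style sparsity pattern are illustrative assumptions.

```python
import numpy as np

def dense_macs(a, w):
    """Number of MACs a dense datapath would perform for a dot product."""
    assert a.shape == w.shape
    return a.size

def effectual_macs(a, w):
    """MACs where neither operand is zero; only these affect the sum."""
    return int(np.count_nonzero((a != 0) & (w != 0)))

# Activations after ReLU are often sparse (many exact zeros) -- this is
# the kind of sparsity that arises naturally during training.
rng = np.random.default_rng(0)
acts = np.maximum(rng.normal(size=1024), 0.0)   # roughly half exact zeros
wts = rng.normal(size=1024)

total = dense_macs(acts, wts)
useful = effectual_macs(acts, wts)

# Skipping the zero-operand MACs leaves the dot product unchanged
# (up to floating-point summation order).
full = float(acts @ wts)
skipped = float(acts[acts != 0] @ wts[acts != 0])
```

A zero-skipping datapath, like the one TensorDash builds, performs only the `useful` MACs while producing the same `full` result; the savings scale with the sparsity of the operand streams.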