TensorDash: Exploiting Sparsity to Accelerate Deep Neural Network Training and Inference

Mostafa Mahmoud 1, Isak Edo 1, Ali Hadi Zadeh 1, Omar Mohamed Awad 1, Gennady Pekhimenko 1,3, Jorge Albericio 2 and Andreas Moshovos 1,3
1. University of Toronto, 2. Cerebras Systems, 3. Vector Institute
{mostafa.mahmoud, isak.edo, a.hadizadeh, omar.awad}@mail.utoronto.ca, pekhimenko@cs.toronto.edu, jorge@cerebras.net, moshovos@ece.utoronto.ca

Abstract: TensorDash is a hardware-level technique for enabling data-parallel MAC units to take advantage of sparsity in their input operand streams. When used to compose a hardware accelerator for deep learning, TensorDash can speed up the training process while also increasing energy efficiency. TensorDash combines a low-cost, sparse input operand interconnect, comprising an 8-input multiplexer per multiplier input, with an area-efficient hardware scheduler. While the interconnect allows a very limited set of movements per operand, the scheduler can effectively extract sparsity when it is present in the activations, weights, or gradients of neural networks. Over a wide set of models covering various applications, TensorDash accelerates the training process by 1.95× while being 1.89× more energy efficient, and 1.6× more energy efficient when on-chip and off-chip memory accesses are taken into account. While TensorDash works with any datatype, we demonstrate it with both single-precision floating-point and bfloat16 units.

1 INTRODUCTION

Neural networks are being used in an ever-increasing number of application domains, delivering state-of-the-art results. Given their high computation and memory demands and their increasing importance, considerable attention has been given to techniques for optimizing implementations at all system levels, all the way down to specialized hardware.
Whereas a decade ago the then state-of-the-art neural networks could be trained on a commodity server within a few hours, today training the best neural network models has become an exascale-class problem [1]. State-of-the-art neural networks now require many graphics processors or specialized accelerators such as the TPU [2] to be trained within practical time limits. Tuning neural networks for best inference performance further exacerbates the cost of training. Beyond the cost of acquiring or getting access to such expensive computing resources, worse still are the operating costs and the environmental impact of training. Strubell et al. report that the CO2 emissions of training even a mid-class neural network stand at about 36 metric tons, which is more than double the estimated 16.5 metric tons needed on average per person per year in the US [3]. Training neural networks at the "edge" is needed in certain applications, for example to refine an existing model with user-specific information and input. While the trade-offs for edge devices differ from those in data centers or desktop applications, the need remains the same: reduce execution time and improve energy efficiency, albeit under different constraints. It comes then as no surprise that efforts to reduce the execution time and the energy cost of training have been considerable. First and foremost, by exploiting model, data, and pipeline parallelism, distributed training partitions the training workload across several computing nodes to reduce overall latency [4], [5], [6]. Intra- and inter-node data blocking, reuse, and the overlapping of communication and computation orchestrate the use of the computing, memory hierarchy, and communication resources to improve performance and energy efficiency [7], [8], [9]. Lossless and lossy compression reduces the footprint of the vast amounts of data processed during training [10].
While training originally used double-precision floating-point data and arithmetic, more compact datatypes reduce overall data volumes and computation costs. These include single-precision floating-point, bfloat16 [11], [12], [13], dynamic floating-point [14], and flexpoint [15]. Mixed-datatype methods further reduce costs by performing most computations using fixed-point and only a few using some form of floating-point [14], [16], [17], [18]. Other methods use low-precision arithmetic [19]. Even with these techniques, training remains an exascale-class problem and further improvements are needed. Accordingly, in this work we propose TensorDash, a technique for further improving the execution time and energy efficiency of training. TensorDash exploits ineffectual operations that occur naturally for many models during training. The bulk of the energy during training is due to the transfers and computations needed to perform multiply-accumulate operations (MACs). We find that often one of the operands in these MACs is zero. These operations can be safely eliminated as they do not affect the values produced during training, and thus do not affect convergence or final accuracy. We find that for many networks a large number of zeros naturally occur in the activation values during the forward and backward passes, and in the gradients during the backward pass (see Section 2.1 for a primer on training). When sparsity exists, it represents an opportunity for improving performance and energy efficiency. Accordingly, we seek to develop a method that will do so when sparsity exists and that will not hurt performance and energy efficiency otherwise.
arXiv:2009.00748v1 [cs.AR] 1 Sep 2020
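To make the notion of ineffectual MACs concrete, the following is a minimal NumPy sketch (not code from the paper) illustrating why zero-operand MACs can be skipped without changing the result: a dot product over only the nonzero activations is numerically the same as the dense one, while the count of "effectual" MACs can be much smaller. The function names and the ReLU-style sparsity pattern are illustrative assumptions.

```python
import numpy as np

def dense_macs(a, w):
    """Number of MACs a dense datapath would perform for a dot product."""
    assert a.shape == w.shape
    return a.size

def effectual_macs(a, w):
    """MACs where neither operand is zero; only these affect the sum."""
    return int(np.count_nonzero((a != 0) & (w != 0)))

# Activations after ReLU are often sparse (many exact zeros) -- this is
# the kind of sparsity that arises naturally during training.
rng = np.random.default_rng(0)
acts = np.maximum(rng.normal(size=1024), 0.0)   # roughly half exact zeros
wts = rng.normal(size=1024)

total = dense_macs(acts, wts)
useful = effectual_macs(acts, wts)

# Skipping the zero-operand MACs leaves the dot product unchanged
# (up to floating-point summation order).
full = float(acts @ wts)
skipped = float(acts[acts != 0] @ wts[acts != 0])
```

A zero-skipping datapath, like the one TensorDash builds, performs only the `useful` MACs while producing the same `full` result; the savings scale with the sparsity of the operand streams.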