SkippyNN: An Embedded Stochastic-Computing Accelerator for Convolutional Neural Networks
Reza Hojabr 1, Kamyar Givaki 1, SM. Reza Tayaranian 1, Parsa Esfahanian 2, Ahmad Khonsari 1,2, Dara Rahmati 2, M. Hassan Najafi 3
1 School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran
2 School of Computer Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
3 School of Computing and Informatics, University of Louisiana at Lafayette, Lafayette, LA, USA
{r.hojabr,givakik,m.taiaranian}@ut.ac.ir, {parsa.esfahanian,ak,dara.rahmati}@ipm.ir, najafi@louisiana.edu

ABSTRACT
Employing convolutional neural networks (CNNs) in embedded devices calls for novel low-cost and energy-efficient CNN accelerators. Stochastic computing (SC) is a promising low-cost alternative to conventional binary implementations of CNNs. Despite this cost advantage, SC-based arithmetic units suffer from prohibitive execution times due to processing long bit-streams. In particular, multiplication, the main operation in convolution computation, is extremely time-consuming, which hampers the use of SC methods in designing embedded CNNs.

In this work, we propose a novel architecture, called SkippyNN, that reduces the computation time of SC-based multiplications in the convolutional layers of CNNs. Each convolution in a CNN is composed of numerous multiplications where each input value is multiplied by a weight vector. Once the result of the first multiplication is produced, the following multiplications can be performed by multiplying the input by the differences of the successive weights. Leveraging this property, we develop a differential Multiply-and-Accumulate unit, called DMAC, to reduce the time consumed by convolutions in SkippyNN. We evaluate the efficiency of SkippyNN using four modern CNNs. On average, SkippyNN offers 1.2x speedup and 2.7x energy saving compared to the binary implementation of CNN accelerators.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
DAC '19, June 2–6, 2019, Las Vegas, NV, USA
© 2019 Association for Computing Machinery. ACM ISBN 978-1-4503-6725-7/19/06. . . $15.00
https://doi.org/10.1145/3316781.3317911

1 INTRODUCTION
A wide range of applications based on convolutional neural networks (CNNs) are emerging in various areas of computer vision. In particular, employing CNNs in intelligent embedded devices interacting with real-world environments has led to the advent of efficient CNN accelerators. Limited computation resources and an inadequate power budget are two important challenges when applying neural networks to embedded devices. Customized hardware implementations have gained a lot of attention in recent years to tackle these challenges [5, 6, 25].

Recently, a handful of works have exploited stochastic computing (SC) [2] in designing low-cost CNN accelerators [4, 7, 11, 13, 14, 16, 19, 22, 23]. Compared to conventional binary implementations, SC-based implementations often offer lower power consumption, a smaller hardware area footprint, and a higher tolerance to soft errors (i.e., bit flips) [3]. In SC, each number X (interpreted as a probability P(x) in the range [0, 1]) is represented by a bit-stream in which the density of 1s denotes P(x) [3].
For instance, a binary number X = 0.101_2, interpreted as P(x) = 5/8, can be represented by the bit-stream S = 11101001, where the number of 1s in the bit-stream and the length of the bit-stream are five and eight, respectively. This bit-stream-based representation makes SC numbers more tolerant to soft errors than the conventional binary-radix representation: a single bit-flip in a binary representation (e.g., a flip of the most significant bit) may lead to a huge error, while a bit-flip in an SC bit-stream causes only a small change in the value. Simplicity of design is another important advantage: most arithmetic operations require extremely simple logic in SC. For instance, multiplication is performed using a single AND gate, which has a considerably lower hardware cost than a binary multiplier [2, 15]. Despite these benefits, SC-based operations face two important problems: low accuracy and long computation time [2]. Prior work showed that, due to the approximate nature of neural networks, CNN accelerators can be implemented with low-bitwidth binary arithmetic units at no accuracy loss [6, 21, 26]. Our observations further confirm that, similar to binary implementations, with long enough bit-streams SC-based units do not impose a considerable degradation on neural network accuracy. Nevertheless, there is still a great demand to decrease the computation time and to improve the energy efficiency of SC-based CNN accelerators.

In this work, we propose a novel SC-based architecture, SkippyNN, which aims at reducing the computation time of stochastic multiplications in the convolution kernel, as these operations constitute a substantial portion of the computation load in modern CNNs. Each convolution is composed of numerous multiplications where an input x_i is multiplied by the successive weights w_1, ..., w_k. The computation time of SC-based multiplications is proportional to the bit-stream length of the operands.
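The unipolar SC representation and AND-gate multiplication described above can be simulated in a few lines of Python (a minimal sketch; the function names are ours, for illustration only):

```python
import random

def to_bitstream(p, length, rng):
    """Unipolar SC encoding: each bit is 1 with probability p,
    so the density of 1s in the stream represents p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def sc_multiply(sa, sb):
    """SC multiplication of two independent unipolar streams is a
    bitwise AND: P(a AND b) = P(a) * P(b) for independent bits."""
    return [a & b for a, b in zip(sa, sb)]

def value(stream):
    """Decode a unipolar stream: fraction of 1s."""
    return sum(stream) / len(stream)

rng = random.Random(0)
n = 1 << 12                    # longer streams give lower estimation error
sa = to_bitstream(5 / 8, n, rng)
sb = to_bitstream(1 / 2, n, rng)
prod = value(sc_multiply(sa, sb))
# prod approximates 5/8 * 1/2 = 0.3125, up to stochastic estimation error
```

The sketch also makes the cost model concrete: the loop over the streams is linear in the bit-stream length, which is why shortening the streams that must be processed directly shortens the multiplication time.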
By maintaining the result of (x_i × w_1), we can calculate the term x_i × w_2 by computing x_i × (w_2 − w_1) and adding the result to the already-available x_i × w_1. Exploiting this arithmetic property yields a considerable reduction in multiplication time, since in the developed architecture the bit-stream of w_2 − w_1 is shorter than the bit-stream of w_2. We introduce a differential Multiply-and-Accumulate unit, called DMAC, to exploit this property in the SkippyNN architecture. By sorting the weights in a weight vector, SkippyNN minimizes the differences between successive weights and, consequently, minimizes the computation time of the multiplications.

In convolutional layers, each filter consists of both positive and negative weights. The conventional approach to handling signed operations in SC-based designs is to use the bipolar SC domain [2, 19]. The range of representable numbers is extended from [0, 1] in the unipolar domain to [−1, 1] in the bipolar domain, at the cost of doubling the length of the bit-streams and thus doubling the processing time.
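The arithmetic identity behind DMAC can be illustrated with a small Python sketch. Plain floating-point arithmetic stands in for the bit-stream hardware here, and the function name is ours; this shows only the differential identity, not the accumulation datapath:

```python
def differential_products(x, weights):
    """Sketch of the DMAC idea: compute the first product x*w in
    full, then derive every later product from the previous one by
    adding x*(w_k - w_{k-1}).  Sorting the weights first keeps the
    successive differences small, which in the SC hardware means
    shorter bit-streams per differential step."""
    ws = sorted(weights)
    products = [x * ws[0]]                 # one full multiplication
    for prev_w, w in zip(ws, ws[1:]):
        # cheap differential step: only x*(w - prev_w) is new work
        products.append(products[-1] + x * (w - prev_w))
    return list(zip(ws, products))         # (weight, x*weight) pairs
```

For example, `differential_products(0.5, [0.2, 0.8, 0.4])` recovers the same products as multiplying directly, but only the first product requires a full-length multiplication; each subsequent one costs only a difference term.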