SkippyNN: An Embedded Stochastic-Computing Accelerator for
Convolutional Neural Networks
Reza Hojabr¹, Kamyar Givaki¹, SM. Reza Tayaranian¹, Parsa Esfahanian²,
Ahmad Khonsari¹,², Dara Rahmati², M. Hassan Najafi³
¹School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran
²School of Computer Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
³School of Computing and Informatics, University of Louisiana at Lafayette, Lafayette, LA, USA
{r.hojabr,givakik,m.taiaranian}@ut.ac.ir, {parsa.esfahanian,ak,dara.rahmati}@ipm.ir, najafi@louisiana.edu
ABSTRACT
Employing convolutional neural networks (CNNs) in embedded
devices calls for novel low-cost and energy-efficient CNN accelerators.
Stochastic computing (SC) is a promising low-cost alternative to
conventional binary implementations of CNNs. Despite the low-cost
advantage, SC-based arithmetic units suffer from prohibitive
execution times due to processing long bit-streams. In particular,
multiplication, the main operation in convolution computation, is
extremely time-consuming, which hampers employing SC methods
in designing embedded CNNs.
In this work, we propose a novel architecture, called SkippyNN,
that reduces the computation time of SC-based multiplications in
the convolutional layers of CNNs. Each convolution in a CNN is
composed of numerous multiplications in which each input value is
multiplied by a weight vector. Once the result of the first multiplication
is produced, the following multiplications can be performed by
multiplying the input by the differences of the successive weights.
Leveraging this property, we develop a differential Multiply-and-Accumulate
unit, called DMAC, to reduce the time consumed by
convolutions in SkippyNN. We evaluate the efficiency of SkippyNN
using four modern CNNs. On average, SkippyNN offers a 1.2x speedup
and a 2.7x energy saving compared to binary implementations of
CNN accelerators.
1 INTRODUCTION
A wide range of applications based on convolutional neural net-
works (CNNs) are emerging in various areas of computer vision. In
particular, employing CNNs in intelligent embedded devices interacting
with real-world environments has led to the advent of efficient
CNN accelerators. Limited computation resources and inadequate
power budget are two important challenges when applying neural
networks to embedded devices. Customized hardware implementa-
tions have gained a lot of attention in recent years to tackle these
challenges [5, 6, 25].
Recently, a handful of works have exploited stochastic computing
(SC) [2] in designing low-cost CNN accelerators [4, 7, 11, 13, 14, 16,
19, 22, 23]. Compared to the conventional binary implementations,
SC-based implementations often offer lower power consumption, a
smaller hardware area footprint, and a higher tolerance to soft errors
(i.e., bit flips) [3]. In SC, each number X (that is interpreted as the
DAC ’19, June 2ś6, 2019, Las Vegas, NV, USA
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6725-7/19/06. . . $15.00
https://doi.org/10.1145/3316781.3317911
probability P(X) in the range [0, 1]) is represented by a bit-stream in
which the density of 1s denotes P(X) [3]. For instance, a binary number
X = 0.101₂, interpreted as P(X) = 5/8, can be represented
by the bit-stream S = 11101001, in which the number of 1s and the
length of the bit-stream are five and eight, respectively.
The bit-stream-based representation makes SC numbers
more tolerant to soft errors than the conventional binary
radix representation: a single bit-flip in binary representation (e.g.,
a bit-flip in the most significant bit) may lead to a huge error, while
in an SC bit-stream it can cause only a small change in the value.
Simplicity of design is another important advantage: most arithmetic
operations require extremely simple logic in SC. For instance, the
multiplication operation is performed using a single AND gate, which has a
considerably lower hardware cost than a binary multiplier [2, 15].
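As a quick software sketch of this unipolar SC multiplication (a simulation of the idea, not the hardware described here; the function names are illustrative), the following snippet encodes two probabilities as random bit-streams and multiplies them with a bit-wise AND:

```python
import random

def to_bitstream(p, length, rng):
    # Encode a probability p in [0, 1] as a random bit-stream:
    # each bit is 1 with probability p, so the density of 1s approximates p.
    return [1 if rng.random() < p else 0 for _ in range(length)]

def sc_multiply(sa, sb):
    # Unipolar SC multiplication: bit-wise AND of two independent streams.
    # P(a AND b) = P(a) * P(b) when the streams are uncorrelated.
    return [a & b for a, b in zip(sa, sb)]

def density(stream):
    # Decode a unipolar bit-stream back to a value: the fraction of 1s.
    return sum(stream) / len(stream)

rng = random.Random(0)
n = 4096  # longer streams reduce error but increase computation time
sa = to_bitstream(5 / 8, n, rng)  # X = 0.101 in binary, as in the example
sb = to_bitstream(1 / 2, n, rng)
prod = density(sc_multiply(sa, sb))  # approximates 5/8 * 1/2 = 0.3125
```

Note that the accuracy of the product depends on the stream length and on the two streams being uncorrelated, which is exactly why practical SC designs face the long-bit-stream problem discussed next.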
Despite these benefits, SC-based operations encounter two important
problems: low accuracy and long computation time [2]. Prior
work showed that due to the approximate nature of neural networks,
CNN accelerators can be implemented with low-bitwidth binary
arithmetic units at no accuracy loss [6, 21, 26]. Our observations further
confirm that, similar to binary implementations, with long enough
bit-streams SC-based units do not impose a considerable degradation
on the neural network accuracy. Nevertheless, there is still a great
demand to decrease the computation time and to improve the energy
efficiency of SC-based CNN accelerators.
In this work, we propose a novel SC-based architecture, SkippyNN,
which aims at reducing the computation time of stochastic multiplications
in the convolution kernel, as these operations constitute a
substantial portion of the computation load in modern CNNs. Each
convolution is composed of numerous multiplications in which an
input xᵢ is multiplied by the successive weights w₁, ..., wₖ. The
computation time of SC-based multiplications is proportional to the
bit-stream length of the operands. By maintaining the result of
(xᵢ × w₁), we can obtain the term xᵢ × w₂ by calculating
xᵢ × (w₂ − w₁) and adding the result to the already-prepared
xᵢ × w₁. Employing this arithmetic property results in a considerable
reduction in the multiplication time, as the bit-stream of w₂ − w₁
is shorter than the bit-stream of w₂ in the developed architecture.
We introduce a differential Multiply-and-Accumulate unit, called DMAC,
to exploit this property in the SkippyNN architecture. By sorting the
weights in a weight vector, SkippyNN minimizes the differences between
successive weights and, consequently, minimizes the computation
time of the multiplications.
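The differential trick behind DMAC can be sketched in ordinary arithmetic (a software analogy of the idea, not the SC hardware itself; the function name is illustrative): after one full multiplication, every subsequent product over the sorted weight vector is recovered by accumulating xᵢ × (wₖ − wₖ₋₁):

```python
def dmac_products(x, weights):
    # Differential multiply-and-accumulate sketch: compute x*w for every
    # weight using one full multiplication plus a chain of "difference"
    # updates. Sorting the weights keeps successive differences small,
    # which in the SC hardware shortens the difference bit-streams.
    ws = sorted(weights)
    acc = x * ws[0]              # the only full multiplication
    out = {ws[0]: acc}
    for prev, cur in zip(ws, ws[1:]):
        acc += x * (cur - prev)  # x*cur = x*prev + x*(cur - prev)
        out[cur] = acc
    return out
```

For example, with x = 3 and weights {0.25, 0.5, 0.75} (all exactly representable in binary), the chain reproduces the direct products 0.75, 1.5, and 2.25 while multiplying x only by the small differences after the first step.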
In convolutional layers, each filter consists of both positive and
negative weights. The conventional approach to handling signed
operations in SC-based designs is to use the bipolar SC domain
[2, 19]. The range of numbers is extended from [0, 1] in the
unipolar domain to [-1, 1] in the bipolar domain, at the cost of
doubling the length of the bit-streams and thus doubling the
processing time.
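A minimal sketch of the bipolar encoding just described (again a software simulation with illustrative names): a value x ∈ [−1, 1] maps to a bit-stream whose probability of 1 is (x + 1)/2, and is decoded as twice the density of 1s minus one:

```python
import random

def to_bipolar(x, length, rng):
    # Bipolar SC encoding: a value x in [-1, 1] becomes a bit-stream
    # in which each bit is 1 with probability (x + 1) / 2.
    p = (x + 1) / 2
    return [1 if rng.random() < p else 0 for _ in range(length)]

def from_bipolar(stream):
    # Decode: x = 2 * (density of 1s) - 1.
    return 2 * sum(stream) / len(stream) - 1

rng = random.Random(1)
s = to_bipolar(-0.5, 4096, rng)  # a negative weight, expressible in bipolar SC
est = from_bipolar(s)            # approximates -0.5
```

In the bipolar domain, multiplication is performed with a single XNOR gate rather than an AND gate, since the XNOR of two bipolar streams encodes the product of their values [2].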