Towards Accurate, Energy-Efficient, & Low-Latency Spiking LSTMs

Gourav Datta 1, Haoqin Deng 2*, Robert Aviles 1, Peter A. Beerel 1
1 University of Southern California, USA
2 University of Washington, Seattle, USA

Abstract

Spiking Neural Networks (SNNs) have emerged as an attractive spatio-temporal computing paradigm for complex vision tasks. However, most existing works yield models that require many time steps and do not leverage the inherent temporal dynamics of spiking neural networks, even for sequential tasks. Motivated by this observation, we propose an optimized spiking long short-term memory (LSTM) network training framework that involves a novel ANN-to-SNN conversion framework, followed by SNN training. In particular, we propose novel activation functions in the source LSTM architecture and judiciously select a subset of them for conversion to integrate-and-fire (IF) activations with optimal bias shifts. Additionally, we derive the leaky-integrate-and-fire (LIF) activation functions converted from their non-spiking LSTM counterparts, which justifies the need to jointly optimize the weights, threshold, and leak parameter. We also propose a pipelined parallel processing scheme that hides the SNN time steps, significantly improving system latency, especially for long sequences. The resulting SNNs have high activation sparsity and require only accumulate operations (AC), in contrast to the expensive multiply-and-accumulates (MAC) needed for ANNs, except for the input layer when using direct encoding, yielding significant improvements in energy efficiency. We evaluate our framework on sequential learning tasks including temporal MNIST, Google Speech Commands (GSC), and UCI Smartphone datasets on different LSTM architectures. We obtain a test accuracy of 94.75% with only 2 time steps with direct encoding on the GSC dataset with ∼4.1× lower energy than an iso-architecture standard LSTM.
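To make the integrate-and-fire (IF) activation named in the abstract concrete, the following is a minimal illustrative sketch of a single IF neuron driven by a constant analog input. It is not the conversion framework proposed in this paper: the function name `if_neuron_rate`, the constant input, and the reset-by-subtraction rule are all assumptions chosen for illustration. It shows that the threshold-scaled firing rate approaches a ReLU clipped at the threshold as the number of time steps grows, and that each step needs only an accumulation.

```python
def if_neuron_rate(current, threshold=1.0, num_steps=100):
    """Simulate one IF neuron with a constant input (illustrative assumption).

    Returns the firing rate scaled by the threshold, which approximates
    min(max(current, 0), threshold) when num_steps is large.
    """
    membrane = 0.0
    spikes = 0
    for _ in range(num_steps):
        membrane += current            # accumulate only; no multiplication
        if membrane >= threshold:
            spikes += 1                # emit a binary spike
            membrane -= threshold      # reset by subtraction (assumed rule)
    return spikes * threshold / num_steps


if __name__ == "__main__":
    for x in (-0.5, 0.0, 0.3, 0.75, 2.0):
        print(x, if_neuron_rate(x))
```

With `threshold=1.0` and `num_steps=2`, the returned rate can only be 0, 0.5, or 1, so in this simple model a very small time-step budget quantizes the activation coarsely.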
* Work done at University of Southern California
arXiv:2210.12613v1 [cs.NE] 23 Oct 2022

Introduction & Related Work

In contrast to the neurons in ANNs, the neurons in Spiking Neural Networks (SNNs) are biologically inspired, receiving and transmitting information via spikes. SNNs promise higher energy efficiency than ANNs due to their high activation sparsity and event-driven spike-based computation (Diehl et al. 2016b), which helps avoid the costly multiplication operations that dominate ANNs. To handle multi-bit inputs, such as those typical of traditional datasets and real-life sensor-based applications, however, the inputs are often spike encoded in the temporal domain using rate coding (Diehl et al. 2016b), temporal coding (Comsa et al. 2020), or rank-order coding (Kheradpisheh et al. 2020). Alternatively, instead of spike encoding the inputs, some researchers have explored directly feeding the analog pixel values into the first convolutional layer, thereby emitting spikes only in the subsequent layers (Rathi et al. 2020b). This can dramatically reduce the number of time steps needed to achieve state-of-the-art accuracy, but comes at the cost that the first layer now requires MACs (Rathi et al. 2020b; Datta et al. 2022; Kundu et al. 2021). However, all these encoding techniques increase the end-to-end latency (proportional to the number of time steps) compared to their non-spiking counterparts.

In addition to accommodating various forms of spike encoding, supervised learning algorithms for SNNs, such as surrogate gradient learning (SGL), have overcome various roadblocks associated with the discontinuous derivative of the spike activation function (Lee et al. 2016; Kim and Panda 2021b; Neftci, Mostafa, and Zenke 2019; Panda et al. 2020). It is also commonly agreed that SNNs following the integrate-and-fire (IF) compute model can be converted from ANNs with low error by approximating the activation value of ReLU neurons with the firing rate of spiking neurons (Sengupta et al. 2019; Rathi et al. 2020a; Diehl et al. 2016b). SNNs trained using ANN-to-SNN conversion, coupled with SGL, have been able to perform similarly to state-of-the-art (SOTA) CNNs in terms of test accuracy in traditional image recognition tasks (Rathi et al. 2020b,a), with significant advantages in compute efficiency. Previous works (Rathi et al. 2020b; Datta et al. 2021; Kundu et al. 2021) have adopted SGL to jointly train the threshold and leak values to improve the accuracy-latency tradeoff, but without any analytical justification.

In spite of numerous innovations in SNN training algorithms for static (Panda and Roy 2016; Panda et al. 2020; Rathi et al. 2020b,a; Kim and Panda 2021b) and dynamic vision tasks (Kim and Panda 2021a; Li et al. 2022), there has been relatively little research that targets SNNs for sequence learning tasks. Among the existing works, some are limited to the use of spiking inputs (Rezaabad and Vishwanath 2020; Ponghiran and Roy 2021b), which might not represent several real-world use cases. Furthermore, some (Deng and Gu 2021; Moritz, Hori, and Roux 2019; Diehl et al. 2016a) propose to obtain SNNs from vanilla RNNs, which has been shown to yield a large accuracy drop for large-scale sequence learning tasks, as they are unable to model temporal dependencies for long sequences. Others (Ponghiran and Roy 2021a) use the same input expansion approach for spike encoding and yield SNNs that require serial processing for each input in the sequence, severely increasing total latency. A more recent work (Ponghiran and Roy 2021b) proposed a neuron model more complex than the popular IF or leaky-integrate-and-fire (LIF) models to improve the recurrence dynamics for sequential learning. Additionally, it lets the hidden activation maps be multi-bit (as opposed to binary spikes), which improves training but requires multiplications that reduce energy efficiency compared to the multiplier-less