Urban Sound Classification using Long Short-Term Memory Neural Network

Iurii Lezhenin, Natalia Bogach
Institute of Computer Science and Technology
Peter the Great St.Petersburg Polytechnic University
St.Petersburg, 195251, Russia
Email: {lezhenin, bogach}@kspt.icc.spbstu.ru

Evgeny Pyshkin
Software Engineering Lab
University of Aizu
Aizu-Wakamatsu, 965-8580, Japan
Email: pyshe@u-aizu.ac.jp

Abstract—Environmental sound classification has received increasing attention in recent years. The analysis of environmental sounds is difficult because of their unstructured nature. However, the presence of strong spectro-temporal patterns makes classification possible. Since LSTM neural networks are efficient at learning temporal dependencies, we propose and examine an LSTM model for urban sound classification. The model is trained on magnitude mel-spectrograms extracted from UrbanSound8K dataset audio. The proposed network is evaluated using 5-fold cross-validation and compared with a baseline CNN. It is shown that the LSTM model outperforms a set of existing solutions and is more accurate and confident than the CNN.

Index Terms—environmental sound classification, long short-term memory, convolutional neural networks, UrbanSound8K dataset

I. INTRODUCTION

Audio recognition algorithms are traditionally used for the tasks of speech and music signal processing. Meanwhile, the problems of environmental sound recognition and classification have received much attention in recent years. Multiple applications have already been proposed across a wide variety of industries, including surveillance [1], [2], audio scene recognition for robot navigation [3], and acoustic monitoring of natural and artificial environments [4]–[6]. In a digitally transformed society [7], soundscape models open a research perspective in the smart city domain. City noise management significantly contributes to a healthy and safe living environment in big cities [8].
In travel-centric systems, city sounds may enter emerging solutions for developing and sharing journey experience [9], [10]. Assistive technologies for people with disabilities and, in particular, navigation systems for blind or visually impaired people effectively incorporate urban sound models [11].

Environmental sound analysis is more complex than speech and music processing because of the unstructured nature of the sounds. There are no meaningful sequences of elementary blocks like phonemes, and no strong stationary patterns such as melody or rhythm. However, environmental sounds may include strong spectro-temporal signatures. Thus, it is important to consider the non-stationary aspects of a signal and to capture its variation in both the time and frequency domains.

The classification of environmental sounds is often split into auditory scene classification and classification of sounds by their source, but both problems share similar approaches. The methods used involve the k-Nearest Neighbors (k-NN) algorithm, Support Vector Machines (SVM), Gaussian Mixture Models (GMM) and Hidden Markov Models (HMM) in combination with features engineered by signal processing techniques, e.g. Mel-Frequency Cepstral Coefficients (MFCC), Discrete Wavelet Transform (DWT) coefficients and Matching Pursuit (MP) features [12]–[14]. In contrast with the described approaches, deep neural networks (DNN) facilitate feature engineering while preserving classification accuracy, and can even outperform conventional solutions [15]. In particular, convolutional neural networks (CNN), being able to capture spectro-temporal patterns from spectrogram-like input, show high performance [16]–[19]. Long short-term memory (LSTM) networks are another type of neural network architecture exploited for sound classification [20], as are combinations of LSTM and CNN [21], [22].
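To make the LSTM-based approach concrete, a minimal Keras sketch of a spectrogram-to-class model is given below. The layer sizes, frame count, and single-layer architecture are assumptions for illustration only, not the network proposed in this paper.

```python
import numpy as np
import tensorflow as tf

N_MELS, N_FRAMES, N_CLASSES = 128, 128, 10  # UrbanSound8K has 10 classes

# Minimal sketch: one LSTM layer reads the mel-spectrogram frame by frame,
# and a softmax layer maps the final hidden state to class probabilities.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(N_FRAMES, N_MELS)),  # (time, mel bands)
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# One random "spectrogram" batch just to exercise the forward pass.
probs = model.predict(np.random.rand(2, N_FRAMES, N_MELS), verbose=0)
```

In a real experiment, the random batch would be replaced by mel-spectrograms cut or padded to a fixed number of frames, and the model trained with the dataset's predefined folds.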
LSTM networks are recurrent neural networks (RNN) that use contextual information over long time intervals to map an input sequence to an output. The LSTM network is a general-purpose solution that is efficient at learning temporal dependencies. Its application is beneficial in a variety of tasks, such as phoneme classification [23], speech recognition [24] and speech synthesis [25]. An LSTM network combined with a CNN was also successfully used for video classification [26].

The applicability of LSTM to sound classification has not been fully investigated so far. In this paper we examine an LSTM model to improve the understanding of its applicability specifically to urban sound classification using the UrbanSound8K dataset [27]. Table A1 in the Appendix summarizes some of the existing solutions whose models are evaluated on UrbanSound8K. The baseline accuracy of 70% was obtained with an SVM processing mel-band and MFCC features statistically summarized across time [27]. Unsupervised feature learning using Spherical K-Means (SKM) performed on PCA-whitened log-scaled mel-spectrograms achieves 73.6% accuracy [28]. CNNs of different architectures trained on log-scaled mel-spectrogram frames provide 73% accuracy, and 79% with data augmentation [16], [17]. The LSTM-based CRNN for urban sound classification demonstrates 79.06% accuracy using raw waveforms [22]. An accuracy of 93% was shown by GoogLeNet trained on a combination of features.

Proceedings of the Federated Conference on Computer Science and Information Systems, pp. 57–60
DOI: 10.15439/2019F185
ISSN 2300-5963, ACSIS, Vol. 18
IEEE Catalog Number: CFP1985N-ART
© 2019, PTI
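The classical 70% baseline surveyed above (frame-level features statistically summarized across time, fed to an SVM, scored by cross-validation) can be sketched with scikit-learn. Here random vectors stand in for real mel-band/MFCC summary statistics, and the kernel and regularization settings are assumptions rather than the published configuration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in for per-clip features, e.g. mean and std of 25 MFCCs across time
# (a real system would compute these from UrbanSound8K audio clips).
X = rng.normal(size=(200, 50))
y = rng.integers(0, 10, size=200)  # 10 urban sound classes

# Standardize features, then classify with an RBF-kernel SVM.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# 5-fold cross-validation, mirroring the fold-based evaluation protocol.
scores = cross_val_score(clf, X, y, cv=5)
```

On random features the scores hover near chance level; the point of the sketch is the pipeline shape, not the number.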