Detection and Classification of Acoustic Scenes and Events 2020 Challenge

MODULATION SPECTRAL SIGNAL REPRESENTATION AND I-VECTORS FOR ANOMALOUS SOUND DETECTION
Technical Report

Parth Tiwari 1,3, Yash Jain 2, Anderson Avila 3, João Monteiro 3, Shruti Kshirsagar 3, Amr Gaballah 3, Tiago H. Falk 3

1 Department of Industrial and Systems Engineering, IIT Kharagpur, India
2 Department of Mathematics, IIT Kharagpur, India
3 MuSAE Lab, Institut National de la Recherche Scientifique - Centre EMT, Montreal, Canada

ABSTRACT

This report summarizes our submission for Task 2 of the DCASE 2020 Challenge. We propose two different anomalous sound detection systems: one based on features extracted from a modulation spectral signal representation, and the other based on i-vectors extracted from mel-band features. The first system uses a nearest-neighbour graph to construct clusters that capture local variations in the training data. Anomalies are then identified based on their distance from the cluster centroids. The second system uses i-vectors extracted from mel-band spectra to train a Gaussian Mixture Model. Anomalies are then identified using their negative log-likelihood. Both methods show significant improvement over the DCASE Challenge baseline AUC scores, with an average improvement of 6% across all machines. An ensemble of the two systems is shown to further improve the average performance by 11% over the baseline.

Index Terms— i-Vectors, Amplitude-Modulation Spectra, Graph Clustering, Gaussian Mixture Models

1. INTRODUCTION

Monitoring industrial machinery can prevent the production of faulty products and decrease the chances of machine breakdown. Anomalous sounds can indicate symptoms of unwanted activity; hence, Anomalous Sound Detection (ASD) systems can potentially be used for real-time monitoring of machines. Unsupervised ASD systems rely on only “normal” sounds for identifying anomalies.
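The first system's scoring rule described in the abstract (distance to the nearest cluster centroid of normal training data) can be sketched as follows. This is a minimal illustration, not the submitted implementation: the function name, the Euclidean metric, and the toy data are all assumptions for demonstration.

```python
import numpy as np


def anomaly_scores(features, centroids):
    """Score each clip by its distance to the nearest cluster centroid.

    features:  (n_samples, n_dims) array of per-clip feature vectors
    centroids: (n_clusters, n_dims) centroids built from normal sounds
    Returns an (n_samples,) array; larger values = more anomalous.
    """
    # Pairwise Euclidean distances between every sample and every centroid
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
    # Each sample is scored against its closest cluster of normal sounds
    return dists.min(axis=1)


# Toy usage: two "normal" centroids, one nearby sample and one distant one
centroids = np.array([[0.0, 0.0], [1.0, 1.0]])
samples = np.array([[0.1, 0.0], [5.0, 5.0]])
scores = anomaly_scores(samples, centroids)
```

At detection time, a threshold on this score separates normal clips from anomalies.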
Developing ASD systems in an unsupervised manner is essential, as: (i) the nature of anomalies may not be known beforehand, and (ii) deliberately destroying expensive devices is impractical from a development-cost perspective. In addition, the frequency at which anomalies occur is low and the variability in the type of anomaly is high; therefore, developing balanced datasets for supervised learning is difficult.

In our proposed systems, we focus on using features that are able to capture anomalous behaviour. Simple machine learning algorithms used on top of these features are able to beat the baseline performance [1]. In our first system, we propose an outlier detection method similar to a nearest-neighbour search. In this method, clusters of normal sounds are formed using a nearest-neighbour graph to capture variations in the normal working sounds of a machine. Anomalies are then identified based on their distance from these clusters. Modulation spectrum features are used for this system. These features are extracted from the so-called modulation spectrum (MS) signal representation, which quantifies the rate of change of the signal's spectral components over time. They have previously been useful for stress detection in speech [2], speech enhancement [3], and room acoustic characterization [4], to name a few applications.

In our second system, in turn, we use i-vectors and Gaussian Mixture Models (GMMs) for anomaly detection. i-Vectors have been widely used for speech applications, including speech, speaker, language, and accent recognition. We extract i-vectors from MFCC features and use them to train GMMs for anomaly detection. The negative log-likelihood of a sample is used as its anomaly score. Lastly, we also experiment with an ensemble of these two systems.

2. SYSTEM DESCRIPTION

2.1. System 1 - Graph Clustering using Modulation Spectrograms

2.1.1.
Pre-processing and Feature Extraction

The modulation spectrum corresponds to an auditory spectro-temporal representation that captures the long-term dynamics of an audio signal. The pipeline proposed in [5] is used to extract modulation spectrograms. Prior to feature extraction, noise reduction is performed on the signal through a spectral gating method (using noisereduce 1 in Python), described as follows: 100 normal training sound clips for a machine-id are averaged and used as a noise clip for that machine-id. An FFT is calculated over this noise clip, and statistics, including the mean power, are tabulated for each frequency band. A threshold for each frequency band is calculated based on these statistics. An FFT is then calculated over the signal, and a mask is determined by comparing the signal FFT to the threshold. The mask is smoothed with a filter over frequency and time, applied to the FFT of the signal, and the result is inverted back to the time domain.

After noise removal, the speech activity level is normalized to -26 dBov (dB overload), thus eliminating unwanted energy variations caused by different loudness levels in the speech signal. Next, the pre-processed speech signal x̂(n) is filtered by a 60-channel gammatone filterbank, simulating cochlear processing [6]. The first filter of the filterbank is centered at 125 Hz and the last one just below half of the sampling rate. Each filter bandwidth follows the

1 https://pypi.org/project/noisereduce/
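The spectral gating steps above can be sketched as follows. This is a simplified stand-in for the noisereduce package, not its actual implementation: the STFT window length (nperseg), the n_std threshold multiplier, and the smoothing window sizes are illustrative assumptions, and the report's per-band statistics may differ.

```python
import numpy as np
from scipy.ndimage import uniform_filter
from scipy.signal import istft, stft


def spectral_gate(signal, noise_clip, fs, n_std=1.5, nperseg=512):
    """Simplified spectral gating: gate out time-frequency bins whose
    level falls below a per-band threshold learned from a noise clip."""
    # Per-band statistics of the noise clip's magnitude spectrogram (dB)
    _, _, noise_spec = stft(noise_clip, fs=fs, nperseg=nperseg)
    noise_db = 20 * np.log10(np.abs(noise_spec) + 1e-10)
    # One threshold per frequency band: mean noise level plus a margin
    thresh = noise_db.mean(axis=1) + n_std * noise_db.std(axis=1)

    # Magnitude spectrogram of the signal to be denoised (dB)
    _, _, sig_spec = stft(signal, fs=fs, nperseg=nperseg)
    sig_db = 20 * np.log10(np.abs(sig_spec) + 1e-10)

    # Binary mask keeps bins above the per-band threshold...
    mask = (sig_db > thresh[:, None]).astype(float)
    # ...then is smoothed over frequency (3 bins) and time (5 frames)
    mask = uniform_filter(mask, size=(3, 5))

    # Apply the mask and invert the STFT back to the time domain
    _, denoised = istft(sig_spec * mask, fs=fs, nperseg=nperseg)
    return denoised


# Toy usage: a 1 kHz tone in white noise, gated against a pure-noise clip
fs = 16000
rng = np.random.default_rng(0)
noise = 0.1 * rng.standard_normal(fs)
t = np.arange(fs) / fs
sig = np.sin(2 * np.pi * 1000 * t) + 0.1 * rng.standard_normal(fs)
denoised = spectral_gate(sig, noise, fs)
```

In the report's pipeline the noise clip is the average of 100 normal training clips per machine-id, rather than recorded noise.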