Detection and Classification of Acoustic Scenes and Events 2020 Challenge

DEVELOPMENT OF THE INRS-EMT SCENE CLASSIFICATION SYSTEMS FOR THE 2020 EDITION OF THE DCASE CHALLENGE (TASKS 1A AND 1B)

Technical Report

Amr Gaballah 1*, Anderson Avila 1*, Joao Monteiro 1*, Parth Tiwari 1,2*, Shruti Kshirsagar 1*, Tiago H. Falk 1

1 Institut National de la Recherche Scientifique - Centre EMT, Montreal - Canada
2 Dept. of Industrial and Systems Engineering, IIT Kharagpur, India

ABSTRACT

In this report, we provide a brief overview of our submissions to the scene classification sub-tasks of the 2020 edition of the DCASE challenge. Our submissions comprise efforts at the feature representation level, where we explored the use of modulation spectra and log-mel filter banks, as well as at the modeling level, where recent convolutional deep neural network models were used. Results on the Challenge validation set show several of the submitted methods outperforming the baseline model.

Index Terms: Scene classification, i-vectors, Modulation spectra, Convolutional models

1. SUMMARY OF CONTRIBUTIONS

1.1. Task 1A

We submit systems consisting of convolutional models trained on top of spectral representations of audio, namely:

1. System S1: The first system builds on a standard ResNet-18 [1] and removes some of its layers, which we empirically found to improve performance on the validation partition. We refer to this model as ResNet-12. Kaldi-style log-mel filter banks are used as inputs and treated as single-channel images, i.e., spatial-temporal convolutions are employed. Besides feature extraction, pre-processing consists of data augmentation, which is performed in two steps: 1) prior to feature computation, using sox distortions in gain and tempo; 2) directly on the spectra, by randomly dropping out contiguous chunks along both the time and frequency dimensions, as well as adding Gaussian noise.
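The second augmentation step (dropping contiguous time and frequency chunks and adding Gaussian noise) can be illustrated with a minimal sketch. The function name `mask_and_add_noise`, the maximum chunk sizes, the noise standard deviation, and the choice of zero as the fill value are all assumptions made for illustration; the report does not specify them, and this is not the authors' implementation.

```python
import random


def mask_and_add_noise(spec, max_time=20, max_freq=8, noise_std=0.1, rng=random):
    """Return an augmented copy of `spec`, a list of frames (time x frequency).

    All default values are illustrative assumptions, not settings from the report.
    """
    n_frames, n_bins = len(spec), len(spec[0])
    out = [list(frame) for frame in spec]  # copy so the input stays untouched

    # Add zero-mean Gaussian noise to every time-frequency cell.
    for frame in out:
        for f in range(n_bins):
            frame[f] += rng.gauss(0.0, noise_std)

    # Drop one contiguous chunk of frames (time masking); fill value 0.0 is assumed.
    t_len = rng.randint(0, min(max_time, n_frames))
    t0 = rng.randint(0, n_frames - t_len)
    for t in range(t0, t0 + t_len):
        out[t] = [0.0] * n_bins

    # Drop one contiguous band of frequency bins (frequency masking).
    f_len = rng.randint(0, min(max_freq, n_bins))
    f0 = rng.randint(0, n_bins - f_len)
    for frame in out:
        frame[f0:f0 + f_len] = [0.0] * f_len

    return out
```

Passing a seeded `random.Random` instance as `rng` makes the distortion reproducible, which can help when debugging a data pipeline.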
All augmentations are performed online: every time a given recording is sampled, we randomly decide whether it will be distorted, such that, on average, half of the examples are presented to the model after some form of augmentation has been applied.

2. System S2: Our second submission makes use of time delay neural networks (TDNN) [2]. Such models are often used in speech recognition to compute frame-level representations. Utterance-level variations of TDNNs were shown in recent literature to be effective in computing speaker- or language-dependent representations when some form of temporal pooling is added. We thus leverage that architecture for the task considered herein and train an x-vector TDNN [3] with statistical temporal pooling on top of the same representations discussed for System S1, employing exactly the same augmentation strategy described above. The TDNN we employed is made up of 5 temporal dilated convolutional layers followed by temporal pooling and 2 dense layers. We further remark that, for both Systems S1 and S2, we initialize the models from versions pre-trained on the data released for Task 1B, which we observed to improve validation performance for some classes.

3. System S3: Our third submission is once more based on a ResNet architecture. In this case, we employed a ResNet-18 as is, but on top of modulation spectra computed from the log-mel filter banks described before. The modulation spectra are obtained by computing the STFT over each frequency bin of the previously computed mel-spectra. We average the results across time and end up with a two-dimensional representation: acoustic frequency vs. modulation frequency. The same types of augmentation were used in this case as well. No pre-training was performed, and the ResNet-18 was trained from scratch.

4. System S4: Our fourth submission corresponds to a score-level fusion of five systems. We considered the three systems discussed above and added a simple 2-layer convolutional model as well as a ResNet-12 trained from scratch; in both cases, log-mel spectra were used as inputs. Fusion is performed by simple averaging: given a test example, we project it onto the probability simplex by forwarding it through each of the five considered models, and average the resulting outputs. Our final prediction is the most likely class according to the combined set of scores.

*Equal contribution. Authors listed in alphabetical order.

1.2. Task 1B

For this task, we employ a small ReLU-activated 2-layer convolutional model, trained on top of Kaldi-style log-mel filter banks. Each convolutional layer is followed by batch normalization. Features consist of 40 log-mel filter banks extracted using the Kaldi-compliant API of torchaudio 1. Data augmentation is performed to increase the diversity of the training data, which we do by randomly deciding when to augment, and further randomly deciding which kinds of distortions will be

1 https://pytorch.org/audio/compliance.kaldi.html
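As an aside on the score-level fusion used for System S4 in Section 1.1, the averaging scheme can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes each model emits raw class scores that are mapped to the probability simplex with a softmax (the report does not state how the projection is done), and the names `softmax` and `fuse_scores` are hypothetical.

```python
import math


def softmax(logits):
    """Map raw scores to a point on the probability simplex (assumed projection)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(per_model_logits):
    """Average per-model class probabilities; return (predicted class, fused scores)."""
    probs = [softmax(logits) for logits in per_model_logits]
    n_models = len(probs)
    n_classes = len(probs[0])
    # Unweighted mean of the models' probabilities for each class.
    fused = [sum(p[c] for p in probs) / n_models for c in range(n_classes)]
    # The final prediction is the most likely class under the fused scores.
    predicted = max(range(n_classes), key=fused.__getitem__)
    return predicted, fused
```

Because every model contributes a distribution that sums to one, the unweighted mean is itself a valid distribution, so the fused scores can still be read as class probabilities.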