Detection and Classification of Acoustic Scenes and Events 2020 Challenge

DCASE 2020 TASK 3: SOUND EVENT LOCALIZATION AND DETECTION USING RESIDUAL SQUEEZE-EXCITATION CNNS

Technical Report

Javier Naranjo-Alcazar 1,2, Sergi Perez-Castanos 1, Jose Ferrandis 1, Pedro Zuccarello 1, Maximo Cobos 2

1 Visualfy, Benisano, Spain, {javier.naranjo, sergi.perez, jose.ferrandis, pedro.zuccarello}@visualfy.com
2 Universitat de València, Burjassot, Spain, {janal2}@alumni.uv.es, {maximo.cobos}@uv.es

ABSTRACT

Sound Event Localization and Detection (SELD) is a problem related to the field of machine listening whose objective is to recognize individual sound events, detect their temporal activity, and estimate their spatial location. Thanks to the emergence of more hard-labeled audio datasets, Deep Learning techniques have become state-of-the-art solutions. The most common ones implement a convolutional recurrent neural network (CRNN) after transforming the audio signal into a multichannel 2D representation. In the context of this problem, the input to the network usually has many more channels than in other machine listening problems, because the audio is recorded by a microphone array: a frequency representation is obtained for each channel, together with additional representations, such as the generalized cross-correlation (GCC), whose objective is to assess the relationship between channels. This work aims to improve the accuracy of the baseline CRNN by adding residual squeeze-excitation (SE) blocks to the convolutional part of the CRNN. The followed procedure involves a grid search over the ratio parameter of the residual SE block, whereas the remaining hyperparameters of the network are kept the same as in the baseline. Experiments show that, by simply introducing the residual SE blocks, the results obtained in the development phase clearly exceed the baseline.
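The residual SE block described in the abstract can be illustrated with a minimal, framework-agnostic forward-pass sketch in NumPy. This is not the submitted implementation: the layer shapes, the identity shortcut and the function names are illustrative assumptions; only the squeeze (global pooling), excitation (bottleneck MLP with reduction `ratio`) and channel-wise rescaling follow the standard SE formulation.

```python
import numpy as np

def squeeze_excitation(x, w1, b1, w2, b2):
    """Channel-wise SE recalibration of a (T, F, C) feature map.
    w1: (C, C//ratio) and w2: (C//ratio, C) are the excitation weights."""
    # Squeeze: global average pooling over the time-frequency plane
    z = x.mean(axis=(0, 1))                       # (C,)
    # Excitation: bottleneck MLP, ReLU then sigmoid gate in (0, 1)
    h = np.maximum(0.0, z @ w1 + b1)              # (C // ratio,)
    s = 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))      # (C,)
    # Scale: reweight each channel of the feature map
    return x * s                                  # broadcasts over (T, F, C)

def residual_se(x, conv, w1, b1, w2, b2):
    """Residual SE block: y = x + SE(conv(x)).
    `conv` stands for any channel-preserving convolutional mapping."""
    return x + squeeze_excitation(conv(x), w1, b1, w2, b2)
```

The `ratio` parameter explored in the grid search corresponds here to the reduction factor between the shapes of `w1` and `w2`: a larger ratio gives a narrower excitation bottleneck and fewer added parameters.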
Index Terms— SELD, Deep Learning, Convolutional Recurrent Neural Network, Squeeze-Excitation, Residual learning, DCASE2020

1. INTRODUCTION

Sound Event Localization and Detection (SELD) addresses two machine listening problems at the same time: tracking the activation of different classes (detection) and the spatial localization of sound events [1, 2, 3, 4]. For an intelligent system to be able to compute such outputs, the audio must have been recorded by a microphone array (multichannel audio input).

SELD first appeared in the DCASE 2019 edition as an evolution of the Sound Event Detection (SED) problem. SED was presented in the first edition of DCASE in 2013 [5] and appeared again as a task in the 2016 [6] and 2017 [7] editions. The objective of this task is the individual detection of particular events that occur in a scene. The nature of this problem is directly confronted with the polyphonic nature of audio [8, 9], i.e. the overlapping of several events in the same time period. The SELD task in the DCASE 2020 edition can be seen as a modification of the 2019 DCASE challenge. The modifications made in this edition concern the presented dataset, which has been enlarged, and the detection metrics, which now count a prediction as a true positive only when it lies within a 20° threshold from the reference.

Regarding the dataset, called TAU-NIGENS Spatial Sound Events 2020 [10], it should be observed that each scene has been recorded in two different formats: using an array of 4 microphones (MIC) and with first-order Ambisonics (FOA). In both recording formats (MIC or FOA), each sound event in the scene is associated with a direction-of-arrival (DoA) relative to the recording point, and with temporal onset and offset times. The number of classes to be detected is 14, including piano, male speech, female speech and barking dog, among others. As can be noticed, sounds belonging to these classes are easily found in domestic environments.
This encourages the proposal of solutions that could improve real-world applications such as home assistants [11]. For this submission, the MIC recording format has been used. In the MIC setup, the microphones are placed on a spherical acoustically-hard baffle, and their positions, described in spherical coordinates (φ, θ, r), are as follows:

• M1: (45°, 35°, 4.2 cm)
• M2: (-45°, -35°, 4.2 cm)
• M3: (135°, -35°, 4.2 cm)
• M4: (-135°, 35°, 4.2 cm)

Some of the modifications of the dataset presented in this edition with respect to the previous one are the following:

• (2x) Large lecture halls with inclined floor. Ventilation noise.
• (2x) Modern classrooms with multiple seating tables and carpet flooring. Ventilation noise.
• (2x) Meeting rooms with hard floor and partially glass walls. Ventilation noise.
• (2x) Old-style large classrooms with hard floor and rows of desks. Ventilation noise.
• Large open space in an underground bomb shelter, with plastic floor and rock walls. Ventilation noise.
• Large open gym space. People using weights and gym equipment.

The development dataset is divided into several folders: 4 folders (3-6) are used for training, folder 2 for validation and folder 1 for testing.

Regarding the difference in the metrics used in this edition, the intention is to obtain a more representative evaluation of the problem through a joint assessment of localization and detection [12]. A prediction will be considered correct if both are of the same class and