Detection and Classification of Acoustic Scenes and Events 2020 Challenge

URBAN SOUND CLASSIFICATION USING CONVOLUTIONAL NEURAL NETWORKS FOR DCASE 2020 CHALLENGE
Technical Report

Itxasne Díez Gaspón, Peio González, Ibon Saratxaga†

Noismart, C/ Ogoño nº1, 5º Piso, Edificio Elkartegi, 48930 Getxo, Las Arenas, Bizkaia (itxasne@noismart.com)
HiTZ Center – Aholab, University of the Basque Country UPV/EHU, 1 Torres Quevedo Sq., 48013 Bilbao, Spain (ibon.saratxaga@ehu.eus)

ABSTRACT

This technical report describes our system proposed for Task 5, Urban Sound Tagging. The system has a core architecture based on Convolutional Neural Networks. The network takes log mel-spectrogram features as input, which are processed by two CNN layers. The output of the convolutional stack is processed by several fully connected layers plus an output layer to produce the classification decision. Since spatiotemporal context data is also available, we also propose a multi-input architecture with two input branches that are merged for the final processing. In it, the spatiotemporal context information is processed by an additional neural network of two fully connected layers; its output is merged with the output of the CNN stack and the resulting data is fed to the fully connected output block. In this report, we describe the proposed models in detail and compare them to the baseline approach using the provided development datasets. Finally, we present the results obtained with the validation split of the dataset.

Index Terms— Urban Sound Tagging, CNN, DNN, multi-input

1. INTRODUCTION

In recent years, there has been an increase in the development of Smart Cities, where automated monitoring systems are intended to manage aspects such as traffic and pollution more efficiently. One of the major challenges that researchers face is to detect segments of different sound events in long recordings obtained from continuously operating sensors deployed in the field.
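As a rough illustration of the audio-only model described in the abstract (two convolutional layers over the log mel-spectrogram, followed by fully connected layers and an output layer), a minimal sketch in PyTorch is shown below. All layer widths, kernel sizes, pooling factors and the number of output tags are illustrative assumptions, not the exact configuration of our system.

```python
import torch
import torch.nn as nn

class AudioCNN(nn.Module):
    """Sketch of a two-layer CNN tagger over log mel-spectrogram input.

    Dimensions are hypothetical: 64 mel bands, 8 coarse output tags.
    """
    def __init__(self, n_classes=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # CNN layer 1
            nn.ReLU(),
            nn.MaxPool2d(4),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # CNN layer 2
            nn.ReLU(),
            nn.MaxPool2d(4),
            nn.AdaptiveAvgPool2d((4, 4)),  # fixed-size map for any clip length
        )
        self.classifier = nn.Sequential(   # fully connected block + output
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, 128),
            nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):
        # x: (batch, 1, n_mels, n_frames) log mel-spectrogram
        # sigmoid output: tagging is multi-label, one probability per tag
        return torch.sigmoid(self.classifier(self.features(x)))
```

The sigmoid output layer reflects that urban sound tagging is a multi-label problem: several sound classes can be active in the same recording.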
For the last year, we have been working on the development of an urban sound detection and classification system, following the research line of Piczak [1], who used Convolutional Neural Networks for audio classification, and that of Bello et al. [2], who deployed a sensor network in an urban environment integrating machine listening techniques to process the audio automatically.

This work has been supported by the Dept. of Economic Development and Infrastructure of the Basque Government (BIKAINTEK).
† This work has been partially supported by the Dept. of Education of the Basque Government, code IT1355-19.

The objective of Task 5 is aligned with this research line: it aims to detect and classify urban sound recordings using not only audio recordings but also contextual information about the recording environment (place, time, sensor, etc.), the so-called spatiotemporal context (STC) data. We present two systems for this task. The first one is a model that classifies urban sounds using only audio information as input. The second one is a multi-input model that uses both audio and STC data as input.

The present report is divided into the following sections: in Section 2, we describe the task in detail; in Section 3, we present the proposed models; the experiments are described in Section 4 and the results in Section 5. The report ends with some conclusions.

2. TASK DESCRIPTION

Task 5, Urban Sound Tagging, aims to predict urban sound tags using not only audio signals but also spatiotemporal context data. The task provides all the audio recording files together with metadata that gathers all the information about the recording time, localization and tag annotations. In the following paragraphs, we describe the audio and STC datasets provided for the task.

2.1 Audio Dataset

The provided audio files have been recorded using the sensor network deployed in New York [3].
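The multi-input architecture outlined above (a CNN branch for the audio, a two-layer fully connected branch for the STC data, and a merge before the fully connected output block) can be sketched as follows. The size of the STC feature vector and all layer widths are illustrative assumptions; the STC branch assumes the context metadata has already been encoded numerically.

```python
import torch
import torch.nn as nn

class MultiInputTagger(nn.Module):
    """Sketch of the two-branch model: audio CNN + STC fully connected net.

    Hypothetical dimensions: 12-dim encoded STC vector, 8 output tags.
    """
    def __init__(self, stc_dim=12, n_classes=8):
        super().__init__()
        self.audio_branch = nn.Sequential(         # CNN stack over log-mel input
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # -> (batch, 32) embedding
        )
        self.stc_branch = nn.Sequential(            # 2 fully connected layers
            nn.Linear(stc_dim, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
        )
        self.head = nn.Sequential(                  # merged FC output block
            nn.Linear(32 + 16, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, spec, stc):
        # Concatenate the two branch outputs before the final FC block.
        merged = torch.cat([self.audio_branch(spec), self.stc_branch(stc)], dim=1)
        return torch.sigmoid(self.head(merged))
```

Merging by concatenation lets the output block weigh audio evidence against contextual priors (e.g. certain tags being more likely at certain times or locations).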
All the audio files were recorded with identical microphones and gain settings, each with a duration of 10 seconds and a sample rate of 48 kHz. The recordings are grouped into a train set with 13,538 recordings from 35 sensors, a validate set with 4,308 recordings from 9 sensors and a test set with 669 recordings from 48 sensors. For the evaluation dataset, DCASE also provides new metadata with no tags. The train and validate sets are used in the development stage, while the test set is used for obtaining the final model results.
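For reference, a log mel-spectrogram such as the one used as network input can be computed from one of these 10-second, 48 kHz recordings as sketched below. The frame length, hop size and number of mel bands are illustrative assumptions, not the exact analysis parameters of our system.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters with center frequencies evenly spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        if c > l:
            fb[i - 1, l:c] = (np.arange(l, c) - l) / (c - l)  # rising slope
        if r > c:
            fb[i - 1, c:r] = (r - np.arange(c, r)) / (r - c)  # falling slope
    return fb

def log_mel_spectrogram(y, sr=48000, n_fft=1024, hop=512, n_mels=64):
    # Frame, window and FFT the signal, then project the power spectrum
    # onto the mel filterbank and take the log (with a small floor).
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(mel + 1e-10)  # shape: (n_frames, n_mels)
```

With these assumed parameters, a 10-second clip at 48 kHz (480,000 samples) yields 936 frames of 64 mel bands each.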