A Review of Sound Source Localization with Deep Learning Methods

Pierre-Amaury Grumiaux, Srđan Kitić, Laurent Girin, and Alexandre Guérin

Abstract: This article is a review of deep learning methods for single and multiple sound source localization. We are particularly interested in sound source localization in indoor/domestic environments, where reverberation and diffuse noise are present. We provide an exhaustive topography of the neural-based localization literature in this context, organized according to several aspects: the neural network architecture, the type of input features, the output strategy (classification or regression), the types of data used for model training and evaluation, and the model training strategy. This way, an interested reader can easily comprehend the vast panorama of deep learning-based sound source localization methods. Tables summarizing the literature review are provided at the end of the review for a quick search of methods with a given set of target characteristics.

Index Terms: Sound source localization, deep learning, neural networks, review, survey.

I. INTRODUCTION

Sound source localization (SSL) is the problem of estimating the position of one or several sound sources relative to some arbitrary reference position, which is generally the position of the recording microphone array, based on the recorded multichannel acoustic signals. In most practical cases, SSL is simplified to the estimation of the sources' Direction of Arrival (DoA), i.e. it focuses on the estimation of azimuth and elevation angles, without estimating the distance to the microphone array.¹ Sound source localization has numerous practical applications, for instance in source separation [1], speech recognition [2], speech enhancement [3] or human-robot interaction [4]. As detailed later, this paper focuses on sound sources in the audible range (typically speech and audio signals) in indoor (office or domestic) environments.
P.-A. Grumiaux is with Orange Labs, 35510 Cesson-Sévigné, France, and Univ. Grenoble Alpes, GIPSA-lab, 38000 Grenoble, France. Email: pierreamaury.grumiaux@orange.com
S. Kitić and A. Guérin are with Orange Labs, 35510 Cesson-Sévigné, France. Emails: srdan.kitic@orange.com, alexandre.guerin@orange.com
L. Girin is with Univ. Grenoble Alpes, GIPSA-lab, Grenoble-INP, CNRS, 38000 Grenoble, France. Email: laurent.girin@grenoble-inp.fr

¹ Therefore, unless otherwise specified, in this article we use the terms SSL and DoA estimation interchangeably.

Although SSL is a longstanding and widely researched topic [5], [6], [7], it remains a very challenging problem to date. Traditional SSL methods are based on signal/channel models and signal processing techniques. Although they have brought notable advances to the domain over the years, they are known to perform poorly in difficult yet common scenarios where noise, reverberation and several simultaneously emitting sound sources may be present [8], [9]. In the last decade, the potential of data-driven deep learning (DL) techniques for addressing such difficult scenarios has attracted increasing interest. As a result, more and more SSL systems based on deep neural networks (DNNs) are proposed each year. Most of these studies have indicated the superiority of DNN models over conventional² SSL methods, which has further fueled the growth in the number of scientific papers on deep learning applied to SSL. For example, in the last three years (2019 to 2021), we have witnessed a threefold increase in the number of corresponding publications. In the meantime, there has been no comprehensive survey of the existing approaches, which we deem extremely useful for researchers and practitioners in the domain. Although we can find reviews mostly focused on conventional methods, e.g. [6], [7], [9], [10], to the best of our knowledge only very few have explicitly targeted sound source localization by deep learning methods.
In [11], the authors present a short survey of several existing DL models and datasets for SSL before proposing a DL architecture of their own. References [12] and [13] are very interesting overviews of machine learning applied to various problems in audio and acoustics. Nevertheless, only a short portion of each is dedicated to SSL with deep neural networks.

² Hereafter, the term “conventional” is used to refer to SSL systems that are based on traditional signal processing techniques, and not on DNNs.

arXiv:2109.03465v1 [cs.SD] 8 Sep 2021

A. Aim of the paper

The goal of the present paper is to fill this gap and provide a thorough overview of the SSL literature using deep learning techniques. More precisely, we examined more than 120 papers published after 2013, and we classify and discuss the different approaches in terms of the characteristics of the employed methods and the addressed configurations (e.g. single-source vs. multi-source localization setup, or neural network architecture; the exact list is given in Section I-C). In other words, we present a taxonomy of the DL-based SSL literature. At the end of the paper, we present a summary of the review in the form of two large tables (one for the period 2013-2019 and one for 2020-2021). All the methods that we reviewed are reported in those tables, with a summary of their characteristics presented in different columns. This enables readers to quickly select the subset of methods with a given set of characteristics if they are interested in that particular type of method.

Note that in this review paper, we do not aim to evaluate and compare the performance of the different systems. Due to the large number of neural-based SSL papers and the diversity of configurations, such a contribution would be very difficult and cumbersome (albeit very useful), especially because the