Cost Function for Sound Source Localization with Arbitrary Microphone Arrays

Ivan J. Tashev, Microsoft Research Labs, Redmond, WA 98051, USA, ivantash@microsoft.com
Long Le, Dept. of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, IL 61820, USA, longle1@illinois.edu
Vani Gopalakrishna, Andrew Lovitt, Microsoft Corporation, Redmond, WA 98051, USA, {vanig, anlovi}@microsoft.com

Abstract—The sound source localizer is an important part of any microphone array processing block. Its major purpose is to determine the direction of arrival of the sound source so that a beamformer can aim its beam towards this direction. In addition, the direction of arrival can be used for meeting diarization, camera pointing, and sound source separation. Multiple algorithms and approaches exist, targeting different settings and microphone arrays. In this paper we treat the sound source localizer as a classifier and use the phase differences and magnitude proportions in the microphone channels as features. To determine the proper mix, we propose a novel cost function that measures localization capability. The resulting algorithm is fast and suitable for real-time implementations. It works well with different microphone array geometries and with both omnidirectional and unidirectional microphones.

Index Terms—sound source localization, phase differences, magnitude proportions, cost function, various geometries.

I. INTRODUCTION

Localization of sound sources is part of any microphone array that uses beamsteering to point the listening beam towards the direction of the sound source. The output of the localizer is also used by post-filters to further suppress unwanted sound sources [1], [2]. The idea of spatial noise suppression was proposed in [3] and further developed in [4], where a de facto sound source localizer per frequency bin is used to suppress sound sources coming from undesired directions for each frequency bin separately.
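The two feature families named in the abstract, per-bin phase differences and magnitude proportions between microphone channels, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name, pair enumeration, and `eps` guard are our own assumptions.

```python
import numpy as np

def pairwise_features(frame_spectra):
    """Per-bin phase differences and magnitude proportions for every
    microphone pair of one short-time frame (illustrative sketch).

    frame_spectra: complex array, shape (n_mics, n_bins) -- one STFT
                   frame per microphone.
    Returns (phase_diffs, mag_ratios), each of shape (n_pairs, n_bins).
    """
    n_mics, _ = frame_spectra.shape
    pairs = [(i, j) for i in range(n_mics) for j in range(i + 1, n_mics)]
    phase_diffs, mag_ratios = [], []
    eps = 1e-12  # avoid division by zero in silent bins
    for i, j in pairs:
        # Phase of the cross-spectrum is the inter-channel phase difference
        cross = frame_spectra[i] * np.conj(frame_spectra[j])
        phase_diffs.append(np.angle(cross))
        # Magnitude proportion between the two channels, per bin
        mag_ratios.append(np.abs(frame_spectra[i]) /
                          (np.abs(frame_spectra[j]) + eps))
    return np.array(phase_diffs), np.array(mag_ratios)
```

For omnidirectional far-field capture the magnitude ratios stay near unity, while for directional microphones they carry direction information — which is why the paper proposes mixing both feature types.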
In [4], the probability of the sound source coming from a given direction is estimated based on the phase differences only. A similar approach, adapted for a very small microphone array with directional microphones, was proposed in [5]. It uses both magnitudes and phases to distinguish desired from undesired sound sources. Overall, localization of sounds with microphone arrays is a well-studied area with multiple algorithms defined over the years.

The overall architecture of a sound source localizer is described in [6]. Typically, a sound source localizer (SSL) works in the frequency domain and consists of a voice activity detector, a per-frame sound source localizer (invoked when activity is detected), and a sound source tracker (operating across multiple frames). In this paper we discuss algorithms for the per-frame SSL only. It is assumed that the microphone array geometry is known in advance: the position, type, and orientation of each microphone.

There are two major approaches to estimating the direction of arrival (DOA) of a sound source with microphone arrays: delay estimation and steered response power (SRP). The first approach splits the microphone array into pairs and estimates the time difference of arrival for each pair, usually using the generalized cross-correlation function (GCC) [7]. The DOA for a pair can be determined from the estimated time difference and the distance between the two microphones. Simply averaging the estimated DOAs across all pairs provides poor results, as some of the pairs may have detected stronger reflections. In [8] the DOA is determined by combining the interpolated values of the GCCs at the delays corresponding to a hypothetical DOA. By scanning the space of DOAs, the direction with the highest combined correlation can be determined. Since this process is computationally expensive, a coarse-to-fine scanning is proposed in [9].
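The pair-wise delay-estimation approach above can be sketched with a PHAT-weighted GCC followed by the delay-to-angle mapping. This is a generic illustration of GCC [7], not the specific combination schemes of [8], [9]; the function names, the FFT length, and the assumed speed of sound are ours.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value at room temperature

def gcc_phat_delay(x, y, fs, n_fft=1024):
    """Estimate the time difference of arrival (seconds) between two
    channels via PHAT-weighted generalized cross-correlation.
    Positive result means channel x arrives later than channel y."""
    X = np.fft.rfft(x, n=n_fft)
    Y = np.fft.rfft(y, n=n_fft)
    R = X * np.conj(Y)
    R /= np.abs(R) + 1e-12            # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n_fft)
    # Rearrange so that lag 0 sits in the middle of the array
    cc = np.concatenate((cc[-(n_fft // 2):], cc[:n_fft // 2 + 1]))
    lag = int(np.argmax(np.abs(cc))) - n_fft // 2
    return lag / fs

def pair_doa(tau, mic_distance):
    """Map a pair delay to a direction of arrival (radians off broadside),
    using the far-field relation sin(theta) = c * tau / d."""
    s = np.clip(SPEED_OF_SOUND * tau / mic_distance, -1.0, 1.0)
    return np.arcsin(s)
```

As the text notes, averaging `pair_doa` outputs across all pairs is fragile when some pairs lock onto reflections, which motivates combining the GCC values themselves over a scanned set of hypothetical DOAs.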
SRP approaches vary from simply steering a set of beams and picking the direction with the highest response power, to more sophisticated weighting of the frequency bins based on the noise model [10]. One of the most commonly used and precise SSL algorithms is MUltiple SIgnal Classification (MUSIC) [11]; it is also one of the most computationally expensive. A good overview of the classic sound source localization algorithms can be found in Chapter 8 of [12].

In this paper we address two problems. The first is that most of the classic SSL algorithms are computationally expensive (GCC, FFTs, interpolations), which makes them difficult to implement in real-time systems. The second is that in many cases SSL algorithms are tailored to a given microphone array geometry (linear or circular; large or small; omnidirectional or unidirectional microphones) and their performance degrades for different geometries. To address this degradation, especially for microphone arrays with directional microphones pointing in different directions, we propose to utilize the magnitudes as a feature. Practically all SSL algorithms assume omnidirectional microphones, and for far-field sound sources there is then no substantial difference in magnitudes across the channels. We assume that the same algorithm will be used for different microphone array geometries, and that the geometry is known before processing starts, which allows faster execution during runtime. The general idea is to extract a set of features (differences in phases