IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING

A Hybrid Approach for Speaker Tracking Based on TDOA and Data-Driven Models

Bracha Laufer-Goldshtein, Student Member, IEEE, Ronen Talmon, Member, IEEE, and Sharon Gannot, Senior Member, IEEE

Abstract—The problem of speaker tracking in noisy and reverberant enclosures is addressed. We present a hybrid algorithm, combining traditional tracking schemes with a new learning-based approach. A state-space representation, consisting of propagation and observation models, is learned from signals measured by several distributed microphone pairs. The proposed representation is based on two data modalities: high-dimensional acoustic features representing the full reverberant acoustic channels, and low-dimensional TDOA estimates. The state-space representation is accompanied by a statistical model based on a Gaussian process, which is used to relate the variations of the acoustic channels to the physical variations of the associated source positions, thereby forming a data-driven propagation model for the source movement. In the observation model, the source positions are nonlinearly mapped to the associated TDOA readings. The obtained propagation and observation models establish the basis for employing an extended Kalman filter (EKF). Simulation results demonstrate the robustness of the proposed method in noisy and reverberant conditions.

Index Terms—speaker tracking, time difference of arrival (TDOA), relative transfer function (RTF), extended Kalman filter (EKF), Gaussian process.

I. INTRODUCTION

Speaker localization and tracking in reverberant enclosures is required in various audio applications, including automatic camera steering in teleconferencing [1], beamforming [2], source separation [3], [4], and robot audition [5], [6]. Conventional localization methods are implemented by either a single-step optimization directly on the measured signals, or a dual-step approach.
In the first category, the position is estimated, for example, by a grid search over the output power of a beamformer steered to all potential locations [7], [8], or by high-resolution methods such as the multiple signal classification (MUSIC) algorithm [9]. In dual-step approaches, the first stage estimates the TDOAs of several microphone pairs [10]–[12]. Then, in the second stage, the TDOA readings are combined to perform the actual localization [13], [14].

Bracha Laufer-Goldshtein and Sharon Gannot are with the Faculty of Engineering, Bar-Ilan University, Ramat-Gan, 5290002, Israel (e-mail: Bracha.Laufer@biu.ac.il, Sharon.Gannot@biu.ac.il); Ronen Talmon is with the Viterbi Faculty of Electrical Engineering, The Technion-Israel Institute of Technology, Technion City, Haifa 32000, Israel (e-mail: ronen@ee.technion.ac.il). Bracha Laufer-Goldshtein is supported by the Adams Foundation of the Israel Academy of Sciences and Humanities. This work was supported in part by a Grant from a joint Lower Saxony-Israeli Project financially supported by the State of Lower Saxony.

In a tracking scenario, the source moves in the enclosure along an approximately continuous trajectory, implying dependence between source positions in successive time steps. Bayesian inference approaches, which model the varying source position as a stochastic process, are widely used. These methods commonly rely on estimated TDOAs, leading to nonlinear and non-Gaussian models, which can be solved, for example, using the unscented Kalman filter, the extended Kalman filter (EKF) [15], and particle filters [16]–[18]. In real environments, the presence of noise or reverberation often yields unreliable observations with spurious peaks, which may lead to severe performance degradation. Several attempts have been made to mitigate the harmful effect of noise and reverberation.
In [19], an extended particle filter (EPF) solution was proposed, where an EKF is used to derive an optimal importance function for a particle filter. A multiple-hypothesis model accounting for the multipath nature of the sound propagation in reverberant enclosures was presented in [16], and was combined with an EPF in [20]. In [21], [22], a tracker was proposed based on a probability hypothesis density (PHD) filter, which is a first-moment approximation of the target probability density. Robust tracking methods were also proposed using sensor networks with special structures, such as spherical microphone arrays [23] and distributed networks [24], [25]. In [26], a robust tracker based on a distributed unscented Kalman filter was proposed, in which an interacting multiple model [27] is used for accommodating the different possible motion dynamics of the speaker, yielding a smoothed trajectory of the speaker's movement in noisy and reverberant environments. Another approach to enhance the localization robustness is to fuse several observation modalities, as was demonstrated in audio-visual tracking methods [28]–[31].

Localization and tracking capabilities can be enhanced using model-based methods, assuming certain structures of either the speech signal or the acoustic channels. In [32], an autoregressive (AR) model of the speech components was used, and in [33], [34] the sources were modelled as sums of harmonically related sinusoids, which can describe many musical instruments and voiced speech. A model for the early reflections of the acoustic channels was presented in [35], based on which the early reflections were iteratively estimated. These models often rely on approximated physical and statistical assumptions, which do not always hold in complex real-world scenarios with high levels of noise and reverberation.
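As a concrete instance of the predefined physical models mentioned above, the following minimal sketch implements the standard free-field TDOA mapping for a microphone pair, its gradient with respect to the source position, and a single EKF measurement update on a position-only state. It is an illustration only and not the method proposed in this paper, which learns its propagation model from data. The free-field assumption ignores reverberation, which is exactly the kind of approximation that may fail in practice. The function names, the nominal speed of sound of 343 m/s, and the diagonal measurement-noise covariance are assumptions made for this example.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed nominal value for this sketch


def tdoa(p, m1, m2, c=SPEED_OF_SOUND):
    """Free-field TDOA (seconds) of a source at p between microphones m1 and m2."""
    return (np.linalg.norm(p - m1) - np.linalg.norm(p - m2)) / c


def tdoa_jacobian(p, m1, m2, c=SPEED_OF_SOUND):
    """Gradient of the free-field TDOA with respect to the source position p."""
    d1, d2 = p - m1, p - m2
    return (d1 / np.linalg.norm(d1) - d2 / np.linalg.norm(d2)) / c


def ekf_update(x, P, z, mic_pairs, r_var):
    """One EKF measurement update of a position state x given TDOA readings z.

    x: prior position estimate, P: prior covariance,
    z: one TDOA reading per microphone pair,
    r_var: assumed TDOA noise variance (same for all pairs).
    """
    # Predicted TDOAs and the linearized observation matrix at the prior estimate
    h = np.array([tdoa(x, m1, m2) for m1, m2 in mic_pairs])
    H = np.vstack([tdoa_jacobian(x, m1, m2) for m1, m2 in mic_pairs])
    R = r_var * np.eye(len(mic_pairs))
    S = H @ P @ H.T + R              # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x + K @ (z - h)
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new
```

In a full tracker, this update would be preceded at each time step by a prediction step derived from a propagation model, which is omitted here.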
Recently, there have been attempts to overcome these limitations by applying data-driven models, rather than predefined physical and statistical models [36]–

This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TASLP.2018.2790707. Copyright (c) 2018 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing pubs-permissions@ieee.org.