SPEAKER DIRECTION FINDING: A COMPARISON OF PRACTICAL APPROACHES IN REVERBERANT ENVIRONMENT

Afsaneh Asaei 1, Shirin Ghanbari 1, Mohammad Javad Taghizadeh 2 and Hossein Sameti 3

1 {asaeiaf, sghanb}@itrc.ac.ir
1 Multimedia Group, Iran Telecommunication Research Center, Tehran, Iran
2 Iran Communication Industries, Tehran, Iran
3 Computer Engineering Faculty, Sharif University of Technology, Tehran, Iran

ABSTRACT

Speaker direction finding techniques have attracted interest because they make it possible to acquire high-quality signals from distant sources. Comparing such techniques is instructive, since the goal is to obtain high-quality signals at reasonable computational complexity. With this aim in mind, this paper presents a critical comparison between two traditional techniques: Time-Difference of Arrival (TDOA) estimation by Generalized Cross Correlation (GCC) and space scanning by the Steered Response Power (SRP) of a beamformer. Each is analyzed under diverse conditions of noise and reverberation. Simulation results and experiments based on real data show that SRP, with short data segments and owing to its averaging over the spatial dimension, achieves better accuracy than GCC. These results have motivated a new method for estimating the source direction from a set of TDOAs based on spatial curvature collision. This paper discusses how this procedure reduces the computational cost by more than 50 times compared to the conventional method of Root Mean Square (RMS) error minimization over the candidate locations.

1. INTRODUCTION

Automatic camera steering [1], video conferencing [2], hearing aids [2], hands-free speech recognition [3] and speaker identification are just a few of the applications that can benefit from speaker direction finding algorithms. The primary goal of such systems is to employ techniques that ensure accuracy.
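As a concrete illustration of the TDOA estimation by GCC named in the abstract, the following is a minimal sketch in Python with NumPy. It uses the phase transform (PHAT) weighting, which is only one possible choice, and it is not the authors' implementation: the function name `gcc_phat`, the sampling rate, the 12-sample delay and the synthetic white-noise signal are all assumptions made for illustration.

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """Estimate the delay of x2 relative to x1, in seconds, with GCC-PHAT.

    Hypothetical helper for illustration; a positive value means x2 lags x1.
    """
    n = len(x1) + len(x2)            # zero-pad so the correlation is linear
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X2 * np.conj(X1)         # cross-power spectrum
    cross /= np.abs(cross) + 1e-12   # PHAT weighting: discard magnitude, keep phase
    cc = np.fft.irfft(cross, n=n)    # generalized cross-correlation
    max_shift = n // 2
    if max_tau is not None:
        # limit the search to physically plausible lags (at least one sample)
        max_shift = max(1, min(int(round(fs * max_tau)), max_shift))
    # reorder so that lags run from -max_shift to +max_shift
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = int(np.argmax(cc)) - max_shift
    return lag / fs

# Synthetic check (assumed values): white noise delayed by 12 samples.
fs = 16000
rng = np.random.default_rng(0)
s = rng.standard_normal(4096)
delay = 12
x1 = s
x2 = np.concatenate((np.zeros(delay), s[:-delay]))
tau = gcc_phat(x1, x2, fs, max_tau=0.002)
tau_reversed = gcc_phat(x2, x1, fs, max_tau=0.002)
```

Restricting `max_tau` to the largest delay the geometry allows (microphone spacing divided by the speed of sound) keeps the peak search within plausible lags.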
Direction finding techniques can be loosely classified into three general categories: (i) those adopting high-resolution spectral concepts, (ii) techniques based upon maximizing the steered response power of a beamformer and (iii) approaches employing time difference of arrival information.

The first category covers any localization scheme that depends on the spatio-spectral correlation matrix. These schemes are all designed for narrowband signals, which complicates their use for speaker localization, since speech is a wideband signal [4]. The second strategy is based on maximizing the output power of a steered beamformer. In this case a beamformer scans over a predefined spatial region by adjusting its steering delays [5]. A filtering process can also be employed to increase accuracy, with the filters designed to boost the power of the desired signal even at the cost of distortion. That is the main distinction between the popular beamforming techniques used in speech acquisition systems and those used for localization.

The last category proceeds in two phases. First, a set of Time-Difference of Arrival (TDOA) values of the wavefront between different microphone pairs is estimated, mostly based on the Generalized Cross Correlation (GCC) function. Then geometrical constraints are used to infer the source position [6]. Owing to its reduced computational cost, this technique has attracted considerable interest. However, pairwise techniques suffer considerably from acoustic multipath propagation. As a solution, various weighting functions have been suggested, with ML (Maximum Likelihood), PHAT (Phase Transform) and P (Pitch Harmonic) being the most prominent.

The paper is organized as follows: Section 2 presents the speaker direction finding strategies, including our new strategy. Section 3 describes the simulation test scenario under which the discussed techniques have been simulated and presents the experimental results. Finally, conclusions are given in Section 4.

Proceedings of SPS-DARTS 2007 (the 2007 The third annual IEEE BENELUX/DSP Valley Signal Processing Symposium)

2. DIRECTION FINDING STRATEGIES

Source position in the spherical coordinate system is represented by azimuth, θ, elevation, ϕ, and range, ρ. Whenever the source distance is much larger than the array aperture size, range becomes ambiguous and the