Speaker Localization Using a Two-Channel Microphone on the SIG-2 Humanoid Robot
Ui-Hyun Kim† Toru Takahashi† Tetsuya Ogata† Hiroshi G. Okuno†
†Graduate School of Informatics, Kyoto University

1. INTRODUCTION
Speaker localization is one of the most important techniques for achieving natural and intelligent human-robot interaction (HRI), because a robot needs to 1) identify the direction of a talker from the acoustic signals measured by its microphones, and 2) turn toward the talker's position to signal that it is ready to receive an order, or to express its interest in the conversation. Moreover, speaker localization with two-channel microphones is desirable for humanoid robots for two reasons: 1) cost reduction and 2) portability. Stereo input devices are much cheaper than multi-channel analog-to-digital (AD) devices. Furthermore, the market for binaural audition hardware is growing, and such hardware can easily be embedded in PCs, TVs, and other ICT devices. The development of speaker localization (SL) with two microphones is therefore a necessity for robotics [1].

Speaker localization is usually based on voice activity detection (VAD) and sound source localization (SSL). One of the most popular algorithms for SSL is the generalized cross-correlation (GCC) method with phase transform (PHAT) weighting. The GCC-PHAT method is well known to perform well in noisy and reverberant environments [2], and many robot audition systems built on it have steadily improved in performance. However, the conventional GCC-PHAT method still has two issues that should be considered and improved: 1) diffraction of the sound wave by the robot's head, since the microphones are not in free space, and 2) low performance around the lateral directions of the sound source when only one pair of microphones is used.
In this paper, we address these two issues with the following solutions: 1) a formulation that accounts for the diffraction of the sound wave, assuming that the shape of the robot's head is a circle, and 2) a maximum likelihood (ML)-based direction-of-arrival (DOA) estimator in the frequency domain. These solutions were implemented and evaluated experimentally in our speaker localization system using a two-channel microphone embedded on the SIG-2 humanoid robot.

2. DIRECTION-OF-ARRIVAL ESTIMATION
This paper employs a time-frequency domain approach with a T-point short-time Fourier transform (STFT). The signal received at the m-th microphone can be mathematically modeled as

  X_m(f, n) = α S(f, n) + N_m(f, n)    (1)

where X_m(f, n), S(f, n), and N_m(f, n) are the f-th elements of the STFTs of the measured signal at the m-th microphone, the sound source, and uncorrelated additive noise, respectively, at the n-th time-frame index; α is an attenuation factor, f ∈ {0, fs/T, ···, fs(T−1)/T} is a frequency, fs is the sampling frequency, and T is the frame size of the STFT.

2.1 CONVENTIONAL GCC-PHAT METHOD
The GCC-PHAT method estimates the time difference of arrival (TDOA) τ_ij between two microphones i and j as follows [3]:

  R_{x_i x_j}(f, n) = G_PHAT(f, n) X_i(f, n) X_j^*(f, n)    (2)

  csp_ij(t, n) = ISTFT[ R_{x_i x_j}(f, n) ] = Σ_{f=0}^{fs(T−1)/T} R_{x_i x_j}(f, n) e^{j2πft}    (3)

  τ̂_ij(n) = argmax_t ( csp_ij(t, n) )    (4)

where

  G_PHAT(f, n) = 1 / | X_i(f, n) X_j^*(f, n) |    (5)

R_{x_i x_j} is the cross-correlation function, * denotes the complex conjugate, csp_ij is the coefficient of the cross-power spectrum phase analysis (CSP), t is the time index, and ISTFT is the inverse short-time Fourier transform. After the TDOA τ̂_ij is estimated, the sound source direction is derived from the following equation.
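As a minimal sketch, the GCC-PHAT pipeline of Equations (2)–(5) might be implemented as follows. This is our own NumPy illustration, not the paper's implementation: the function name, the 16 kHz rate, and the synthetic 5-sample delay are assumptions for the example.

```python
import numpy as np

def gcc_phat_tdoa(x_i, x_j, fs):
    """Estimate the TDOA (in seconds) between two channels with GCC-PHAT."""
    n = len(x_i) + len(x_j)                 # zero-pad to avoid circular wrap
    X_i = np.fft.rfft(x_i, n=n)
    X_j = np.fft.rfft(x_j, n=n)
    cross = X_i * np.conj(X_j)              # weighted cross-spectrum, Eq. (2)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting, Eq. (5)
    csp = np.fft.irfft(cross, n=n)          # CSP coefficients, Eq. (3)
    max_shift = n // 2
    # reorder so negative lags precede positive lags, then take argmax, Eq. (4)
    csp = np.concatenate((csp[-max_shift:], csp[:max_shift + 1]))
    tau_samples = np.argmax(np.abs(csp)) - max_shift
    return tau_samples / fs

# Synthetic check: channel i is channel j delayed by 5 samples at fs = 16 kHz.
fs = 16000
delay = 5
rng = np.random.default_rng(0)
s = rng.standard_normal(2048)
x_i = np.concatenate((np.zeros(delay), s[:-delay]))   # delayed copy
x_j = s
tau = gcc_phat_tdoa(x_i, x_j, fs)                     # expect +5 / fs seconds
```

Note that the returned lag is quantized to integer samples, which is exactly the resolution limitation discussed in Section 2.2.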
  θ̂_ij(n) = (180/π) sin⁻¹( τ̂_ij(n) c / (d_ij fs) )    (6)

where θ̂_ij(n) is the estimated direction of the sound source, d_ij is the distance between the two microphones, and c is the speed of sound (340.5 m/s in air at 15 °C).

2.2 PROBLEM
This conventional DOA estimation using the GCC-PHAT method has two problems: 1) It does not consider the diffraction of the sound wave when the microphones are not located in free space, for example when they are mounted in the head of a robot or inside an ear. 2) Its resolution is restricted by the sampling frequency. Since the maximum of Equation (3) is found in the time domain through the ISTFT, τ̂_ij must depend on the sampling frequency of the signal. For example, if a sampling

† Ui-Hyun Kim, Toru Takahashi, Tetsuya Ogata, and Hiroshi G. Okuno are with the Speech Media Processing Group, Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto 606-8501, Japan (e-mail: {euihyun, tall, ogata, okuno}@kuis.kyoto-u.ac.jp).
Copyright 2011 Information Processing Society of Japan. All Rights Reserved.
2-15 4C-1, The 73rd National Convention of IPSJ
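The resolution problem of Section 2.2 can be made concrete with a short calculation based on Equation (6). Because the lag t in Equation (4) is an integer number of samples, τ̂_ij can only take values k/fs, so only a finite grid of angles is reachable. The geometry below (d_ij = 0.18 m, fs = 16 kHz) is a hypothetical example of our own; the paper's actual values are not given in this excerpt.

```python
import numpy as np

# Assumed example geometry (not from the paper): microphone spacing 0.18 m,
# sampling rate 16 kHz; c = 340.5 m/s as in Eq. (6).
fs = 16000
d_ij = 0.18
c = 340.5

# Largest integer lag k with |sin(theta)| = |k c / (d_ij fs)| <= 1,
# then every reachable angle from Eq. (6).
k_max = int(d_ij * fs / c)
k = np.arange(-k_max, k_max + 1)
theta = np.degrees(np.arcsin(k * c / (d_ij * fs)))

# The angular grid is coarse near the lateral (endfire) directions:
step_broadside = theta[k_max + 1] - theta[k_max]   # step around 0 degrees
step_lateral = theta[-1] - theta[-2]               # step near +90 degrees
print(f"{len(k)} distinct angles, max |theta| = {theta[-1]:.1f} deg")
print(f"step near broadside: {step_broadside:.1f} deg")
print(f"step near endfire:   {step_lateral:.1f} deg")
```

Under these assumed values only 17 directions are distinguishable, the grid step more than doubles from broadside to endfire, and ±90° is not reachable at all, which illustrates why performance degrades around the lateral directions.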