Tracking A Moving Speaker using Excitation Source Information

Vikas C. Raykar, Ramani Duraiswami
Perceptual Interfaces & Reality Laboratory, Institute of Advanced Computer Studies, University of Maryland, College Park, MD 20742
Email: {vikas,ramani}@umiacs.umd.edu

B. Yegnanarayana, S.R. Mahadeva Prasanna
Speech & Vision Laboratory, Department of Computer Science and Engg., Indian Institute of Technology, Chennai-600 036, India
Email: {yegna,prasanna}@cs.iitm.ernet.in

Abstract

Microphone arrays are widely used to detect, locate, and track a stationary or moving speaker. The first step is to estimate the time delay between the speech signals received by a pair of microphones. Conventional methods such as generalized cross-correlation are based on the spectral content of the vocal tract system in the speech signal. This spectral content is affected by degradations of the speech signal caused by noise and reverberation. However, features corresponding to the excitation source of speech are less affected by such degradations. This paper proposes a novel method to estimate time delays using the excitation source information in speech. The estimated delays are used to obtain the position of the moving speaker. The proposed method is compared with the spectrum-based approach using real data from a microphone array setup.

1. Introduction

Many applications require the capture of high-quality speech from users who are not tethered to a close-speaking microphone [1, 2]. In such conditions, locating and tracking the speaker in the acoustical environment is essential for effective communication. For instance, tracking a moving speaker is important in applications such as video-conferencing and meeting or lecture summarization, where the speaker may be moving continuously. In this case, information about the moving speaker can be obtained from the speech signal.
This information can then be fed to a video system for actuating camera pan-tilt operations to keep the speaker in focus automatically [3, 4]. This provides a significant improvement in the overall effect of audio-visual communication for the far-end listeners. Tracking a moving speaker is also useful in multispeaker processing, in which speech from a particular speaker may be enhanced with respect to others, or with respect to noise sources.

The speech signal received from a speaker in an acoustical environment is corrupted both by additive noise and by room reverberation. In the case of a moving speaker, this is further complicated by the change in the characteristics of reverberation as the speaker moves from one place to another, due to the variability of the room impulse response with both source and receiver locations. One effective way of handling such a situation is to use a set of spatially distributed microphones for recording the speech. The signals received by the microphones are processed to obtain the time delay between pairs of microphones. The estimated time delays for pairs of microphones can be used to compute the location of the speaker, which can then be used for tracking.

Most methods for time delay estimation are based on finding the time lag which maximizes the cross-correlation between filtered versions of the received signals. The most commonly used method is the Generalized Cross Correlation (GCC) method proposed by Knapp and Carter [5]. The GCC function R_{x_1 x_2}(τ) is computed as [5]

    R_{x_1 x_2}(τ) = ∫_{-∞}^{∞} W(ω) X_1(ω) X_2*(ω) e^{jωτ} dω    (1)

where X_1(ω) and X_2(ω) are the Fourier transforms of the microphone signals x_1(t) and x_2(t), respectively, and W(ω) is the weighting function. The two commonly used weighting functions are the Phase Transform (PHAT) and the Maximum Likelihood (ML) weighting [5]. The ML weighting function performs well for low room reverberation.
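As an illustrative sketch (not the implementation used in this paper), the GCC function of Eq. (1) with the PHAT weighting can be evaluated in discrete time using FFTs; the function name, the zero-padding scheme, and the small regularization constant below are our own choices:

```python
import numpy as np

def gcc_phat(x1, x2, fs):
    """Estimate the time delay of x1 relative to x2 via the GCC
    function of Eq. (1) with PHAT weighting, computed with FFTs."""
    n = len(x1) + len(x2)          # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    # PHAT weighting W(w) = 1/|X1(w) X2*(w)| whitens the cross-spectrum,
    # keeping only the phase (small constant guards against division by zero).
    r = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)
    # Reorder so that lag 0 sits in the middle of the correlation sequence.
    max_lag = n // 2
    r = np.concatenate((r[-max_lag:], r[:max_lag + 1]))
    lag = np.argmax(np.abs(r)) - max_lag
    return lag / fs                # delay in seconds

# Synthetic check: x1 is x2 delayed by exactly 5 samples.
rng = np.random.default_rng(0)
fs = 16000
s = rng.standard_normal(1024)
x1 = np.concatenate((np.zeros(5), s))
x2 = np.concatenate((s, np.zeros(5)))
print(gcc_phat(x1, x2, fs) * fs)   # ≈ 5.0 samples
```

With this sign convention, a positive delay means x1 lags x2; resolution is limited to one sample unless the correlation peak is interpolated.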
As the room reverberation increases, this method shows severe performance degradation [6]. The PHAT weighting W_PHAT(ω) is the other extreme, where the magnitude spectrum is completely flattened out, and is given by W_PHAT(ω) = 1/|X_1(ω) X_2*(ω)|. By flattening out the magnitude spectrum, the resulting peak in the GCC function corresponds to the dominant delay. However, the disadvantage is that it works well only when the noise level is low. None of these methods exploits the mechanism of speech production to obtain robust estimates. Recently, Brandstein [7] proposed a method based on explicit knowledge of the periodicity of voiced speech.

Most of the existing methods use spectral features, which in the case of speech mostly correspond to the vocal tract system information. The spectral features are corrupted during transmission by the medium, noise, and room reverberation. However, we show that the features corresponding to the excitation source information are robust to such degradations. We discuss methods to extract the excitation source information from the speech signal and use it to estimate the time delay.

The paper is organized as follows: In Section 2, a method for time-delay estimation using the excitation source information is discussed. A method for tracking a moving speaker using the estimated delays from the excitation source information is proposed in Section 3. Section 4 describes experimental results, as well as a comparison with a spectral GCC-PHAT approach. The paper concludes with a summary of the present work and a discussion of possible extensions.

2. Time-Delay Estimation using Excitation Source Information

Speech is the result of excitation of a time-varying vocal tract system with time-varying excitation [8]. The common and significant mode of excitation of the vocal tract system is the voiced

EUROSPEECH 2003 - GENEVA 69