DIRECTIONAL ACOUSTIC SOURCE’S POSITION AND ORIENTATION ESTIMATION APPROACH BY A MICROPHONE ARRAY NETWORK

Alberto Yoshihiro Nakano, Kazumasa Yamamoto, Seiichi Nakagawa

Toyohashi University of Technology, Toyohashi 441-8580, Japan
Department of Information and Computer Sciences
{alberto,kyama,nakagawa}@slp.ics.tut.ac.jp

ABSTRACT

In this paper we present an approach for estimating the position and orientation of a directional acoustic source with a microphone array network in an enclosed environment. We assume that at least one array in the network yields an acceptable position estimate, and based on this assumption we try to automatically select this array and determine the source’s orientation. Position estimates are determined indirectly using the time delay of arrival (TDOA) from microphone pairs. A position candidate and energy-related features (the power level of the recorded signals and the correlation value between pairs of recorded signals) are determined beforehand for each array and used as inputs to an artificial neural network (ANN) whose outputs are the orientation and the array with the most likely source position. Additionally, we derive weighting functions from the ANN output and use them to combine the array estimates into a more reliable position estimate.

Index Terms: Microphone array network, Source localization, Source orientation estimation, Neural network

1. INTRODUCTION

Acoustic source localization by microphone arrays [1] is an important task in many practical applications such as videoconferencing [2], hands-free communication systems [3] and human-machine interaction [4]. The source’s orientation also plays an important role in acoustic position localization, because a directional source does not radiate uniformly in all directions, and the quality of the signals recorded by a distant array is affected not only by environmental noise and reverberation but also by the relative orientation [5].
In this work, position estimates are determined by geometrical triangulation using the TDOAs of the microphone pairs of an array, obtained with GCC-PHAT (Generalized Cross-Correlation with Phase Transform) [6]. This work focuses on estimation by triangulation because its low computational cost is desirable for real-time applications. Results are compared with the steered response power with phase transform (SRP-PHAT) [1] position localization method.

GCC-PHAT is known to be robust to reverberation but sensitive to noise [7, 8]. To reduce the effect of noise, frames with a low signal-to-noise ratio (SNR) are discarded, and only frequency bands with high SNR are selected. In a final step, frames are combined to generate a more robust cross-correlation function, resulting in more reliable TDOAs and consequently a more reliable position estimate.

Different source orientation approaches can be found in [5, 9, 10, 11], but in this work an ANN approach, which uses energy-related features and position estimates, is used to select the array that yields the best source position estimate together with the source orientation. In effect, the ANN uses the energy-related features in an attempt to model the source radiation pattern and then predicts the source orientation. Additionally, weighting functions are derived from the ANN output and used to combine the position estimates of different microphone arrays.

The remainder of this paper is organized as follows. In Section 2, we describe the GCC-PHAT function, the TDOA estimation method, and the TDOA-based and SRP-PHAT position estimation methods. The source’s position and orientation estimation method, as well as the weighting functions, are presented in Section 3. Section 4 describes the experiments. Results and conclusions are presented in the last two sections.

We would like to thank the Global COE program “Frontiers of Intelligent Sensing” and MEXT for supporting our research.

2. BACKGROUND

2.1.
Generalized cross-correlation with phase transform (GCC-PHAT) and TDOA estimation method

Consider $P$ arrays, each with $Q$ microphones, where each microphone is denoted $q_m$, for $m = 1, \ldots, Q$. Given a source signal $s(t)$, the signal at each microphone can be represented as

$$x_{pq_m}(t) = h_{pq_m}(t) * s(t) + n(t), \tag{1}$$

where $p \in \{1, \ldots, P\}$, $q_m \in \{q_1, q_2, \ldots, q_Q\}$, "$*$" denotes convolution, $h_{pq_m}(t)$ is the reverberation impulse response between the source $s(t)$ and microphone $q_m$ of array $p$, and $n(t)$ is the additive background noise. GCC-PHAT is defined as

$$R(\tau_{mn}) = \int_{-\infty}^{+\infty} \frac{X_{pq_m}(f)\, X^{*}_{pq_n}(f)}{\left| X_{pq_m}(f)\, X^{*}_{pq_n}(f) \right|}\, e^{-j 2 \pi f \tau_{mn}}\, df, \tag{2}$$

where $R(\tau_{mn})$ is a function of the time delay $\tau_{mn}$ of the microphone pair $(q_m, q_n)$ for $m \neq n$, and $X_{pq_m}(f)$ and $X_{pq_n}(f)$ are the spectral representations of the signals $x_{pq_m}(t)$ and $x_{pq_n}(t)$, respectively.

The TDOA estimate $\hat{\tau}_{mn}$ is the time delay that maximizes $R(\tau_{mn})$:

$$\hat{\tau}_{mn} = \arg\max_{\tau_{mn}} \{ R(\tau_{mn}) \}. \tag{3}$$

Noise and reverberation in the test environment degrade the performance of the GCC-PHAT method for TDOA estimation. To deal with these influences, a robust GCC-PHAT function is obtained by selecting cross-power spectrum components with high energy (subband selection), selecting frames with high SNR (reliable frame selection), and combining these frames. The process is illustrated in Fig. 1. In subband selection, a binary mask is created to select only high

606 978-1-4244-3677-4/09/$25.00 ©2009 IEEE
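As an illustration of Eqs. (2) and (3), the following is a minimal discrete-time sketch of GCC-PHAT delay estimation for one microphone pair. It is not the implementation used in this work: the function name `gcc_phat_tdoa`, the zero-padding length, the small constant added to the denominator, and the optional `max_tau` bound are all assumptions made for the example, and the subband and frame selection steps are omitted. The inverse FFT follows NumPy's sign convention, so a positive result means the first signal lags the second; the sign implied by Eq. (2) may be the opposite.

```python
import numpy as np

def gcc_phat_tdoa(x_m, x_n, fs, max_tau=None):
    """Estimate the delay (in seconds) of x_m relative to x_n with GCC-PHAT.

    x_m, x_n : 1-D arrays holding one frame from each microphone (hypothetical inputs)
    fs       : sampling frequency in Hz
    max_tau  : optional bound on |delay| in seconds (assumption, e.g. spacing / c)
    """
    n = len(x_m) + len(x_n)                  # zero-pad to avoid circular wrap-around
    X_m = np.fft.rfft(x_m, n=n)
    X_n = np.fft.rfft(x_n, n=n)

    cross = X_m * np.conj(X_n)               # cross-power spectrum
    cross = cross / (np.abs(cross) + 1e-12)  # PHAT weighting: keep phase only

    r = np.fft.irfft(cross, n=n)             # correlation as a function of lag

    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)

    # Candidate lags from -max_shift to +max_shift; negative indices wrap around.
    lags = np.arange(-max_shift, max_shift + 1)
    best_lag = lags[np.argmax(r[lags])]      # discrete counterpart of Eq. (3)
    return best_lag / fs
```

For a pair of frames `frame_m` and `frame_n` sampled at 16 kHz, a call such as `gcc_phat_tdoa(frame_m, frame_n, 16000, max_tau=0.001)` returns the delay in seconds. The resolution is limited to one sample; interpolating around the peak is a common refinement that is not shown here.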