Acoustic source’s position and orientation estimation by a microphone array network ∗

Alberto Yoshihiro Nakano, Kazumasa Yamamoto, Seiichi Nakagawa

Abstract

In this work, a microphone array network is employed to investigate the position and orientation of an acoustic source in an enclosed environment. For each array, the source’s position and energy-related features (the power level of the recorded signals and the correlation value between pairs of recorded signals) are estimated and used as input to a two-stage artificial neural network (ANN) that has two objectives: first, to find the array that gives the most likely source position, and second, to estimate the source’s orientation. The outputs of the first and second stages are defined as the source’s position and orientation, respectively. Here, the position estimate is determined by geometrical derivation using the time delay of arrival (TDOA) from pairs of microphones.

1 Introduction

Acoustic source localization by microphone arrays [1] is an important task in many practical applications such as videoconferencing, hands-free communication systems and human-machine interaction [2]. Many source localization methods have been proposed, such as SRP-PHAT (Steered Response Power - Phase Transform) [1], based on the maximization of the power obtained by steering the microphone array to all potential source positions, and methods based on TDOA [3] [4], where the optimal position can be estimated either by geometrical triangulation using the array geometry information [5] or by a search procedure that finds the point in space that best matches the estimated set of time delays of the microphone pairs [6].

In this work, the GCC-PHAT (Generalized Cross-Correlation - Phase Transform) method is employed for TDOA estimation [4]. In this method, a phase-dependent normalized cross-power spectrum function is turned into a time-difference-dependent cross-correlation function by the Fourier transform.
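
As a rough illustration of the GCC-PHAT idea described above, the following Python sketch estimates the delay between two discrete signals. It is a minimal sketch and not the authors' implementation: the function name `gcc_phat`, the parameters `fs` and `max_tau`, the zero-padding length, the small regularization constant, and the sign convention (a positive lag means the first signal arrives later than the second) are all assumptions introduced for this example.

```python
import numpy as np


def gcc_phat(x_m, x_n, fs, max_tau=None):
    """Estimate the delay (seconds) of x_m relative to x_n with GCC-PHAT.

    Hypothetical helper: names, padding and sign convention are assumptions.
    """
    # Zero-pad so the circular correlation behaves like a linear one.
    n = len(x_m) + len(x_n)
    X_m = np.fft.rfft(x_m, n=n)
    X_n = np.fft.rfft(x_n, n=n)

    # Cross-power spectrum, normalized by its magnitude (PHAT weighting),
    # so only the phase information is kept. The constant avoids 0/0.
    cross = X_m * np.conj(X_n)
    cross = cross / (np.abs(cross) + 1e-12)

    # Back to the lag domain: a cross-correlation function of the delay.
    r = np.fft.irfft(cross, n=n)

    # Restrict the search to physically plausible lags if max_tau is given.
    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(round(fs * max_tau)), max_shift))

    # Reorder so that lags run from -max_shift to +max_shift.
    r = np.concatenate((r[-max_shift:], r[:max_shift + 1]))

    # The TDOA estimate is the lag that maximizes the correlation.
    lag = int(np.argmax(r)) - max_shift
    return lag / fs, r


# Usage with synthetic data: white noise and a copy delayed by 5 samples.
fs = 16000
rng = np.random.default_rng(0)
ref = rng.standard_normal(2048)
delayed = np.concatenate((np.zeros(5), ref[:-5]))
tau, _ = gcc_phat(delayed, ref, fs, max_tau=1e-3)
print(f"estimated delay: {tau * fs:.0f} samples")
```

Limiting the lag search with `max_tau` reflects the fact that, for a microphone pair with known spacing, the delay cannot exceed the spacing divided by the speed of sound; the value used here is arbitrary.
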
The value of TDOA is defined as the time delay that maximizes the cross-correlation function. Correct position estimation clearly depends heavily on correct delay estimation; a bad TDOA estimate results in an unreliable position estimate. To improve TDOA estimation, we perform the following steps:

• subband selection in the frequency domain by frames;
• reliable frame selection in the time domain [7];
• reliable frame combinations.

Once the TDOAs are estimated, the method proposed in [5], derived for a “T”-shaped microphone array composed of 4 microphones, is employed for position estimation (Fig. 2). In this method, the center microphone is taken as a reference and time delays are calculated to the other three microphones. The set of TDOAs is then used to estimate the location of the source by geometrical derivation in three-dimensional space. In this work, a microphone array network composed of eight arrays distributed on the walls and ceiling is employed for position estimation. Estimates are obtained per array, and an ANN is used to find the array that gives the best position estimate. Additionally, we try to estimate the acoustic source orientation.

This paper is organized as follows: in Section 2 we briefly describe the conventional TDOA estimation method based on GCC-PHAT and some modifications to obtain a more accurate estimate. Section 3 presents the source’s position and orientation estimation method. In Section 4, we describe the experimental conditions and results, and we conclude in Section 5.

∗ Japanese title: “[Position ...] by a microphone network array”; Alberto Yoshihiro Nakano, [...], [...] ([...] University).

2 TDOA estimation method

2.1 Conventional TDOA estimation based on GCC-PHAT

Consider $P$ arrays, each with $Q$ microphones, where each microphone is denoted $q_m$, for $m = 1, \dots, Q$.
Given a signal source $s(t)$, the signal at each microphone can be represented as

\[ x_{pq_m}(t) = h_{pq_m}(t) * s(t) + n(t), \tag{1} \]

where $p \in \{1, \dots, P\}$, $q_m \in \{q_1, q_2, \dots, q_Q\}$, “$*$” denotes convolution, $h_{pq_m}(t)$ is the reverberant impulse response between the source $s(t)$ and each microphone, and $n(t)$ is the additive background noise. The TDOA is estimated by the GCC-PHAT function

\[ R(\tau_{mn}) = \int_{-\infty}^{+\infty} \frac{X_{pq_m}(f)\, X^{*}_{pq_n}(f)}{\left| X_{pq_m}(f)\, X^{*}_{pq_n}(f) \right|}\, e^{-j 2 \pi f \tau_{mn}} \, df, \tag{2} \]

where $R(\tau_{mn})$ is a function of the time delay $\tau_{mn}$ of the microphone pair for $m \neq n$, and $X_{pq_m}(f)$ and $X_{pq_n}(f)$ are the spectral representations of the signals $x_{pq_m}(t)$ and $x_{pq_n}(t)$, respectively. The TDOA estimate $\hat{\tau}_{mn}$ corresponds to the time delay that maximizes $R(\tau_{mn})$:

\[ \hat{\tau}_{mn} = \arg\max_{\tau_{mn}} \left\{ R(\tau_{mn}) \right\}. \tag{3} \]

2.2 Improving TDOA estimation

In practice, the signals $x_{pq_m}(t)$ and $x_{pq_n}(t)$ for $m \neq n$ are segmented into $L$ frames and the cross-correlation functions are estimated frame by frame. It is then reasonable to assume that a more robust cross-correlation function can be obtained by combining the individual contributions of different frames. This is true

3-P-17, Proceedings of the Acoustical Society of Japan Meeting, September 2008, p. 793