DIRECTIONAL ACOUSTIC SOURCE’S POSITION AND ORIENTATION ESTIMATION
APPROACH BY A MICROPHONE ARRAY NETWORK
Alberto Yoshihiro Nakano, Kazumasa Yamamoto, Seiichi Nakagawa
Toyohashi University of Technology, Toyohashi 441-8580, Japan
Department of Information and Computer Sciences
{alberto,kyama,nakagawa}@slp.ics.tut.ac.jp
ABSTRACT
In this paper we present an approach for estimating the position and
orientation of a directional acoustic source with a microphone array
network in an enclosed environment. We assume that at least one
array in the network yields an acceptable position estimate, and
based on this assumption we try to automatically select this array and
determine the source’s orientation. Position estimates are determined
indirectly using the time delay of arrival (TDOA) from microphone
pairs. A position candidate and energy-related features (the power
level of the recorded signals and the correlation value between pairs
of recorded signals) are determined a priori for each array and used
as input to an artificial neural network (ANN) whose outputs are the
orientation and the array with the most likely source position.
Additionally, we derive weighting functions based on the ANN
output and use them to combine array estimates to obtain a more
reliable position estimate.
Index Terms— Microphone array network, Source localization,
Source orientation estimation, Neural network
1. INTRODUCTION
Acoustic source localization by microphone arrays [1] is an important
task in many practical applications such as videoconferencing [2],
hands-free communication systems [3] and human-machine interaction
[4]. The source’s orientation also plays an important role in acoustic
position localization, because a directional source does not radiate
uniformly in all directions and the quality of the signals recorded by
a distant array is affected not only by environmental noise and
reverberation, but also by the relative orientation [5]. In this work
the position estimates are determined by geometrical triangulation
using the TDOA of the microphone pairs of an array, determined
by GCC-PHAT (Generalized Cross-Correlation - Phase Transform)
[6]. This work focuses on estimation by triangulation because it
requires little computation, which is desirable for real-time
applications. Results are compared with the steered response power
with phase transform (SRP-PHAT) [1] position localization method.
GCC-PHAT is known to be robust to reverberation but sensitive to noise
[7, 8]. To reduce the effect of noise, frames with low signal-to-noise
ratio (SNR) are disregarded, and only frequency bands with high SNR are
selected. In a final step, frames are combined to generate a more
robust cross-correlation function, resulting in a more reliable TDOA
and consequently a more trustworthy position estimate.
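As a rough illustration of how TDOA values can lead to a position by triangulation, the following sketch converts a TDOA into a bearing and intersects two bearings in a plane. It is not the triangulation procedure used in this work, whose geometry is not detailed here. The far-field (plane-wave) model, the 2-D setting, the speed of sound of 343 m/s, and the function names `tdoa_to_bearing` and `intersect_bearings` are all assumptions made for the example.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def tdoa_to_bearing(tdoa, mic_spacing, c=SPEED_OF_SOUND):
    """Angle (radians) between the source direction and the pair's broadside.

    Assumes a far-field source, so that tdoa = mic_spacing * sin(angle) / c.
    """
    # Clip to the valid arcsin range in case noise pushes |c*tdoa| above spacing.
    ratio = np.clip(c * tdoa / mic_spacing, -1.0, 1.0)
    return np.arcsin(ratio)


def intersect_bearings(p1, angle1, p2, angle2):
    """Intersect two 2-D bearing lines.

    p1, p2: (x, y) centres of two microphone pairs or arrays.
    angle1, angle2: bearings in room coordinates (radians from the x-axis).
    Raises numpy.linalg.LinAlgError if the bearings are parallel.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    d1 = np.array([np.cos(angle1), np.sin(angle1)])
    d2 = np.array([np.cos(angle2), np.sin(angle2)])
    # Solve p1 + t1*d1 = p2 + t2*d2 for (t1, t2).
    t = np.linalg.solve(np.column_stack((d1, -d2)), p2 - p1)
    return p1 + t[0] * d1
```

Converting a pair-relative bearing into room coordinates depends on the orientation of the pair, and a single pair cannot distinguish front from back, so a practical system needs the array geometry to resolve both.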
Different source orientation approaches can be found in [5, 9,
10, 11], but in this work an ANN approach, which uses energy-related
features and position estimates, is used to select the array which
yields the best source position estimate together with the source
orientation. In effect, the ANN uses the energy-related features
in an attempt to model the source radiation pattern and then predicts
the source orientation. Additionally, weighting functions are derived
from the ANN output and used to combine the position estimates of
different microphone arrays.
This paper is organized as follows. In Section 2, we describe the
GCC-PHAT function, the TDOA estimation method, and the TDOA-based
and SRP-PHAT position estimation methods. The source’s position and
orientation estimation method and the weighting functions are
presented in Section 3. Section 4 describes the experiments. Results
and conclusions are given in the last two sections.
We would like to thank the Global COE program “Frontiers of
Intelligent Sensing” and MEXT for supporting our research.
2. BACKGROUND
2.1. Generalized cross-correlation with phase transform (GCC-
PHAT) and TDOA estimation method
Consider $P$ arrays, each one with $Q$ microphones, where each
microphone is denoted $q_m$, for $m = 1, \ldots, Q$. Given a signal
source $s(t)$, the signal at each microphone can be represented as

$$x_{pq_m}(t) = h_{pq_m}(t) * s(t) + n(t), \qquad (1)$$

where $p \in \{1, \ldots, P\}$, $q_m \in \{q_1, q_2, \ldots, q_Q\}$,
“$*$” denotes convolution, $h_{pq_m}(t)$ is the reverberation impulse
response between the source $s(t)$ and microphone $q_m$ of array $p$,
and $n(t)$ is the additive background noise. GCC-PHAT is defined as

$$R(\tau_{mn}) = \int_{-\infty}^{+\infty} \frac{X_{pq_m}(f)\, X^{*}_{pq_n}(f)}{\left| X_{pq_m}(f)\, X^{*}_{pq_n}(f) \right|}\, e^{-j 2 \pi f \tau_{mn}}\, df, \qquad (2)$$

where $R(\tau_{mn})$ is a function of the time delay $\tau_{mn}$ between
the microphones of a pair, with $m \neq n$, and $X_{pq_m}(f)$ and
$X_{pq_n}(f)$ are the spectral representations of the signals
$x_{pq_m}(t)$ and $x_{pq_n}(t)$, respectively.
The TDOA estimate $\hat{\tau}_{mn}$ corresponds to the time delay that
maximizes $R(\tau_{mn})$:

$$\hat{\tau}_{mn} = \arg\max_{\tau_{mn}} R(\tau_{mn}). \qquad (3)$$

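A discrete-time counterpart of Eqs. (2) and (3) can be sketched with FFTs, as below. This is a minimal illustration rather than the implementation used in this work. The function name `gcc_phat_tdoa`, the zero-padding length, the small constant `eps`, and the optional `max_tau` search limit are assumptions made for the example. The sign of the result follows NumPy's inverse-FFT convention, which may be opposite to the exponent sign written in Eq. (2).

```python
import numpy as np


def gcc_phat_tdoa(x_m, x_n, fs, max_tau=None, eps=1e-12):
    """Estimate the TDOA (seconds) between two microphone signals with GCC-PHAT.

    With the convention used here, a positive result means x_m lags x_n.
    Returns the TDOA estimate and the correlation values over the searched lags.
    """
    # Zero-pad so the circular correlation behaves like a linear one.
    n = len(x_m) + len(x_n)
    X_m = np.fft.rfft(x_m, n=n)
    X_n = np.fft.rfft(x_n, n=n)

    # Cross-power spectrum normalised by its magnitude (PHAT weighting).
    cross = X_m * np.conj(X_n)
    cross = cross / (np.abs(cross) + eps)
    r = np.fft.irfft(cross, n=n)

    # Restrict the search to physically plausible lags if a limit is given.
    max_shift = (n - 1) // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)

    # Reorder so that index 0 corresponds to lag -max_shift.
    r = np.roll(r, max_shift)[: 2 * max_shift + 1]
    lags = np.arange(-max_shift, max_shift + 1)

    # Eq. (3): the estimate is the lag at which the correlation peaks.
    return lags[np.argmax(r)] / fs, r
```

In practice `max_tau` would typically be set to the microphone spacing divided by the speed of sound, since larger delays cannot come from a direct path. The resolution of this sketch is one sample; interpolating around the peak would be needed for finer estimates.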
Noise and reverberation in the test environment affect the performance
of the GCC-PHAT method for TDOA estimation. To deal with
these influences, a robust GCC-PHAT function is obtained by selecting
cross-power spectrum components with high energy (subband
selection), selecting frames with high SNR (reliable frame selection)
and combining these frames. The process is illustrated in Fig.
1. In subband selection, a binary mask is created to select only high
606 978-1-4244-3677-4/09/$25.00 ©2009 IEEE