2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 20-23, 2013, New Paltz, NY
A PROBABILISTIC APPROACH TO ACOUSTIC ECHO CLUSTERING AND SUPPRESSION
Mehrez Souden, Jason Wung, and Biing-Hwang (Fred) Juang
Center for Signal and Image Processing, Georgia Institute of Technology,
75 Fifth Street NW, Atlanta, GA 30308, USA
ABSTRACT
This paper introduces an approach to cluster and suppress acous-
tic echo signals in hands-free, full-duplex speech communication
systems. We employ the instantaneous recursive estimate of the
magnitude squared coherence (MSC) of the echo line signal and the
microphone signal, and model it with a two-component Beta mix-
ture distribution. Since we consider the case of multiple microphone
pickup, we further integrate the normalized recording vector as a
location feature into the proposed approach to achieve reliable soft
decisions on the echo presence. The location information has been
widely used for clustering-based blind source separation, and can
be modeled using a Watson mixture distribution. Simulation evalu-
ations of the proposed method show that it can achieve significant
echo suppression performance.
Index Terms— Acoustic echo suppression, clustering, magni-
tude squared coherence, normalized recording vector.
1. INTRODUCTION
The acoustic coupling between loudspeakers and microphones is
known as one of the major issues that can impede the widespread
use of hands-free full-duplex speech communication systems. To
cope with this issue, several acoustic echo cancellation (AEC) tech-
niques have been proposed [1,2]. In many of the existing solutions
the goal has typically been to estimate a finite impulse response
(FIR) filter, which emulates the echo path in the time domain. The
result is then convolved with the reference loudspeaker signal and
subtracted from the microphone signal. Further suppression of the
residual echo may be necessary and is conventionally dealt with
by frequency-domain techniques such as spectral subtraction and
Wiener filtering [3–5]. These traditional signal enhancement techniques,
although shown to be effective to some degree, assume signal stationarity and normally
do not take full advantage of the sparsity in the speech signal, such
as its harmonic structure and the fractional duty cycle [6]. Newer
techniques that consider these speech-specific characteristics have
been demonstrated to be effective in other applications including
clustering-based blind source separation (BSS) [6–10].
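As a minimal illustration of the conventional scheme described in this section, the following Python sketch convolves the reference loudspeaker signal with an echo-path estimate and subtracts the result from the microphone signal. The function name `cancel_echo`, the filter taps, and the signal lengths are hypothetical choices made for illustration. In practice the echo path has to be estimated adaptively, so the second call uses a deliberately mismatched filter to show why residual echo suppression may still be needed.

```python
import numpy as np

def cancel_echo(mic, ref, h_hat):
    """Subtract the estimated echo (ref convolved with h_hat) from mic."""
    echo_hat = np.convolve(ref, h_hat)[: len(mic)]
    return mic - echo_hat

rng = np.random.default_rng(0)
n = 1000
ref = rng.standard_normal(n)              # far-end (loudspeaker) signal
h = np.array([0.6, 0.3, -0.1, 0.05])      # hypothetical echo path (FIR taps)
near = 0.1 * rng.standard_normal(n)       # stand-in for the near-end signal
mic = np.convolve(ref, h)[:n] + near      # microphone = echo + near-end

# Ideal case: the echo path is known exactly, so only the near-end remains.
residual = cancel_echo(mic, ref, h)

# Mismatched estimate: part of the echo survives as residual echo.
residual_mismatch = cancel_echo(mic, ref, 0.9 * h)
```

With the exact filter the output equals the near-end signal; with the mismatched filter the output carries extra energy from the uncancelled echo.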
In this paper, we approach the problem of acoustic echo sup-
pression in the frequency domain from a clustering-based view-
point. Specifically, we employ instantaneous estimates of the mag-
nitude squared coherence (MSC) between the reference signal and
the microphone signals as a feature to detect the presence of the
echo at individual time-frequency (t-f) slots. The MSC has been successfully
used for double-talk detection by averaging it over multiple frequency
bins and comparing the result to a threshold [11]. In [12, 13],
This work was supported in part by the Natural Sciences and Engineer-
ing Research Council of Canada (NSERC).
Figure 1: A block diagram of the proposed acoustic echo sup-
pression system. The EM algorithm jointly employs the MSC and
normalized microphone signals (location feature) to determine the
near-end posterior probability.
$\hat{X}_{12}(t), \ldots, \hat{X}_{M2}(t)$ are the final estimates
of the echo-free near-end signal at the microphones.
it was modeled using a bimodal Gaussian distribution to determine
soft double-talk decisions when combined with echo suppression
filters. Since the MSC takes values in the range [0, 1], a
better choice of statistical model is the Beta distribution.
Therefore, we propose utilizing a Beta mixture model
with two components that respectively account for the absence and
presence of echo in the microphone signals. We then combine
this model with the location information, which can be captured
when multiple microphones are employed. Earlier investigations,
e.g., [6–8, 10], have shown that the latter cue is extremely useful in
clustering-based BSS. We integrate both features in an expectation
maximization (EM) algorithm, and determine an echo suppression
mask, which is optimal in the maximum likelihood sense.
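The MSC feature and the two-component Beta mixture can be sketched as follows. This is only an illustrative outline under stated assumptions, not the authors' estimator: the recursive smoothing factor `lam`, the Beta shape parameters, and the prior are hypothetical values that are held fixed here, whereas the paper determines the mixture parameters with the EM algorithm and also uses the location feature. The function names are likewise invented for this sketch.

```python
import numpy as np
from math import lgamma

def recursive_msc(X, Y, lam=0.9, eps=1e-12):
    """Recursively smoothed MSC of two STFT arrays of shape (bins, frames)."""
    K, T = X.shape
    pxx = np.full(K, eps)                 # smoothed auto-spectrum of X
    pyy = np.full(K, eps)                 # smoothed auto-spectrum of Y
    pxy = np.zeros(K, dtype=complex)      # smoothed cross-spectrum
    msc = np.empty((K, T))
    for t in range(T):
        pxx = lam * pxx + (1 - lam) * np.abs(X[:, t]) ** 2
        pyy = lam * pyy + (1 - lam) * np.abs(Y[:, t]) ** 2
        pxy = lam * pxy + (1 - lam) * X[:, t] * np.conj(Y[:, t])
        msc[:, t] = np.abs(pxy) ** 2 / (pxx * pyy)
    return np.clip(msc, 0.0, 1.0)

def beta_logpdf(x, a, b):
    """Log-density of a Beta(a, b) distribution evaluated at x in (0, 1)."""
    log_norm = lgamma(a) + lgamma(b) - lgamma(a + b)
    return (a - 1) * np.log(x) + (b - 1) * np.log1p(-x) - log_norm

def echo_posterior(msc, prior_echo=0.5, echo_ab=(5.0, 1.5), clean_ab=(1.5, 5.0)):
    """Posterior probability of echo presence under a fixed two-component Beta mixture."""
    x = np.clip(msc, 1e-6, 1 - 1e-6)      # keep the logarithms finite
    log_e = np.log(prior_echo) + beta_logpdf(x, *echo_ab)
    log_c = np.log1p(-prior_echo) + beta_logpdf(x, *clean_ab)
    return 1.0 / (1.0 + np.exp(log_c - log_e))

rng = np.random.default_rng(0)
K, T = 64, 200
ref = rng.standard_normal((K, T)) + 1j * rng.standard_normal((K, T))
other = rng.standard_normal((K, T)) + 1j * rng.standard_normal((K, T))

msc_echo = recursive_msc(ref, 0.5 * ref)   # microphone contains only scaled echo
msc_indep = recursive_msc(ref, other)      # microphone unrelated to the reference
```

In this toy setting the MSC stays close to one when the microphone signal is a scaled copy of the reference and settles at small values when the two signals are independent, so the posterior separates the two cases.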
2. DATA MODEL
Figure 1 depicts an example of the investigated scenario, where a
near-end speaker has a conversation with one or multiple partici-
pants in the far-end room. The model can be easily generalized to
the case where multiple loudspeakers are used in the near-end room
as investigated in Section 4. However, for simplicity we consider
only one loudspeaker signal in our derivations of the proposed ap-
proach. Because we are investigating the problem of acoustic echo
clustering and suppression in the frequency domain, we consider
the short-time Fourier transform (STFT) domain representation of
the data model, which is expressed at frame t and frequency bin
$k = 1, \ldots, K$, as
$$y(k, t) = x_1(k, t) + x_2(k, t) + v(k, t). \qquad (1)$$
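The additive model in (1) can be checked numerically with a small sketch. It assumes that $x_1$ stands for the echo, $x_2$ for the near-end speech (as the Figure 1 caption suggests), and $v$ for noise, and that $y(k, t)$ stacks the $M$ microphone signals. The `stft` helper, the number of microphones, and the random stand-in signals are hypothetical. The last line forms a unit-norm version of the microphone vector as one plausible reading of the "normalized recording vector"; the exact normalization used by the authors is not given in this part of the paper.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Minimal Hann-windowed STFT; returns an array of shape (bins, frames)."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

rng = np.random.default_rng(1)
M, N = 3, 4096                             # hypothetical microphone count and length
x1 = rng.standard_normal((M, N))           # stand-in for the echo at each microphone
x2 = rng.standard_normal((M, N))           # stand-in for the near-end speech
v = 0.01 * rng.standard_normal((M, N))     # stand-in for additive noise
y = x1 + x2 + v                            # time-domain microphone signals

def stft_all(sig):
    """Apply the STFT to every microphone channel; shape (M, bins, frames)."""
    return np.stack([stft(sig[m]) for m in range(sig.shape[0])])

Y, X1, X2, V = stft_all(y), stft_all(x1), stft_all(x2), stft_all(v)

# Unit-norm microphone vector at every (k, t) slot, taken across the M channels.
Y_norm = Y / np.linalg.norm(Y, axis=0, keepdims=True)
```

Because the STFT is linear, `Y` equals `X1 + X2 + V` at every time-frequency slot, which is the content of (1).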