2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 20-23, 2013, New Paltz, NY

A PROBABILISTIC APPROACH TO ACOUSTIC ECHO CLUSTERING AND SUPPRESSION

Mehrez Souden, Jason Wung, and Biing-Hwang (Fred) Juang

Center for Signal and Image Processing, Georgia Institute of Technology, 75 Fifth Street NW, Atlanta, GA 30308, USA

ABSTRACT

This paper introduces an approach to cluster and suppress acoustic echo signals in hands-free, full-duplex speech communication systems. We employ the instantaneous recursive estimate of the magnitude squared coherence (MSC) between the echo line signal and the microphone signal, and model it with a two-component Beta mixture distribution. Because we consider multiple-microphone pickup, we further integrate the normalized recording vector as a location feature to achieve reliable soft decisions on echo presence. Location information has been widely used for clustering-based blind source separation and can be modeled using a Watson mixture distribution. Simulation evaluations show that the proposed method achieves significant echo suppression.

Index Terms: Acoustic echo suppression, clustering, magnitude squared coherence, normalized recording vector.

1. INTRODUCTION

The acoustic coupling between loudspeakers and microphones is one of the major issues impeding the widespread use of hands-free, full-duplex speech communication systems. To cope with this issue, several acoustic echo cancellation (AEC) techniques have been proposed [1,2]. In many existing solutions, the goal is to estimate a finite impulse response (FIR) filter that emulates the echo path in the time domain. The reference loudspeaker signal is then convolved with this filter, and the result is subtracted from the microphone signal.
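As a minimal illustration of this conventional AEC step, the following Python sketch filters a reference signal with an FIR echo-path estimate and subtracts the result from the microphone signal. The function names, the 3-tap echo path, and the toy signals are hypothetical and chosen only for illustration. The sketch also assumes the filter estimate is already available, whereas practical systems estimate it adaptively.

```python
def convolve_fir(h, x):
    """Causal FIR filtering: y[n] = sum_i h[i] * x[n - i], same length as x."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i, h_i in enumerate(h):
            if n - i >= 0:
                acc += h_i * x[n - i]
        y.append(acc)
    return y


def cancel_echo(mic, reference, h_hat):
    """Subtract the echo estimate (reference filtered by h_hat) from the mic signal."""
    echo_hat = convolve_fir(h_hat, reference)
    return [m - e for m, e in zip(mic, echo_hat)]


# Toy signals (hypothetical values, not taken from the paper).
reference = [1.0, 0.5, -0.25, 0.0, 0.75, -1.0]   # far-end loudspeaker signal
near_end = [0.1, -0.2, 0.0, 0.3, 0.0, -0.1]      # near-end speech
h_true = [0.6, 0.3, 0.1]                         # assumed 3-tap echo path

# Microphone signal = echo + near-end speech.
mic = [e + s for e, s in zip(convolve_fir(h_true, reference), near_end)]

# With a perfect echo-path estimate, only the near-end speech remains.
residual = cancel_echo(mic, reference, h_true)

# With a truncated estimate, residual echo is left in the output.
residual_bad = cancel_echo(mic, reference, [0.6, 0.0, 0.0])
```

The second call shows why a mismatched filter leaves residual echo, which is the case that residual echo suppression is meant to handle.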
Further suppression of the residual echo may be necessary and is conventionally handled by frequency-domain techniques such as spectral subtraction and Wiener filtering [3–5]. These traditional signal enhancement techniques, while shown to deliver a certain degree of performance, assume signal stationarity and normally do not take full advantage of the sparsity of the speech signal, such as its harmonic structure and fractional duty cycle [6]. Newer techniques that consider these speech-specific characteristics have been demonstrated to be effective in other applications, including clustering-based blind source separation (BSS) [6–10].

This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC).

Figure 1: A block diagram of the proposed acoustic echo suppression system. The EM algorithm jointly employs the MSC and normalized microphone signals (location feature) to determine the near-end posterior probability. $\hat{X}_{12}(t), \ldots, \hat{X}_{M2}(t)$ are the final estimates of the echo-free near-end signal at the microphones.

In this paper, we approach the problem of acoustic echo suppression in the frequency domain from a clustering-based viewpoint. Specifically, we employ instantaneous estimates of the magnitude squared coherence (MSC) between the reference signal and the microphone signals as a feature to detect the presence of echo at individual time-frequency (t-f) slots. The MSC was successfully used for double-talk detection by averaging it over multiple frequency bins and comparing the result to a threshold [11]. In [12, 13], it was modeled using a bimodal Gaussian distribution to determine soft double-talk decisions when combined with echo suppression filters. Since the MSC assumes values in the range [0, 1], a better choice of statistical model would be the Beta distribution.
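To make the MSC feature concrete, the following Python sketch tracks a recursive MSC estimate for a single frequency bin across frames. It assumes first-order recursive smoothing of the auto- and cross-spectra with a forgetting factor, which is a common choice but is not confirmed here as the exact estimator used in this paper. The function name, the forgetting factor of 0.9, and the synthetic data are hypothetical.

```python
import random


def recursive_msc(ref_bins, mic_bins, forgetting=0.9, eps=1e-12):
    """Recursive MSC estimate for one frequency bin across frames.

    ref_bins, mic_bins: sequences of complex STFT coefficients (one per frame).
    Assumes first-order recursive smoothing of the auto- and cross-spectra.
    """
    phi_xx = phi_yy = 0.0
    phi_xy = 0j
    msc = []
    for x, y in zip(ref_bins, mic_bins):
        phi_xx = forgetting * phi_xx + (1 - forgetting) * abs(x) ** 2
        phi_yy = forgetting * phi_yy + (1 - forgetting) * abs(y) ** 2
        phi_xy = forgetting * phi_xy + (1 - forgetting) * x * y.conjugate()
        value = abs(phi_xy) ** 2 / (phi_xx * phi_yy + eps)
        msc.append(min(1.0, value))  # guard against rounding slightly above 1
    return msc


# Synthetic data for one bin (hypothetical, for illustration only).
rng = random.Random(0)
ref = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]

# Echo-only case: the microphone bin is a scaled, phase-shifted reference.
echo_only = [0.5j * x for x in ref]

# Near-end-only case: the microphone bin is independent of the reference.
near_only = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]

msc_echo = recursive_msc(ref, echo_only)
msc_near = recursive_msc(ref, near_only)
```

In this sketch the estimate stays close to 1 when the microphone bin is a filtered copy of the reference and drops toward 0 when the two are independent, which is the behavior that makes the MSC usable as an echo-presence feature.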
Therefore, we propose a Beta mixture model with two components that account for the absence and presence of echo in the microphone signals, respectively. We then combine this model with the location information, which can be captured when multiple microphones are employed. Earlier investigations, e.g., [6–8,10], have shown that the latter cue is extremely useful in clustering-based BSS. We integrate both features in an expectation maximization (EM) algorithm and determine an echo suppression mask that is optimal in the maximum likelihood sense.

2. DATA MODEL

Figure 1 depicts an example of the investigated scenario, where a near-end speaker converses with one or more participants in the far-end room. The model can be easily generalized to the case where multiple loudspeakers are used in the near-end room, as investigated in Section 4. For simplicity, however, we consider only one loudspeaker signal in our derivations. Because we investigate acoustic echo clustering and suppression in the frequency domain, we consider the short-time Fourier transform (STFT) domain representation of the data model, which is expressed at frame t and frequency bin k = 1, ..., K as

$$y(k, t) = x_1(k, t) + x_2(k, t) + v(k, t). \tag{1}$$

978-1-4799-0972-8/13/$31.00 ©2013 IEEE
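The soft decision implied by the two-component Beta mixture proposed in the introduction can be sketched as a posterior computation on a single MSC value, as in the Python example below. This covers only an E-step on the MSC feature. The paper's EM algorithm also uses the location feature, which is not shown. The mixture weights, the Beta shape parameters, and the function names are hypothetical values picked for illustration, not estimates from the paper.

```python
import math


def beta_logpdf(x, a, b):
    """Log-density of a Beta(a, b) distribution at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)


def echo_posterior(msc, weights, params, eps=1e-6):
    """Posterior probability of the echo-present component given one MSC value.

    weights: (w_absent, w_present), mixture weights summing to 1.
    params:  ((a_absent, b_absent), (a_present, b_present)), Beta shapes.
    """
    x = min(max(msc, eps), 1 - eps)  # keep the logarithms finite at 0 and 1
    log_terms = [
        math.log(w) + beta_logpdf(x, a, b)
        for w, (a, b) in zip(weights, params)
    ]
    peak = max(log_terms)  # subtract the maximum for numerical stability
    likes = [math.exp(t - peak) for t in log_terms]
    return likes[1] / sum(likes)


# Hypothetical parameters: echo-absent mass near 0, echo-present mass near 1.
WEIGHTS = (0.5, 0.5)
PARAMS = ((1.5, 8.0), (8.0, 1.5))

p_high = echo_posterior(0.95, WEIGHTS, PARAMS)  # high coherence with reference
p_low = echo_posterior(0.05, WEIGHTS, PARAMS)   # low coherence with reference
p_mid = echo_posterior(0.50, WEIGHTS, PARAMS)   # ambiguous by symmetry
```

Under these assumed parameters, the near-end posterior for a t-f slot would be one minus the returned value. In a full EM iteration, the weights and shape parameters would be re-estimated from such posteriors.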