IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 50, NO. 8, AUGUST 2002 1855

A Frequency Domain Blind Signal Separation Method Based on Decorrelation

Daniël W. E. Schobben and Piet C. W. Sommen, Member, IEEE

Abstract—This paper addresses the issue of separating multiple speakers from mixtures of these that are obtained using multiple microphones in a room. An adaptive blind signal separation algorithm, which is entirely based on second-order statistics, is derived. One of the advantages of this algorithm is that no parameters need to be tuned. Moreover, an extension of the algorithm that can simultaneously deal with blind signal separation and echo cancellation is derived. Experiments with real recordings have been carried out, showing the effectiveness of the algorithm for real-world signals.

Index Terms—Audio applications, blind signal separation, echo cancellation, second-order statistics.

I. INTRODUCTION

Humans can focus their attention on any one sound source out of a mixture. This was termed the “cocktail party effect” by Cherry [1]. This ability is due to the relations between the signals that are picked up by the left and the right ear, e.g., interaural differences in time and intensity. It is difficult, however, to understand speech that is recorded at a cocktail party with only one microphone, even for people with perfect hearing capabilities. People with hearing impairments readily have these problems when present at a cocktail party. Current audio systems cannot discern one sound from another like humans can.

Blind signal separation (BSS) deals with the problem of recovering independent signals using only observed mixtures of these. These techniques are termed blind as the acoustic transfer functions from the sources to the microphones are unknown, and there are no reference signals against which the recovered source signals can be compared.
For acoustic applications, a convolutive separation algorithm is required, i.e., the separation consists of applying multichannel finite impulse response (MC-FIR) filtering to these signals. BSS algorithms have been successful in separating nonconvolutive mixtures of nonreal-world signals for over a decade. Successful BSS of nonconvolutively mixed, delayed mixed, and synthetically convolutively mixed audio signals was reported in [2]–[4], respectively. Only after 1995 were successful experiments reported with the comprehensive problem of separating signals that are recorded using microphones in a real-world environment [5]–[13]. Most of these experiments were done in controlled laboratory environments and require improvements to be successful in commercial applications. These techniques do not facilitate acoustic echo cancellation, which is also required in audio applications. This paper presents an algorithm that requires no parameter tuning and is suitable for joint BSS and acoustic echo cancellation. An overview of the early work on nonconvolutive and convolutive BSS can be found in [14] and [15], respectively. An overview of BSS approaches that are promising for audio applications is found in [16]–[18].

Manuscript received August 27, 1999; revised April 30, 2002. The associate editor coordinating the review of this paper and approving it for publication was Prof. Dr. Ir. Bart L. R. De Moor.
D. W. E. Schobben was with the Technische Universiteit Eindhoven, Eindhoven, The Netherlands. He is now with the Philips Research Laboratories, Eindhoven, The Netherlands (e-mail: Daniel.Schobben@Philips.com).
P. C. W. Sommen is with the Technische Universiteit Eindhoven, Eindhoven, The Netherlands (e-mail: P.C.W.Sommen@tue.nl).
Publisher Item Identifier 10.1109/TSP.2002.800417.

The BSS algorithm that is presented in this paper is based on output decorrelation.
Molgedey and Schuster minimized cross-correlations for two different time lags to achieve signal separation in the nonconvolutive case [19]. Multiple time lags are used for convolutive signal separation by other researchers [4], [8], [9], [11]–[13]. Output decorrelation uses only second-order statistics, which theoretically limits its separation capabilities as compared with higher order statistics (HOS). For real-world signals, however, second-order statistics can be sufficient to achieve BSS [19]–[21]. Second-order statistics have the advantage that they can be estimated more reliably and with less computational power than higher order statistics. In addition, HOS algorithms contain nonlinear elements that need to be tuned to the data to obtain good performance.

When applying filters to separate (or unmix) the sources, the cross-correlations of the outputs can be expressed in terms of the cross-correlations of the observed signals and the known unmixing system [12]. A cost function that is composed of these cross-correlations of the observations and the unmixing filters can be minimized using a gradient search. However, this is a difficult task: in audio signal separation, the unmixing filters typically have thousands of coefficients, and these must be estimated from a cost function that is a nonlinear function of the coefficients. Therefore, frequency-domain approaches have been used recently in which the signal separation is done more or less independently for each frequency [8], [9], [13].

After a description of the notation and the assumptions in Section II, the remainder of this paper is organized as follows. In Section III, the optimization criterion that will be minimized by the BSS algorithm is described. The optimization is done by minimizing the cross-correlations among the outputs of the MC-FIR separating filter.
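To make the idea of output decorrelation concrete, the following is a minimal sketch for the simpler nonconvolutive (instantaneous) case only. It is not the algorithm derived in this paper, and it is not claimed to reproduce the procedure of [19]; it merely finds a matrix that zeroes the output cross-correlations at two time lags, using whitening followed by an eigendecomposition. The names `lagged_cov` and `two_lag_unmix`, the symmetrized correlation estimate, the synthetic sources, and the mixing matrix `A` are all assumptions made for illustration.

```python
import numpy as np


def lagged_cov(x, lag):
    """Symmetrized sample cross-correlation matrix of the rows of x at `lag`."""
    n = x.shape[1] - lag
    r = x[:, :n] @ x[:, lag:lag + n].T / n
    return 0.5 * (r + r.T)


def two_lag_unmix(x, lag=1):
    """Return W such that W R(0) W^T = I and W R(lag) W^T is diagonal."""
    x = x - x.mean(axis=1, keepdims=True)
    r0 = lagged_cov(x, 0)
    r1 = lagged_cov(x, lag)
    # Whitening: V R(0) V^T = I, with V = D^{-1/2} E^T.
    d, e = np.linalg.eigh(r0)
    v = (e / np.sqrt(d)).T
    # An orthogonal rotation then diagonalizes the whitened lagged matrix.
    _, u = np.linalg.eigh(v @ r1 @ v.T)
    return u.T @ v


# Two synthetic sources with different spectra (illustrative data only).
rng = np.random.default_rng(0)
n = 20000
s = np.vstack([
    np.convolve(rng.standard_normal(n), [1.0, 0.9], mode="same"),  # colored
    rng.standard_normal(n),                                        # white
])
A = np.array([[1.0, 0.6], [0.4, 1.0]])  # hypothetical instantaneous mixing
x = A @ s

W = two_lag_unmix(x, lag=1)
xc = x - x.mean(axis=1, keepdims=True)
y = W @ xc

# Output cross-correlations at both lags.
r_y0 = lagged_cov(y, 0)
r_y1 = lagged_cov(y, 1)

# They follow from the input correlations and the known unmixing matrix.
r_y1_from_x = W @ lagged_cov(xc, 1) @ W.T

# Residual crosstalk in the overall system W A (smaller is better).
g = np.sort(np.abs(W @ A), axis=1)
crosstalk = (g[:, 0] / g[:, 1]).max()
```

In this sketch the off-diagonal entries of `r_y0` and `r_y1` vanish by construction, and `r_y1_from_x` illustrates that the output correlations are determined by the input correlations and the unmixing system. The convolutive case treated in the paper replaces the matrix `W` by MC-FIR filters and is considerably harder.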
To achieve a computationally inexpensive algorithm with fast convergence, this criterion is transformed to the frequency domain in Section IV. First, in Section IV-A, the filter coefficients are expressed in the frequency domain such that the cross-correlations become zero. At this point, no restrictions are imposed to ensure that the filter