A Multi-Stage, Multi-Channel Processing System for Overlapping Speech Separation in a Real Scenario

Rahil Mahdian Toroghi, Youssef Oualil, Dietrich Klakow
Spoken Language Systems, Saarland University, Saarbrücken, Germany
Email: {rahil.mahdian, youssef.oualil, dietrich.klakow}@lsv.uni-saarland.de
Web: www.lsv.uni-saarland.de

Abstract

This paper addresses the problem of overlapping speech separation in a noisy room using a microphone array. The presented approach proposes a multistage processing framework to separate the desired sources and reduce the corruptive effects of noise, reverberation, and interference. More specifically, 1) a beamformer separates the sources based on their location diversities, 2) a postfilter maximizes the output SNRs, 3) a novel filter is derived to suppress the coherent terms at each output with respect to its contrasting one, and 4) the clean signal is estimated using a modified masking filter. Exploiting the fact that a desired signal remains coherent within time frames, the mask is smoothed between frames to preserve this coherency and reduce the musical noise. Experiments on the AMI-Wall Street Journal corpus show a significant improvement in speech quality, SNR, Source to Reverberation Ratio, and naturalness of the proposed method, compared to some methods in Blind Source Separation.

1 Introduction

Separation of speech sources recorded in closed areas is an essential requirement for several applications, such as meeting recognition and automatic transcription of class notes. Speech separation is a hard problem, but it can be facilitated to some degree by the use of an array of microphones, especially when the geometry of the array is known a priori. Multiple recordings of the speech data enable us to denoise or dereverberate the signals of interest without distortion, at least theoretically [1].
Utilizing the fact that speakers are located at different positions in the room, spatial filtering (beamforming) can exploit this spatial information of the sources and extract higher-quality source signals from the corrupted input array data. In the presence of overlapping speakers, the conditions of the separation problem in a room environment become far more difficult to handle [2].

In this paper, following the line of thought of our previous work [3], we present a multi-step processing system that is able to cope with the three corrupting effects found in every noisy echoic environment, namely noise, reverberation, and interference. The contribution of this paper is three-fold: 1) a system structure that can be used in any echoic environment, along with the results that justify it; 2) the derivation of the model for a filter that suppresses the coherent terms from the signals; and 3) a modification of the binary mask that enables us to account for the signal correlations over neighboring frames, especially when the signal contained in these frames is due to a voiced phoneme.

The remaining part of the paper is organized as follows. The next section reviews the background theory of beamforming and postfiltering. Subsection 2.2 presents the problem formulation and justifies the processes used in the proposed structure. Subsection 2.3 reviews masking. Section 3 presents the experiments and is followed by the results and comparisons with some methods in Blind Source Separation (BSS).

[Figure 1: Block diagram of the multi-channel speech separation system for two sources in each frequency bin, used in this paper]

2 Structure of the System

The overall system structure to separate two sources in a room is depicted in Fig. 1.
This structure employs beamforming to extract the desired sources based on their unique geometrical positions, a postfilter to increase the level of SNR, and two further stages: 1) a stage that suppresses the portion of each output that is coherent with the contrasting output (both being emanated from the same sources), and 2) a masking stage that accounts for source presence and for temporal correlations across neighboring source frames. This structure can be utilized in real time, and only the masks calculated from the last frame need to be saved.

2.1 Background Theory

2.1.1 Beamforming

Beamforming (BF) aims at extracting the signal coming from the desired direction while suppressing the noise, reverberation, and interfering signals that enter the array from other directions. The differences in the positions of the sources cause different Time Differences of Arrival (TDOA) with respect to the microphones in the array, which is exploited in the BF design. Let us consider a plane wave approaching the array aperture from a direction

a = [cos θ sin φ   sin θ sin φ   cos φ]^T    (1)
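As a concrete illustration of how the direction vector in Eq. (1) enters the beamformer design, the following sketch computes the per-microphone TDOAs for a far-field plane wave and the corresponding narrowband steering vector used by a delay-and-sum beamformer. The array geometry, function names, and parameter values here are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def direction_vector(theta, phi):
    """Unit propagation direction of a plane wave, as in Eq. (1):
    a = [cos(theta) sin(phi), sin(theta) sin(phi), cos(phi)]^T."""
    return np.array([np.cos(theta) * np.sin(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(phi)])

def tdoas(mic_positions, a, c=343.0):
    """Far-field TDOA of each microphone relative to the array origin,
    i.e. the projection of each position onto the propagation direction,
    divided by the speed of sound c (m/s)."""
    return mic_positions @ a / c

def steering_vector(mic_positions, a, freq, c=343.0):
    """Narrowband steering vector at frequency `freq` (Hz): one phase
    term per microphone, compensating the plane-wave delays."""
    tau = tdoas(mic_positions, a, c)
    return np.exp(-2j * np.pi * freq * tau)

# Hypothetical example: 4-element uniform linear array on the x-axis,
# 5 cm spacing, wave arriving endfire (theta = 0, phi = pi/2 -> a = [1,0,0]).
mics = np.stack([np.array([m * 0.05, 0.0, 0.0]) for m in range(4)])
a = direction_vector(theta=0.0, phi=np.pi / 2)
d = steering_vector(mics, a, freq=1000.0)
# Delay-and-sum weights: w = d / M, so that w^H d = 1 (unit gain
# toward the steered direction).
w = d / len(mics)
```

Aligning the microphone signals with `w` reinforces the source at direction `a` while signals from other directions add incoherently, which is the spatial selectivity the section describes.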