SPEECH RECOGNITION IN MULTISOURCE REVERBERANT ENVIRONMENTS WITH BINAURAL INPUTS

Nicoleta Roman 1, Soundararajan Srinivasan 2 and DeLiang Wang 3

1 Department of Mathematics, Statistics and Computer Science, The Ohio State University at Lima, Lima, OH 45804, USA
2 Biomedical Engineering Center
3 Department of Computer Science and Engineering and Center for Cognitive Science, The Ohio State University, Columbus, OH 43210, USA
{niki, srinivso, dwang}@cse.ohio-state.edu

ABSTRACT

We present a binaural solution to robust speech recognition in multi-source reverberant environments. We employ the notion of an ideal time-frequency binary mask, which selects the target if it is stronger than the interference in a local time-frequency (T-F) unit. Our system estimates this ideal binary mask at the output of a target cancellation module implemented using adaptive filtering. The mask is used in conjunction with a missing-data algorithm to decode the target utterance. A systematic evaluation in terms of automatic speech recognition (ASR) performance shows substantial improvements over the baseline and better results than related two-microphone approaches.

1. INTRODUCTION

A typical auditory environment contains multiple concurrent sources that are also reflected by surfaces and may change their locations constantly. While human listeners are able to segregate and recognize a target signal under such adverse conditions, ASR remains a challenging problem [1]. ASR systems are typically trained on clean speech and therefore face a mismatch when tested in noisy and reverberant conditions. In this paper we address the problem of recognizing target speech from multi-source reverberant binaural recordings.

Microphone-array processing techniques that enhance the target speech have been employed to improve the robustness of ASR systems in noisy environments [2]. These techniques fall into two broad categories: beamforming and independent component analysis (ICA) [3].
To separate multiple sound sources, beamforming exploits their different directions of arrival, while ICA relies on their statistical independence. A fixed beamformer, such as the delay-and-sum beamformer, constructs a spatial beam that enhances signals arriving from the target direction independently of the interfering sources. However, a large number of microphones is required in order to impose a constant beam shape across frequencies [3]. Adaptive beamforming techniques, on the other hand, attempt to null out the interfering sources in the mixture [4], [5]. While an adaptive beamformer with two microphones is optimal for canceling a single directional interference, additional microphones are required as the number of noise sources increases. Similarly, the drawbacks of ICA techniques include the requirement that the number of microphones be greater than or equal to the number of sources, as well as poor performance in reverberant conditions [5]. Some recent sparse representations attempt to relax the former assumption, but their performance is limited [6]. While the above techniques enhance the target speech independently of the recognizer, Seltzer et al. optimize an adaptive filter based on recognition results [7].

Inspired by the robustness of the human auditory system, research in computational auditory scene analysis (CASA) has been devoted to building speech separation systems that incorporate known principles of auditory perception [8]. In particular, binaural CASA systems that utilize location information have shown very good recognition results in anechoic conditions. Reverberation, however, introduces potentially an infinite number of sources due to reflections from hard surfaces. As a result, the estimation of location cues in individual T-F units becomes unreliable and the performance of location-based segregation systems degrades. A notable exception is the binaural system proposed by Palomäki et al.
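To make the fixed-beamforming idea above concrete, the following sketch implements a frequency-domain delay-and-sum beamformer: each channel's propagation delay toward the target direction is compensated with a linear phase shift, and the aligned channels are averaged. The array geometry, function name, and far-field plane-wave assumption are illustrative, not taken from the paper.

```python
import numpy as np

def delay_and_sum(signals, mic_positions, target_dir, fs, c=343.0):
    """Delay-and-sum beamformer using frequency-domain fractional delays.

    signals:       (num_mics, num_samples) array of recorded channels
    mic_positions: (num_mics, 3) microphone coordinates in meters
    target_dir:    unit vector pointing from the array toward the target
    fs:            sampling rate in Hz
    c:             speed of sound in m/s
    """
    num_mics, n = signals.shape
    # Far-field assumption: a mic displaced along target_dir receives the
    # wavefront earlier by p.d/c seconds relative to the array origin.
    delays = mic_positions @ target_dir / c          # seconds, shape (num_mics,)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectra = np.fft.rfft(signals, axis=1)
    # Delay each channel by its advance (linear phase), then average; signals
    # from the target direction add coherently, others add incoherently.
    phase = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])
    aligned = spectra * phase
    return np.fft.irfft(aligned.mean(axis=0), n=n)
```

Because the beam is formed purely from geometry, no knowledge of the interfering sources is needed; the price, as noted above, is that the beam width varies with frequency unless many microphones are used.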
[9], which includes an inhibition mechanism that emphasizes the onset portions of the signal and groups them according to common location. The system shows improved speech recognition results across a range of reverberation times with a single interference.

From an information-processing perspective, the notion of an ideal T-F binary mask has been proposed as the computational goal of CASA [10]. Such a mask can be constructed from a priori knowledge of target and interference; specifically, a value of 1 in the mask indicates that the target is stronger than the interference within a particular T-F unit, and 0 indicates otherwise. Previously, we have proposed a binaural system that is capable of estimating the ideal binary mask under multi-source reverberant conditions [11] and reported results using a missing-data recognizer [12] trained on reverberant speech. Note that the missing-data recognizer treats the units labeled 1 in the mask as reliable data and the others as unreliable during recognition. To avoid using a different model for each reverberant condition, it is desirable to train the ASR on anechoic data. However, we find that the performance of the missing-data recognizer degrades considerably when the models are trained on anechoic speech. In this paper, we propose an alternative approach using a speech-prior-based spectrogram reconstruction technique [13]. In this technique, the target speech values in the unreliable T-F units are estimated by conditioning on the reliable ones. We observe that the reliable units in the mask correspond to regions in the spectrogram dominated by relatively clean target speech. Hence, the prior

I - 309    1-4244-0469-X/06/$20.00 ©2006 IEEE    ICASSP 2006
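When the premixed target and interference are available, the ideal binary mask defined above is straightforward to compute. The sketch below uses a magnitude-squared STFT as the T-F representation and a 0 dB local-SNR criterion; both are illustrative simplifications (binaural CASA systems of the kind discussed here typically use an auditory filterbank front end instead).

```python
import numpy as np

def ideal_binary_mask(target, interference, frame_len=256, hop=128, lc_db=0.0):
    """Ideal T-F binary mask: 1 where the target is stronger than the
    interference (local SNR above lc_db), 0 otherwise.

    An STFT power spectrogram stands in for an auditory filterbank;
    lc_db = 0 dB corresponds to "target stronger than interference".
    """
    def stft_power(x):
        window = np.hanning(frame_len)
        n_frames = 1 + (len(x) - frame_len) // hop
        frames = np.stack([x[i * hop:i * hop + frame_len] * window
                           for i in range(n_frames)])
        return np.abs(np.fft.rfft(frames, axis=1)) ** 2

    t_pow = stft_power(target)
    i_pow = stft_power(interference)
    # Small floor avoids log(0) in silent units.
    local_snr_db = 10.0 * np.log10((t_pow + 1e-12) / (i_pow + 1e-12))
    return (local_snr_db > lc_db).astype(np.uint8)
```

A missing-data recognizer then treats the units labeled 1 as reliable evidence and either marginalizes over the rest or, as in the approach taken here, reconstructs them.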
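The reconstruction step, estimating unreliable T-F units by conditioning on reliable ones, can be illustrated with a single-Gaussian speech prior over log-spectral frames. This is a simplification for illustration: the cluster-based technique of [13] uses a mixture prior and additionally exploits the observed mixture energy as an upper bound on the target in unreliable units, which is omitted here.

```python
import numpy as np

def reconstruct_frame(x, reliable, mu, cov):
    """Fill unreliable spectral values with their conditional mean under a
    Gaussian prior N(mu, cov), given the reliable values.

    x:        observed log-spectral frame, shape (d,)
    reliable: boolean array, True where the binary mask labeled the unit 1
    mu, cov:  prior mean (d,) and covariance (d, d) of clean-speech frames
    """
    r = np.flatnonzero(reliable)
    u = np.flatnonzero(~reliable)
    out = x.copy()
    if len(u) == 0:
        return out                       # nothing to reconstruct
    if len(r) == 0:
        out[u] = mu[u]                   # nothing to condition on: prior mean
        return out
    # Conditional mean of x_u given x_r for a jointly Gaussian frame:
    #   mu_u + Cov_ur Cov_rr^{-1} (x_r - mu_r)
    cov_rr = cov[np.ix_(r, r)]
    cov_ur = cov[np.ix_(u, r)]
    out[u] = mu[u] + cov_ur @ np.linalg.solve(cov_rr, x[r] - mu[r])
    return out
```

The reconstructed spectrogram is complete, so a conventional recognizer trained on clean (anechoic) speech can decode it directly, which is what motivates this route over retraining per reverberant condition.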