Stereo-input Speech Recognition using Sparseness-based Time-frequency Masking in a Reverberant Environment

Yosuke Izumi 1, Kenta Nishiki 1∗, Shinji Watanabe 2, Takuya Nishimoto 1, Nobutaka Ono 1, Shigeki Sagayama 1

1 Department of Information Physics and Computing, University of Tokyo, 7-3-1, Hongo, Bunkyo-ku, Tokyo, Japan.
2 NTT Communication Science Laboratories †, 2-4 Hikaridai Seikacho, Soraku-gun, Kyoto, Japan.
{izumi, nishiki, nishi, onono, sagayama}@hil.t.u-tokyo.ac.jp, watanabe@cslab.kecl.ntt.co.jp

Abstract

We present noise-robust automatic speech recognition (ASR) using a sparseness-based underdetermined blind source separation (BSS) technique. As a representative underdetermined BSS method, we use time-frequency masking in this paper. Although time-frequency masking can separate target speech from interference effectively, two problems must be considered. One is that masking does not work well in noisy or reverberant environments. The other is that masking itself may distort the target speech. For the former, we apply our time-frequency masking method [7], which can separate the target signal robustly even in noisy and reverberant environments. For the latter, we investigate the distortion caused by time-frequency masking and reveal the following facts through experiments: 1) a soft mask is better than a binary mask in terms of recognition performance, and 2) cepstral mean normalization (CMN) reduces the distortion, especially that caused by a soft mask. Finally, we evaluate the recognition performance of our method in a noisy and reverberant real environment.

Index Terms: time-frequency mask, speech sparseness, blind source separation, stereo-input, robust ASR

1. Introduction

Noise robustness is a very significant aspect of automatic speech recognition (ASR), because recognition performance degrades severely due to the noise that unavoidably exists in our living space.
Several simple and effective techniques for suppressing stationary noise, e.g., spectral subtraction (SS) for additive noise [1, 2] and cepstral mean normalization (CMN) for channel distortion [3], have been developed so far. Recently, many speech enhancement methods using microphone arrays have been proposed as front-ends to improve the robustness of ASR against nonstationary noise [4, 5]. In particular, stereo-input ASR is becoming a promising approach, since existing devices such as ordinary PCs and IC recorders have a two-channel input. Therefore, this paper focuses on the development of noise-robust ASR techniques using two-channel input devices.

In a real environment, the target speech is often accompanied by interference, and the locations of the target and the interfering sources are usually unknown. In addition, the number of sound sources may exceed the number of microphones in the two-channel scenario. Therefore, we must deal with an underdetermined blind source separation (BSS) problem. BSS is defined as the problem of separating multiple source signals from mixtures without any prior information about the mixing process. One can separate the sources using an estimated inverse of the mixing matrix if the number of sources is equal to or less than the number of mixtures. However, when the sources outnumber the mixtures, i.e., in the underdetermined case, one cannot separate them in this way even if the mixing matrix is estimated appropriately. In that sense, underdetermined BSS is a hard problem, but it matches our scenario.

∗ Current affiliation: NTT Information Sharing Platform Laboratories, 3-9-11 Midori-cho, Musashino-city, Tokyo, Japan.

Time-frequency masking based on speech sparseness is an effective approach to underdetermined BSS. Here, sparseness means the property of speech that its energy is concentrated in a small area of the time-frequency plane.
Most masking methods assume that the individual sources do not overlap in the time-frequency domain and obtain the target signal by multiplying the observation by an appropriate mask. Time-frequency masks can be classified into two types: 1) a binary mask, which takes a value of 0 or 1 at each time-frequency bin, and 2) a soft mask, which takes a continuous value in [0, 1] at each time-frequency bin. The cue for designing masks is the time delay between the two-channel observed signals. However, this cue is disturbed by background noise and reverberation, because they are not sparse and come from various directions. Additionally, time-frequency masking itself may distort the target speech signal from the viewpoint of ASR.

Although it is well known that time-frequency masking causes a distortion called musical noise, we show in this paper that it is very effective at suppressing interference as the front-end of an ASR system. We have developed a time-frequency masking method based on maximum likelihood estimation, which has good separation performance in terms of SNR. We show that our soft masking method not only separates the target signal robustly but also introduces less distortion for the ASR system than the binary masking method. Moreover, the remaining distortion can be further reduced by CMN, and consequently recognition performance improves considerably. To show the effectiveness of our method, we performed a detailed investigation through connected digit recognition tasks under reverberant conditions in both simulated and real environments. The investigation includes a comparison of conventional binary masking, our binary masking, and our soft masking, and an examination of the effect of CMN.

Copyright © 2009 ISCA 6 - 10 September, Brighton UK 1955
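As an illustration of the two mask types described in the introduction, the following is a minimal sketch of delay-based binary and soft masking on synthetic data. It is not the maximum likelihood method of [7]. The function names, the Gaussian weighting used for the soft mask, the value of `sigma`, the source delays, and the perfectly sparse toy sources are all assumptions made for this example.

```python
import numpy as np

def estimate_delay(x1, x2, omega):
    """Per-bin inter-channel delay (in samples) from the phase difference."""
    return -np.angle(x2 * np.conj(x1)) / omega

def binary_mask(delay, d_target, d_interf):
    """1 where the observed delay is closer to the target delay, else 0."""
    return (np.abs(delay - d_target) < np.abs(delay - d_interf)).astype(float)

def soft_mask(delay, d_target, d_interf, sigma=0.2):
    """Value in [0, 1]: relative Gaussian weight of the target delay (assumed form)."""
    w_t = np.exp(-0.5 * ((delay - d_target) / sigma) ** 2)
    w_i = np.exp(-0.5 * ((delay - d_interf) / sigma) ** 2)
    return w_t / (w_t + w_i + 1e-12)

# Toy time-frequency data: each bin is occupied by exactly one source,
# i.e. the sparseness (non-overlap) assumption holds perfectly.
rng = np.random.default_rng(0)
n_frames, n_bins = 50, 64
omega = np.linspace(0.1, 3.0, n_bins)      # rad/sample; omega * |delay| < pi avoids phase wrapping
d_target, d_interf = 0.5, -0.5             # hypothetical inter-channel delays in samples

active = rng.random((n_frames, n_bins)) < 0.5
g1 = rng.standard_normal((n_frames, n_bins)) + 1j * rng.standard_normal((n_frames, n_bins))
g2 = rng.standard_normal((n_frames, n_bins)) + 1j * rng.standard_normal((n_frames, n_bins))
s_target = g1 * active                     # target occupies the "active" bins
s_interf = g2 * ~active                    # interference occupies the remaining bins

# Two-channel observation: channel 2 sees each source with its own delay.
x1 = s_target + s_interf
x2 = s_target * np.exp(-1j * omega * d_target) + s_interf * np.exp(-1j * omega * d_interf)

delay = estimate_delay(x1, x2, omega)
m_bin = binary_mask(delay, d_target, d_interf)
m_soft = soft_mask(delay, d_target, d_interf)

# Separation is a per-bin multiplication of the observation by the mask.
y_bin = m_bin * x1
y_soft = m_soft * x1
```

On this idealized data the delay cue is exact in every bin, so both masks select the target bins cleanly. When noise or reverberation perturbs the estimated delay, a binary mask must still commit to 0 or 1 in each bin, whereas a soft mask of this kind assigns intermediate weights to ambiguous bins.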