STEREO-BASED STOCHASTIC MAPPING FOR ROBUST SPEECH RECOGNITION
Mohamed Afify, Xiaodong Cui, and Yuqing Gao
IBM T.J. Watson Research Center
1101 Old Kitchawan Road, Yorktown Heights, NY, 10598
ABSTRACT
We present a stochastic mapping technique for robust speech recog-
nition that uses stereo data. The idea is based on building a GMM
for the joint distribution of the clean and noisy channels during
training and using an iterative compensation algorithm during test-
ing. The proposed mapping was also interpreted as a mixture of
linear transforms that are estimated in a special way using stereo
data. The proposed method results in 28% relative improvement
in string error rate (SER) for digit recognition in the car, and in
about 10% relative improvement in word error rate (WER), when
applied in conjunction with multi-style training (MST), for large
vocabulary English speech recognition.
Index Terms: Noise robustness, speech recognition, non-linear
mapping, stereo data.
1. INTRODUCTION
Building speech recognition systems that are robust to environ-
mental changes is important, especially when these systems are to
be deployed in the field. In this paper we introduce a stochastic
mapping algorithm that is built using stereo data, i.e., data that
consists of simultaneous recordings of both the clean and noisy
speech. We will refer to this mapping as stereo-based stochas-
tic mapping (SSM). While it is generally difficult to obtain stereo
data, it can be relatively easy to collect for certain scenarios, e.g.
speech recognition in the car. In some other applications of speech
recognition, e.g. our recent work on a speech-to-speech transla-
tion system [2], all we have available is a set of noise samples of
mismatch situations that may be encountered in field deployment
of the system. In these cases, stereo data can also be easily
generated by adding the example noise sources to the existing
“clean” training data.
The basic idea of the algorithm is to stack both the clean and
noisy channels to form a large augmented space and to build a
statistical model in this new space. We use a Gaussian mixture
model (GMM) in this work. During testing, both the observed
noisy speech and the augmented statistical model are used to pre-
dict the clean speech. This can be viewed as some form of non-
linear mapping between the noisy and clean feature spaces that is
learned by the GMM. We point out the relationship between the
proposed mapping method and the SPLICE algorithm, which also uses
stereo data [1]. In addition, we show that the mapping effectively
results in a mixture of linear feature space transforms commonly
known as FMLLR [5]. This is similar in spirit to some recently
proposed mixtures of linear transforms, as in [3, 6]. All these
linear transform mixtures, including the proposed method, differ in
their details. The resulting mapping can be used on its own, as a
front-end to a clean speech model, and also in conjunction with
multistyle training (MST). Both scenarios will be discussed in the
paper.
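The stacking step described above can be sketched as follows; the array
names, dimensions, and synthetic data here are our own illustration
(not from the paper), assuming an M-dimensional cepstral front-end:

```python
import numpy as np

# Hypothetical stereo corpus: N frames of M-dimensional features,
# recorded simultaneously over the clean and noisy channels.
N, M = 1000, 13
rng = np.random.default_rng(0)
x_clean = rng.standard_normal((N, M))                   # clean channel
y_noisy = x_clean + 0.3 * rng.standard_normal((N, M))   # noisy channel

# Stack the two channels frame by frame to form the augmented
# vectors z_i = (x_i, y_i) on which the joint GMM is trained.
z = np.concatenate([x_clean, y_noisy], axis=1)          # shape (N, 2*M)

assert z.shape == (N, 2 * M)
```

The same construction extends to multiple noisy context frames per
clean frame by concatenating more columns before fitting the model.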
The paper is organized as follows. We formulate the compensation
algorithm in Section 2. Experimental results are given in Section 3.
We first test several variants of the algorithm and compare them to
SPLICE for digit recognition in the car environment.
Then we give results when the algorithm is applied in conjunc-
tion with multistyle training (MST) for large vocabulary English
speech recognition. In both cases the proposed technique shows
significant gain over the baseline. Finally, we summarize our
findings in Section 4.
2. ALGORITHM FORMULATION
Assume we have a set of stereo data {(xi ,yi )}, where x is the
clean (matched) feature representation of speech, and y is the cor-
responding noisy (mismatched) feature representation. Let N be
the number of these feature vectors, i.e., 1 ≤ i ≤ N. Each vector
is M-dimensional, corresponding to any reasonable parametrization
of the speech, e.g., cepstrum coefficients. In a direct extension,
y can be viewed as a concatenation of several
noisy speech vectors that are used to predict the clean speech.
Define z ≡ (x, y) as the concatenation of the two channels. The
first step in constructing the mapping is training the joint
probability model p(z). We use Gaussian mixtures for this purpose,
and hence write

    p(z) = Σ_{k=1}^{K} c_k N(z; μ_{z,k}, Σ_{zz,k})            (1)
where K is the number of mixture components, and c_k, μ_{z,k}, and
Σ_{zz,k} are the mixture weight, mean, and covariance of the k-th
component, respectively. In the most general case, where Ln noisy
vectors are used to predict Lc clean vectors and the original
parameter space is M-dimensional, z will be of size M(Lc + Ln);
accordingly, the mean μ_{z,k} will be of dimension M(Lc + Ln) and
the covariance Σ_{zz,k} will be of size M(Lc + Ln) × M(Lc + Ln).
Also both the mean and covariance can be partitioned as

    μ_{z,k} = [ μ_{x,k} ]
              [ μ_{y,k} ]                                      (2)

    Σ_{zz,k} = [ Σ_{xx,k}  Σ_{xy,k} ]
               [ Σ_{yx,k}  Σ_{yy,k} ]                          (3)

where subscripts x and y indicate the clean and noisy speech,
respectively.
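The block partitioning in Equations (2) and (3) amounts to simple
slicing of each component's parameters. A minimal sketch, with
illustrative sizes Lc = Ln = 1 and variable names of our own choosing:

```python
import numpy as np

M, Lc, Ln = 13, 1, 1        # feature dim, clean and noisy context sizes
d_x, d_y = M * Lc, M * Ln   # sizes of the clean and noisy sub-vectors

# A single component's mean and covariance in the augmented space
# (random placeholders standing in for EM-trained parameters).
rng = np.random.default_rng(1)
mu_z = rng.standard_normal(d_x + d_y)
A = rng.standard_normal((d_x + d_y, d_x + d_y))
sigma_zz = A @ A.T          # symmetric positive semi-definite

# Partition as in Equations (2) and (3).
mu_x, mu_y = mu_z[:d_x], mu_z[d_x:]
sigma_xx = sigma_zz[:d_x, :d_x]
sigma_xy = sigma_zz[:d_x, d_x:]
sigma_yx = sigma_zz[d_x:, :d_x]
sigma_yy = sigma_zz[d_x:, d_x:]

# For a symmetric covariance the cross blocks are mutual transposes.
assert np.allclose(sigma_xy, sigma_yx.T)
```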
The mixture model in Equation (1) can be estimated in a classical
way using the expectation-maximization (EM) algorithm. Once
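The EM estimation of the joint GMM in Equation (1) can be illustrated
with an off-the-shelf implementation; we use scikit-learn's
GaussianMixture purely as a sketch (the library choice and all data
are our own assumptions, not the paper's):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical stereo data: N stacked vectors z_i = (x_i, y_i).
N, M = 500, 4
rng = np.random.default_rng(2)
x = rng.standard_normal((N, M))
y = x + 0.2 * rng.standard_normal((N, M))
z = np.concatenate([x, y], axis=1)

# Fit p(z) as a K-component full-covariance GMM via EM.
K = 2
gmm = GaussianMixture(n_components=K, covariance_type="full",
                      random_state=0).fit(z)

# The learned parameters correspond to c_k, mu_{z,k}, Sigma_{zz,k}.
assert gmm.weights_.shape == (K,)
assert gmm.means_.shape == (K, 2 * M)
assert gmm.covariances_.shape == (K, 2 * M, 2 * M)
```

Full covariances are needed here because the cross-channel blocks
Σ_{xy,k} carry the information used to predict the clean speech.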
IV 377 1424407281/07/$20.00 ©2007 IEEE ICASSP 2007