STEREO-BASED STOCHASTIC MAPPING FOR ROBUST SPEECH RECOGNITION
Mohamed Afify, Xiaodong Cui, and Yuqing Gao
IBM T.J. Watson Research Center
1101 Old Kitchawan Road, Yorktown Heights, NY, 10598
ABSTRACT
We present a stochastic mapping technique for robust speech recog-
nition that uses stereo data. The idea is based on building a GMM
for the joint distribution of the clean and noisy channels during
training and using an iterative compensation algorithm during test-
ing. The proposed mapping was also interpreted as a mixture of
linear transforms that are estimated in a special way using stereo
data. The proposed method results in 28% relative improvement
in string error rate (SER) for digit recognition in the car, and in
about 10% relative improvement in word error rate (WER), when
applied in conjunction with multi-style training (MST), for large
vocabulary English speech recognition.
Index Terms: Noise robustness, speech recognition, non-linear
mapping, stereo data.
1. INTRODUCTION
Building speech recognition systems that are robust to environ-
mental changes is important, especially when these systems are to
be deployed in the field. In this paper we introduce a stochastic
mapping algorithm that is built using stereo data, i.e., data that
consists of simultaneous recordings of both the clean and noisy
speech. We will refer to this mapping as stereo-based stochas-
tic mapping (SSM). While it is generally difficult to obtain stereo
data, it can be relatively easy to collect for certain scenarios, e.g.
speech recognition in the car. In some other applications of speech
recognition, e.g. our recent work on a speech-to-speech transla-
tion system [2], all we have available is a set of noise samples of
mismatch situations that may be encountered in field deployment
of the system. In these cases, stereo data can also be easily
generated by adding the example noise sources to the existing
“clean” training data.
The basic idea of the algorithm is to stack both the clean and
noisy channels to form a large augmented space and to build a
statistical model in this new space. We use a Gaussian mixture
model (GMM) in this work. During testing, both the observed
noisy speech and the augmented statistical model are used to pre-
dict the clean speech. This can be viewed as some form of non-
linear mapping between the noisy and clean feature spaces that is
learned by the GMM. We point out the relationship between the
proposed mapping method and the SPLICE algorithm, which also uses
stereo data [1]. In addition, we show that the mapping effectively
results in a mixture of linear feature space transforms commonly
known as FMLLR [5]. This is similar in spirit to some recently
proposed mixtures of linear transforms, as in [3, 6]. All these
linear transform mixtures, including the proposed method, differ in
their details. The resulting mapping can be used on its own, as a
front-end to a clean speech model, and also in conjunction with
multistyle training (MST). Both scenarios will be discussed in the
paper.
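The stacking step described above can be sketched as follows; the array
names, dimensions, and synthetic data here are our own illustration
(not from the paper), assuming an M-dimensional cepstral front-end:

```python
import numpy as np

# Hypothetical stereo corpus: N frames of M-dimensional features,
# recorded simultaneously over the clean and noisy channels.
N, M = 1000, 13
rng = np.random.default_rng(0)
x_clean = rng.standard_normal((N, M))                   # clean channel
y_noisy = x_clean + 0.3 * rng.standard_normal((N, M))   # noisy channel

# Stack the two channels frame by frame to form the augmented
# vectors z_i = (x_i, y_i) on which the joint GMM is trained.
z = np.concatenate([x_clean, y_noisy], axis=1)          # shape (N, 2*M)

assert z.shape == (N, 2 * M)
```

The same construction extends to multiple noisy context frames per
clean frame by concatenating more columns before fitting the model.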
The paper is organized as follows. We formulate the compensation
algorithm in Section 2. Experimental results are given in Section 3.
We first test several variants of the algorithm and compare them to
SPLICE for digit recognition in the car environment.
Then we give results when the algorithm is applied in conjunc-
tion with multistyle training (MST) for large vocabulary English
speech recognition. In both cases the proposed technique shows
significant gain over the baseline. Finally, we summarize our
findings in Section 4.
2. ALGORITHM FORMULATION
Assume we have a set of stereo data {(xi ,yi )}, where x is the
clean (matched) feature representation of speech, and y is the cor-
responding noisy (mismatched) feature representation. Let N be
the number of these feature vectors, i.e., 1 ≤ i ≤ N. Each vector
is M-dimensional, corresponding to any reasonable parametrization
of the speech, e.g., cepstrum coefficients. In a direct extension,
y can be viewed as a concatenation of several
noisy speech vectors that are used to predict the clean speech.
Define z ≡ (x, y) as the concatenation of the two channels. The
first step in constructing the mapping is training the joint
probability model p(z). We use Gaussian mixtures for this purpose,
and hence write

    p(z) = Σ_{k=1}^{K} c_k N(z; μ_{z,k}, Σ_{zz,k})            (1)
where K is the number of mixture components, and c_k, μ_{z,k}, and
Σ_{zz,k} are the mixture weight, mean, and covariance of the k-th
component, respectively. In the most general case, where Ln noisy
vectors are used to predict Lc clean vectors and the original
parameter space is M-dimensional, z will be of size M(Lc + Ln);
accordingly, the mean μ_{z,k} will be of dimension M(Lc + Ln) and
the covariance Σ_{zz,k} will be of size M(Lc + Ln) × M(Lc + Ln).
Also both the mean and covariance can be partitioned as

    μ_{z,k} = [ μ_{x,k} ]
              [ μ_{y,k} ]                                      (2)

    Σ_{zz,k} = [ Σ_{xx,k}  Σ_{xy,k} ]
               [ Σ_{yx,k}  Σ_{yy,k} ]                          (3)

where subscripts x and y indicate the clean and noisy speech,
respectively.
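The block partitioning in Equations (2) and (3) amounts to simple
slicing of each component's parameters. A minimal sketch, with
illustrative sizes Lc = Ln = 1 and variable names of our own choosing:

```python
import numpy as np

M, Lc, Ln = 13, 1, 1        # feature dim, clean and noisy context sizes
d_x, d_y = M * Lc, M * Ln   # sizes of the clean and noisy sub-vectors

# A single component's mean and covariance in the augmented space
# (random placeholders standing in for EM-trained parameters).
rng = np.random.default_rng(1)
mu_z = rng.standard_normal(d_x + d_y)
A = rng.standard_normal((d_x + d_y, d_x + d_y))
sigma_zz = A @ A.T          # symmetric positive semi-definite

# Partition as in Equations (2) and (3).
mu_x, mu_y = mu_z[:d_x], mu_z[d_x:]
sigma_xx = sigma_zz[:d_x, :d_x]
sigma_xy = sigma_zz[:d_x, d_x:]
sigma_yx = sigma_zz[d_x:, :d_x]
sigma_yy = sigma_zz[d_x:, d_x:]

# For a symmetric covariance the cross blocks are mutual transposes.
assert np.allclose(sigma_xy, sigma_yx.T)
```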
The mixture model in Equation (1) can be estimated in a classical
way using the expectation-maximization (EM) algorithm. Once
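The EM estimation of the joint GMM in Equation (1) can be illustrated
with an off-the-shelf implementation; we use scikit-learn's
GaussianMixture purely as a sketch (the library choice and all data
are our own assumptions, not the paper's):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical stereo data: N stacked vectors z_i = (x_i, y_i).
N, M = 500, 4
rng = np.random.default_rng(2)
x = rng.standard_normal((N, M))
y = x + 0.2 * rng.standard_normal((N, M))
z = np.concatenate([x, y], axis=1)

# Fit p(z) as a K-component full-covariance GMM via EM.
K = 2
gmm = GaussianMixture(n_components=K, covariance_type="full",
                      random_state=0).fit(z)

# The learned parameters correspond to c_k, mu_{z,k}, Sigma_{zz,k}.
assert gmm.weights_.shape == (K,)
assert gmm.means_.shape == (K, 2 * M)
assert gmm.covariances_.shape == (K, 2 * M, 2 * M)
```

Full covariances are needed here because the cross-channel blocks
Σ_{xy,k} carry the information used to predict the clean speech.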
IV 377 1424407281/07/$20.00 ©2007 IEEE ICASSP 2007