DISCRIMINATIVE FEATURE TRANSFORMS USING DIFFERENCED MAXIMUM MUTUAL
INFORMATION
Marc Delcroix, Atsunori Ogawa, Shinji Watanabe*, Tomohiro Nakatani, Atsushi Nakamura
NTT Communication Science Laboratories, NTT corporation,
2-4, Hikaridai, Seika-cho (Keihanna Science City), Soraku-gun, Kyoto 619-0237 Japan
{marc.delcroix,ogawa.atsunori,nakatani.tomohiro,nakamura.atsushi}@lab.ntt.co.jp
ABSTRACT
Recently, feature compensation techniques that train feature trans-
forms using a discriminative criterion have attracted much interest
in the speech recognition community. Typically, the acoustic fea-
ture space is modeled by a Gaussian mixture model (GMM), and a
feature transform is assigned to each Gaussian of the GMM. Fea-
ture compensation is then performed by transforming features us-
ing the transformation associated with each Gaussian, then summing
up the transformed features weighted by the posterior probability of
each Gaussian. Several discriminative criteria have been investigated
for estimating the feature transformation parameters including max-
imum mutual information (MMI) and minimum phone error (MPE).
Recently, the differenced MMI (dMMI) criterion, which generalizes
MMI and MPE, has been shown to provide competitive performance
for acoustic model training. In this paper, we investigate the use of
the dMMI criterion for discriminative feature transforms and demon-
strate in a noisy speech recognition experiment that dMMI achieves
recognition performance superior to that of MMI or MPE.
Index Terms— Speech recognition, discriminative training, dis-
criminative feature transforms, differenced MMI
1. INTRODUCTION
The use of discriminative criteria for training automatic speech
recognition (ASR) systems has become a standard technique. Indeed,
optimizing such criteria is better correlated with recognition error
reduction than optimizing the standard maximum likelihood (ML) criterion,
leading to consistent improvements in speech recognition accuracy. Work
on discriminative training approaches started with acoustic model
training [1, 2] and was then extended to language model training [3]
and more recently to feature extraction [4, 5, 6]. In particular, the
use of discriminative training for feature transforms has recently
attracted much attention, because of the significant recognition
performance improvement achieved for many speech recognition
tasks [4, 5, 6, 7]. These approaches share the same concept of using
a Gaussian mixture model (GMM) to model the feature space and
associate feature transformation parameters with each Gaussian of
the GMM. A compensated feature vector is obtained by transform-
ing an input feature vector with the transform associated with each
Gaussian of the GMM, then summing up the transformed features
weighted by the posterior probability of each Gaussian. This makes
it possible to employ different transforms for each region of the
feature space [6, 8]. The parameters of the transform associated
with each Gaussian are trained using a discriminative criterion.

*Shinji Watanabe is now with Mitsubishi Electric Research Laboratories
(MERL), watanabe@merl.com.
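As a minimal sketch, the posterior-weighted compensation described above can be written as follows. The diagonal-covariance GMM and the bias-only (SPLICE-style) form of the per-Gaussian transform are assumptions made here for illustration, not the exact transform of any particular method cited in the text.

```python
import numpy as np

def compensate_feature(o_t, means, covs, weights, biases):
    """Sketch of GMM-based feature compensation.

    o_t:     input feature vector, shape (D,)
    means:   GMM component means, shape (K, D)
    covs:    diagonal covariances, shape (K, D)
    weights: mixture weights, shape (K,)
    biases:  per-Gaussian bias transforms, shape (K, D)
             (bias-only form assumed for illustration)
    """
    # Diagonal-covariance Gaussian log-likelihood of o_t under each component
    log_lik = -0.5 * np.sum(
        np.log(2 * np.pi * covs) + (o_t - means) ** 2 / covs, axis=1
    )
    # Posterior probability of each Gaussian, computed in the log domain
    log_post = np.log(weights) + log_lik
    log_post -= np.logaddexp.reduce(log_post)
    post = np.exp(log_post)

    # Transform the feature with each Gaussian's transform, then sum the
    # results weighted by the posterior of each Gaussian.
    transformed = o_t + biases          # shape (K, D)
    return post @ transformed           # shape (D,)
```

Because the posteriors sum to one, zero biases leave the feature unchanged, and a shared bias shifts every dimension by that amount; different regions of the feature space otherwise receive different corrections through the posteriors.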
Many different discriminative criteria have been proposed for
training acoustic models and feature transform parameters such as
maximum mutual information (MMI) [1, 5], minimum classifica-
tion error (MCE) [9], minimum phone error (MPE) [2, 4] or boosted
MMI (BMMI) [7]. BMMI modifies the MMI criterion by incorporating
margins, scaled by a boosting factor hereafter called the margin
parameter, into the denominator (corresponding to the competitor
contribution) of the MMI objective function. Recently, a new discriminative
criterion called differenced MMI (dMMI) was proposed to generalize
MPE and BMMI [10]. The objective function of dMMI is defined
as the difference between two BMMI objective functions with two
different margin parameters, thereby combining the regularization
benefits of BMMI with a loose definition of references [9, 11]. It
was shown in [10] that the dMMI objective function can be derived
from the integration of an MPE objective function over a margin
interval. Consequently, depending on the values of the margin pa-
rameters, dMMI becomes equivalent to MMI/BMMI or MPE.
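The difference construction above can be illustrated with a toy numerical sketch. The exact normalization and sign conventions of [10] are not reproduced here, and the hypothesis scores and accuracies below are invented purely for illustration.

```python
import numpy as np

def bmmi_objective(log_score, acc, ref_idx, sigma):
    """Simplified per-utterance BMMI objective (illustrative form).

    log_score: combined acoustic/LM log scores of N candidate
               hypotheses, shape (N,)
    acc:       accuracy of each hypothesis w.r.t. the reference, shape (N,)
    ref_idx:   index of the reference hypothesis
    sigma:     margin (boosting) parameter
    """
    # Low-accuracy competitors are boosted in the denominator.
    denom = np.logaddexp.reduce(log_score - sigma * acc)
    return log_score[ref_idx] - denom

def dmmi_objective(log_score, acc, ref_idx, sigma1, sigma2):
    """dMMI sketched as the normalized difference of two BMMI
    objectives with margin parameters sigma1 < sigma2."""
    return (bmmi_objective(log_score, acc, ref_idx, sigma2)
            - bmmi_objective(log_score, acc, ref_idx, sigma1)) / (sigma2 - sigma1)

# Toy example: three competing hypotheses, reference at index 0.
log_score = np.array([-10.0, -9.5, -11.0])
acc = np.array([1.0, 0.6, 0.2])
```

As the two margins approach each other, this difference quotient tends to the derivative of the BMMI objective with respect to its margin, which is the sense in which the construction interpolates toward an MPE-like criterion; widely separated margins recover BMMI-like behavior.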
The dMMI criterion has been shown to achieve competitive
performance in various tasks when used for training acoustic mod-
els [10]. In this paper we investigate the use of the dMMI dis-
criminative criterion for training the feature transform parameters.
We demonstrate experimentally that dMMI is more robust to mis-
matches between training and testing conditions, and can provide
superior recognition performance compared with conventional ap-
proaches such as MMI (MMI-SPLICE [5]), MPE (fMPE [4]) and
BMMI (fBMMI [7]). As with MMI-SPLICE [5], we use a noisy speech
recognition task to evaluate our proposal. Specifically, we use
the PASCAL-CHiME challenge
task, which consists of speech command recognition in the presence
of highly non-stationary noise [12].
The organization of the paper is as follows. In section 2 we
review the principle of discriminative feature transforms and derive
the estimation of the transform parameters using the dMMI criterion. We
then present some experimental results comparing dMMI and MMI
in section 3.
2. DISCRIMINATIVE FEATURE TRANSFORMS
There have been several proposals regarding the implementation of
discriminative feature transforms [4, 5, 6]. These approaches share
the common idea of transforming input feature vectors ot given
4753 978-1-4673-0046-9/12/$26.00 ©2012 IEEE ICASSP 2012