DISCRIMINATIVE FEATURE TRANSFORMS USING DIFFERENCED MAXIMUM MUTUAL INFORMATION

Marc Delcroix, Atsunori Ogawa, Shinji Watanabe*, Tomohiro Nakatani, Atsushi Nakamura

NTT Communication Science Laboratories, NTT Corporation, 2-4, Hikaridai, Seika-cho (Keihanna Science City), Soraku-gun, Kyoto 619-0237, Japan
{marc.delcroix,ogawa.atsunori,nakatani.tomohiro,nakamura.atsushi}@lab.ntt.co.jp

ABSTRACT

Recently, feature compensation techniques that train feature transforms using a discriminative criterion have attracted much interest in the speech recognition community. Typically, the acoustic feature space is modeled by a Gaussian mixture model (GMM), and a feature transform is assigned to each Gaussian of the GMM. Feature compensation is then performed by transforming features using the transformation associated with each Gaussian, and then summing up the transformed features weighted by the posterior probability of each Gaussian. Several discriminative criteria have been investigated for estimating the feature transformation parameters, including maximum mutual information (MMI) and minimum phone error (MPE). Recently, the differenced MMI (dMMI) criterion, which generalizes MMI and MPE, has been shown to provide competitive performance for acoustic model training. In this paper, we investigate the use of the dMMI criterion for discriminative feature transforms and demonstrate in a noisy speech recognition experiment that dMMI achieves recognition performance superior to that of MMI or MPE.

Index Terms: Speech recognition, discriminative training, discriminative feature transforms, differenced MMI

1. INTRODUCTION

The use of discriminative criteria for training automatic speech recognition (ASR) systems has become a standard technique. Indeed, the optimization of such criteria is better correlated with recognition error reduction than the standard maximum likelihood (ML) criterion, leading to consistent improvements in speech recognition accuracy.
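The GMM-based feature compensation summarized in the abstract can be illustrated with a small numerical sketch. This is a toy illustration under simplifying assumptions (a diagonal-covariance GMM and per-Gaussian bias-only transforms; the function names are ours, not from the paper):

```python
import numpy as np

def gmm_posteriors(o, means, variances, weights):
    """Posterior p(k | o) of each diagonal-covariance Gaussian for vector o."""
    log_p = np.array([
        np.log(w) - 0.5 * np.sum(np.log(2.0 * np.pi * v) + (o - m) ** 2 / v)
        for m, v, w in zip(means, variances, weights)
    ])
    log_p -= log_p.max()           # stabilize before exponentiation
    p = np.exp(log_p)
    return p / p.sum()

def compensate(o, means, variances, weights, biases):
    """Posterior-weighted sum of per-Gaussian transforms.

    With bias-only transforms, o_hat = o + sum_k p(k|o) * b_k, so each
    region of the feature space receives its own correction."""
    p = gmm_posteriors(o, means, variances, weights)
    return o + p @ np.stack(biases)
```

In practice the per-Gaussian transforms can be richer than a simple bias, and it is their parameters that are trained with a discriminative criterion.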
Work on discriminative training approaches started with acoustic model training [1, 2] and was then extended to language model training [3] and, more recently, to feature extraction [4, 5, 6]. In particular, the use of discriminative training for feature transforms has recently attracted much attention because of the significant recognition performance improvements achieved on many speech recognition tasks [4, 5, 6, 7]. These approaches share the same concept of using a Gaussian mixture model (GMM) to model the feature space and associating feature transformation parameters with each Gaussian of the GMM. A compensated feature vector is obtained by transforming an input feature vector with the transform associated with each Gaussian of the GMM, and then summing up the transformed features weighted by the posterior probability of each Gaussian. This makes it possible to employ different transforms for each region of the feature space [6, 8]. The parameters of the transform associated with each Gaussian are trained using a discriminative criterion.

Many different discriminative criteria have been proposed for training acoustic models and feature transform parameters, such as maximum mutual information (MMI) [1, 5], minimum classification error (MCE) [9], minimum phone error (MPE) [2, 4] and boosted MMI (BMMI) [7]. BMMI modifies the MMI criterion by incorporating margins into the denominator (corresponding to the competitor contribution) of the MMI objective function through a boosting factor, hereafter called the margin parameter. Recently, a new discriminative criterion called differenced MMI (dMMI) was proposed that generalizes MPE and BMMI [10]. The objective function of dMMI is defined as the difference between two BMMI objective functions with two different margin parameters, thereby combining the regularization benefits of BMMI with a loose definition of references [9, 11].

*Shinji Watanabe is now with Mitsubishi Electric Research Laboratories (MERL), watanabe@merl.com.
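The difference-of-BMMI structure can be made concrete with a toy sketch over an N-best list. Sign and normalization conventions vary across papers; the version below uses an e^{-sigma * accuracy} boosting term in the denominator and normalizes the difference by the margin gap, our assumptions for illustration only, so that as the two margins approach each other the objective tends to the MPE-style posterior-weighted expected accuracy. It illustrates the structure, not the paper's exact formulation:

```python
import math

def bmmi(scores, accuracies, ref, sigma, kappa=1.0):
    """Boosted-MMI objective for one utterance over an N-best list.

    scores[s]:     joint log score log p(X|s) + log P(s) of hypothesis s
    accuracies[s]: phone accuracy of s against the reference
    sigma:         margin (boosting) parameter; sigma = 0 gives plain MMI
    """
    den = [kappa * scores[s] - sigma * accuracies[s] for s in range(len(scores))]
    m = max(den)  # log-sum-exp with max subtraction for numerical stability
    log_den = m + math.log(sum(math.exp(d - m) for d in den))
    return kappa * scores[ref] - log_den

def dmmi(scores, accuracies, ref, sigma1, sigma2, kappa=1.0):
    """dMMI sketch: difference of two BMMI objectives with margins
    sigma1 < sigma2, normalized by 1 / (sigma2 - sigma1)."""
    return (bmmi(scores, accuracies, ref, sigma2, kappa)
            - bmmi(scores, accuracies, ref, sigma1, kappa)) / (sigma2 - sigma1)
```

For example, with two equally scored hypotheses of accuracies 1 and 0, dmmi(..., -1e-4, 1e-4) is close to 0.5, the expected accuracy under the uniform hypothesis posteriors, showing the MPE-like behavior in the narrow-margin limit.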
It was shown in [10] that the dMMI objective function can be derived by integrating an MPE objective function over a margin interval. Consequently, depending on the values of the margin parameters, dMMI becomes equivalent to MMI/BMMI or MPE.

The dMMI criterion has been shown to achieve competitive performance on various tasks when used for training acoustic models [10]. In this paper, we investigate the use of the dMMI discriminative criterion for training the feature transform parameters. We demonstrate experimentally that dMMI is more robust to mismatches between training and testing conditions, and can provide superior recognition performance compared with conventional approaches such as MMI (MMI-SPLICE [5]), MPE (fMPE [4]) and BMMI (fBMMI [7]). In a similar way to that employed with MMI-SPLICE [5], we use a noisy speech recognition task to evaluate our proposal. In this paper, we use the PASCAL-CHiME challenge task, which consists of speech command recognition in the presence of highly non-stationary noise [12].

The organization of the paper is as follows. In section 2 we review the principle of discriminative feature transforms and derive the transform parameter estimation using the dMMI criterion. We then present experimental results comparing dMMI and MMI in section 3.

2. DISCRIMINATIVE FEATURE TRANSFORMS

There have been several proposals regarding the implementation of discriminative feature transforms [4, 5, 6]. These approaches share the common idea of transforming input feature vectors o_t given

978-1-4673-0046-9/12/$26.00 ©2012 IEEE ICASSP 2012