NOISE AND SPEAKER COMPENSATION IN THE LOG FILTER BANK DOMAIN

Vikas Joshi, Raghavendra Bilgi, S. Umesh
Department of Electrical Engineering
Indian Institute of Technology, Madras, India
[ee10s001, ee10s009, umeshs]@ee.iitm.ac.in

L. Garcia, C. Benitez
Dept of Signal Theory, Telematics and Communications
University of Granada, Spain
[luzgm, carmen]@ugr.es

ABSTRACT

In this paper, we propose a method to compensate for noise and speaker variability directly in the Log filter-bank (FB) domain, so that the resulting MFCC features are robust to noise and speaker variations. For noise compensation, we use the Vector Taylor Series (VTS) approach in the Log FB domain, and speaker normalization is also done in the Log FB domain using linear Vocal Tract Length Normalization (VTLN) matrices. For VTLN, the optimal warp-factor is selected in the Log FB domain using a canonical GMM, avoiding the two-pass approach needed by an HMM. Further, this can be efficiently implemented using sufficient statistics obtained from the GMM and the FB-VTLN matrices. The warp-factor selection using the GMM can also be done in the cepstral domain by applying DCT matrices, without the usual approximations associated with conventional linear-VTLN. The elegance of the proposed approach is that, given the speech data, we directly obtain MFCC features that are robust to noise and speaker variations. The proposed approach shows a significant relative improvement of 31% over the baseline on the Aurora-4 task.

Index Terms: Speaker Normalization, Noise Compensation, VTS, TVTLN, Noise and Speaker compensation

1. INTRODUCTION

Automatic speech recognition (ASR) systems are vulnerable to both noise and inter-speaker variations. Several techniques for noise compensation and speaker normalization have been proposed in the literature, and often the efficacy of these methods is studied in isolation, without considering the effect of the other.
Recently, there have been some studies that attempt to compensate for both noise and speaker variability and then investigate their combined effect on recognition performance [1][2][3]. However, in most of these studies, MFCC features are first extracted from noisy speech, and noise compensation is then attempted, followed by speaker normalization. Histogram equalization and Vector Taylor Series (VTS) are two commonly used techniques for noise compensation, while Maximum Likelihood Linear Regression (MLLR) and VTLN are commonly used for speaker normalization. To perform speaker normalization, MLLR/VTLN require an initial (first-pass) recognition, which is used to estimate the normalization parameters before a final recognition is done, i.e., a two-pass approach. Recently, a linear-VTLN approach has been proposed [4] which allows VTLN to be implemented as a feature transformation. However, linear-VTLN warped features are only an approximation to conventional-VTLN warped features, since the cepstral features are usually truncated to 13 coefficients, which are then used with the inverse DCT.

Footnote: This work was supported under the Indo-Spanish Joint Program of Cooperation in Science and Technology. The Indian group is supported under the projects DST/INT/SPAIN/P-5 and DST/EECE/058 of the Ministry of Science and Technology. The Spanish group is supported under project ACI2009-0892 by the Ministry of Science and Innovation.

[Fig. 1: Single block structure for Noise and Speaker Compensation. Diagram labels: Noisy Speech; FRONT END; SP BLOCK (VTS + TVTLN); Noise Compensated and Speaker Normalized Features.]

In this paper, we propose a method where noise compensation and speaker normalization are done during the feature extraction step, so that, given the noisy speech data, we obtain MFCC features that are noise and speaker compensated, as illustrated in Fig. 1.
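The truncation issue noted above for cepstral-domain linear-VTLN can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: the filter-bank size of 23, the random Log FB vector, the warp factor of 1.1, and the helper `toy_warp_matrix` (a simple linear-interpolation matrix) are all assumptions made for illustration, and the cepstral-domain transform is assumed to take the form "inverse DCT of the 13 truncated cepstra, warp, DCT". The sketch compares warping the full Log FB vector with a square matrix against warping a vector reconstructed from truncated cepstra.

```python
import numpy as np

def dct_matrix(m):
    """Orthonormal DCT-II matrix of size m x m."""
    k = np.arange(m)[:, None]
    n = np.arange(m)[None, :]
    c = np.sqrt(2.0 / m) * np.cos(np.pi * k * (2 * n + 1) / (2 * m))
    c[0, :] /= np.sqrt(2.0)
    return c

def toy_warp_matrix(m, alpha):
    """Hypothetical stand-in for a linear-VTLN matrix (NOT the paper's
    FB-VTLN matrix): output channel i is a linear interpolation of the
    input at position alpha * i, clipped to the last channel."""
    w = np.zeros((m, m))
    for i in range(m):
        pos = min(alpha * i, m - 1)
        lo = int(np.floor(pos))
        hi = min(lo + 1, m - 1)
        frac = pos - lo
        w[i, lo] += 1.0 - frac
        w[i, hi] += frac
    return w

M, K = 23, 13                      # assumed: 23 filters, 13 cepstra kept
rng = np.random.default_rng(0)
x = rng.normal(size=M)             # stand-in for one Log FB frame
C = dct_matrix(M)
C_k = C[:K]                        # DCT followed by truncation to K rows
W = toy_warp_matrix(M, 1.1)

# Log FB domain: warp the full vector with the square matrix, then DCT.
mfcc_fb = C_k @ (W @ x)

# Cepstral domain: start from truncated cepstra, go back through the
# (lossy) inverse DCT, warp, and take the truncated DCT again.
mfcc_cep = C_k @ W @ C_k.T @ (C_k @ x)

err = np.max(np.abs(mfcc_fb - mfcc_cep))
print("max abs difference between the two MFCC vectors:", err)
```

Under these toy settings the two MFCC vectors differ, because the 13 retained cepstra cannot reproduce the original 23-dimensional Log FB vector; applying the full 23 x 23 DCT and its transpose, by contrast, recovers the vector up to floating-point precision.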
In our proposed approach, we use VTS for noise compensation and VTLN for speaker normalization, with both implemented in the Log FB domain. Our studies show that VTS performs better in the Log FB domain than in the cepstral domain; this is discussed in Section 4. In Section 2.2, we discuss the advantage of warping in the Log FB domain as compared to the cepstral domain. In this approach, given the Log FB output of noisy speech, VTS returns a cleaned Log FB output. VTLN is then done by applying a linear-VTLN matrix to the VTS-cleaned Log FB output to give a VTS-cleaned and VTLN-warped Log FB output. Since the VTLN transformation is a square transformation, there are no truncation errors, unlike linear-VTLN in the cepstral domain. Finally, the two-pass approach for speaker normalization is avoided by finding the optimal warp-factor with respect to a canonical Gaussian Mixture Model (GMM) built from VTS-cleaned Log FB coefficients. Further, the likelihood calculation for the optimal warp-factor can be efficiently implemented using sufficient statistics and the FB-warp matrices [5].

The paper is organized as follows. Section 2 briefly reviews the VTS and TVTLN approaches. In Section 3, the proposed approach is presented. Section 4 compares Log FB domain compensation with cepstral domain compensation, followed by experimental results and discussion in Section 5. Conclusions are presented in Section 6.

2. VTS AND TVTLN IN BRIEF

2.1. VTS noise compensation

The effect of additive noise on clean speech in the Log FB domain can be modelled as a non-linear transform [6],

$$ y = x + \log\left(1 + e^{(n - x)}\right) = x + g(x, n) \qquad (1) $$

where y is the noisy speech, x is the clean speech and n is the additive noise. In Eqn. (1), g(x, n) is the non-linear function added due to the presence of noise. Different variants of VTS exist depending

4709 978-1-4673-0046-9/12/$26.00 ©2012 IEEE ICASSP 2012