NOISE AND SPEAKER COMPENSATION IN THE LOG FILTER BANK DOMAIN
Vikas Joshi, Raghavendra Bilgi, S. Umesh
Department of Electrical Engineering
Indian Institute of Technology, Madras, India
[ee10s001, ee10s009, umeshs]@ee.iitm.ac.in
L. Garcia, C. Benitez
Dept of Signal Theory, Telematics and Communications
University of Granada, Spain
[luzgm, carmen]@ugr.es
ABSTRACT
In this paper, we propose a method to compensate for noise and
speaker-variability directly in the Log filter-bank (FB) domain, so
that MFCC features are robust to noise and speaker-variations. For
noise-compensation, we use the Vector Taylor Series (VTS) approach
in the Log FB domain, and speaker-normalization is also done in
the Log FB domain using linear Vocal Tract Length Normalization (VTLN)
matrices. For VTLN, optimal selection of the warp-factor is done in the Log
FB domain using a canonical GMM, avoiding the two-pass approach
needed by an HMM. Further, this can be efficiently
implemented using sufficient statistics obtained from the GMM and
the FB-VTLN-matrices. The warp-factor selection using GMM can
also be done in cepstral domain by applying DCT matrices with-
out the usual approximations associated with conventional linear-
VTLN. The elegance of the proposed approach is that given the
speech data, we obtain directly MFCC features that are robust to
noise and speaker-variations. The proposed approach shows a significant
relative improvement of 31% over the baseline on the Aurora-4 task.
Index Terms— Speaker Normalization, Noise Compensation,
VTS, TVTLN, Noise and Speaker compensation
1. INTRODUCTION
Automatic speech recognition (ASR) systems are vulnerable to both
noise and inter-speaker variations. Several techniques for noise
compensation and speaker normalization have been proposed in the
literature, and often the efficacy of these methods is studied in
isolation without considering the effect of the other. Recently, there
have been some studies that attempt to compensate both noise and
speaker variability and then investigate their combined effect on
the recognition performance [1][2][3]. However, in most of these
studies, MFCC features are first extracted from noisy speech and
attempts are made to compensate for noise followed by speaker-
normalization. Histogram equalization and Vector Taylor Series
(VTS) are two commonly used techniques for noise-compensation,
while Maximum Likelihood Linear Regression (MLLR) and VTLN
are commonly used methods for speaker-normalization. In
order to do speaker-normalization, MLLR/VTLN require an initial
(first-pass) recognition which is used to estimate the normalization
parameters before a final recognition is done, i.e. a two-pass approach.
Recently, a linear-VTLN approach has been proposed [4]
which allows VTLN to be implemented as a feature transformation.
However, linear-VTLN warped features are only an approximation
to conventional-VTLN warped features, since the cepstral features
are usually truncated to 13 coefficients, which are then used with
the inverse DCT.

This work was supported under the Indo-Spanish Joint Program of
Cooperation in Science and Technology. The Indian group is supported
under the projects DST/INT/SPAIN/P-5 and DST/EECE/058 of Ministry of
Science and Technology. The Spanish Group is supported under project
ACI2009-0892 by the Ministry of Science and Innovation.
[Figure: Noisy Speech -> FRONT END SP BLOCK (VTS + TVTLN) -> Noise Compensated and Speaker Normalized Features]
Fig. 1: Single block structure for Noise and Speaker Compensation
In this paper, we propose a method where noise and speaker-
normalization are done during the feature extraction step, so that
given the noisy speech data we obtain MFCC features that are noise
and speaker compensated as illustrated in Fig. 1. In our proposed ap-
proach we use VTS for noise compensation and VTLN for speaker
normalization, with both approaches implemented in the Log FB domain.
Our studies show that VTS performs better in the Log FB domain than
in the cepstral domain; this is discussed in Section 4.
In Section 2.2 we discuss the advantage of warping in the Log
FB domain as compared to the cepstral domain. In this approach,
given Log-FB output of noisy speech, VTS returns a cleaned Log
FB output. VTLN is then done by applying linear-VTLN matrix on
the VTS-cleaned Log FB output to give a VTS-cleaned and VTLN-
warped Log FB. Since the VTLN transformation is a square transfor-
mation, there are no truncation errors unlike linear-VTLN in cepstral
domain. Finally, the two-pass approach for speaker-normalization is
avoided by finding the optimal warp-factor with respect to a canon-
ical Gaussian Mixture Model (GMM) built from VTS-cleaned Log
FB coefficients. Further, the likelihood calculation for the optimal
warp-factor can be efficiently implemented using sufficient statistics
and the FB-warp matrices [5].
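
The warp-factor selection described above can be illustrated with a minimal sketch. It assumes that VTS-cleaned Log FB features are already available as a (frames x channels) array and that the canonical GMM has diagonal covariances. The function names, the interpolation-based warp matrix and the candidate warp-factors are hypothetical choices made for illustration; they are not the FB-VTLN matrices of the paper. The likelihood is computed by brute force rather than with the sufficient statistics of [5], and any Jacobian term is omitted for brevity.

```python
import numpy as np

def interp_warp_matrix(num_channels, alpha):
    """Hypothetical square (M x M) warp matrix built by linear interpolation
    along the filter-bank channel axis. Stand-in for an FB-VTLN matrix."""
    A = np.zeros((num_channels, num_channels))
    for i in range(num_channels):
        pos = min(alpha * i, num_channels - 1)  # warped source position
        lo = int(np.floor(pos))
        hi = min(lo + 1, num_channels - 1)
        frac = pos - lo
        A[i, lo] += 1.0 - frac
        A[i, hi] += frac
    return A

def gmm_loglik(X, weights, means, variances):
    """Total log-likelihood of frames X (T x M) under a diagonal-covariance
    GMM with weights (K,), means (K x M) and variances (K x M)."""
    diff = X[:, None, :] - means[None, :, :]
    log_comp = (np.log(weights)[None, :]
                - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :]
                - 0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2))
    # log-sum-exp over mixture components, then sum over frames
    m = log_comp.max(axis=1, keepdims=True)
    return float(np.sum(m[:, 0] + np.log(np.sum(np.exp(log_comp - m), axis=1))))

def select_warp(X, alphas, gmm):
    """Return the candidate warp-factor whose warped Log FB features score
    highest under the canonical GMM, together with all scores."""
    scores = [gmm_loglik(X @ interp_warp_matrix(X.shape[1], a).T, *gmm)
              for a in alphas]
    return alphas[int(np.argmax(scores))], scores

def dct_matrix(num_ceps, num_channels):
    """Unnormalized DCT-II matrix mapping Log FB outputs to cepstra."""
    n = np.arange(num_channels)
    k = np.arange(num_ceps)[:, None]
    return np.cos(np.pi * k * (n + 0.5) / num_channels)

def warped_mfcc(X, alpha, num_ceps=13):
    """Warp the cleaned Log FB features, then apply the DCT."""
    warped = X @ interp_warp_matrix(X.shape[1], alpha).T
    return warped @ dct_matrix(num_ceps, X.shape[1]).T
```

In this sketch the warp is applied as a square matrix on the full Log FB vector and the DCT truncation happens only afterwards, which mirrors the ordering described in the text. A typical call would be select_warp(X, candidate_alphas, (weights, means, variances)) followed by warped_mfcc(X, best_alpha).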
The paper is organized as follows. Section 2 briefly reviews
the VTS and TVTLN approaches. In Section 3 the proposed approach
is presented. Section 4 compares Log FB domain compensation
with cepstral domain compensation, followed by
experimental results and discussion in Section 5. Conclusions are
presented in Section 6.
2. VTS AND TVTLN IN BRIEF
2.1. VTS noise compensation
The effect of additive noise on the clean speech in Log FB domain
can be modelled as a non-linear transform by [6],
$$ y = x + \log\left(1 + e^{(n - x)}\right) = x + g(x, n) \tag{1} $$
where y is the noisy speech, x is the clean speech and n is the
additive noise. In Eqn. (1), g(x, n) is the non-linear function added
due to the presence of noise. Different variants of VTS exist depending
4709 978-1-4673-0046-9/12/$26.00 ©2012 IEEE ICASSP 2012