Local Transformation Models for Speech Recognition

Antonio Miguel, Eduardo Lleida, Alfons Juan*, Luis Buera, Alfonso Ortega and Oscar Saz
I3A, University of Zaragoza, Spain
*DSIC, Polytechnic University of Valencia, Spain
{amiguel,lleida,lbuera,ortega,oskarsaz}@unizar.es  ajuan@dsic.upv.es

Abstract

This paper presents a novel acoustic modeling framework that naturally extends the Hidden Markov Model (HMM) approach. The novel models reduce the errors caused by speaker variability by means of a local spectral mismatch reduction. A more complex and flexible speech production scheme can be assumed, in which local temporal and frequency elastic deformations of the speech are captured by the model. In the new framework the states of a standard HMM, which are usually associated with temporal transitions, are expanded so that a new degree of freedom is provided and it becomes possible to estimate an optimum frequency warping factor at the same time as the decoder finds the best state sequence. In the local spectral warping based models the states become time-frequency states, and the number of parameters remains comparable to that of a standard HMM since, as will be shown, the states share a certain amount of parameters. The novel models are evaluated on the noise-free TIDIGITS corpus, which includes connected digits uttered by male, female and child speakers. It has been found that, under speaker group (age-gender) mismatch conditions, the local frequency warping reduced the Word Error Rate (WER) by 70% on average, using the initial models. When matched speaker group conditions were tested, the error was reduced by 9.7% on average after reestimating the models.

Index Terms: speaker variability, local frequency warping.

1. Introduction

A speech modeling technique for reducing speaker variability is investigated in this paper, since this variability is of great interest due to its impact on the accuracy of Automatic Speech Recognition (ASR) systems.
It will be shown that the presented model provides a mechanism to reduce the ASR error for a wide range of local deformations of the speech parameters across the time and frequency axes.

This work has been supported by the national project TIN 2005-08660-C04-01.

Standard techniques such as Hidden Markov Models (HMMs) successfully reduce speaker variability in the temporal dimension thanks to the time alignment of the utterances to the models by the Viterbi algorithm, capturing the essential information needed for speech recognition tasks. In the HMM framework there also exists a basic mechanism to model the frequency variability due to the speaker, which causes changes in the vocal tract shape. It is provided by the state-dependent observation generating process, which is usually assumed to follow a probability density function (pdf) such as a Gaussian Mixture Model (GMM). The vocal tract shape deviations of a large population of speakers are captured by the state pdfs as different components of the mixture. A number of examples of each shape are then needed so that the components of the mixture can be estimated in the learning process. Therefore, a large amount of Gaussian components and training data are required in order to deal with this source of variability in a simple HMM.

Some methods have appeared to compensate both sources of variability more accurately, especially along the frequency axis. In this paper we focus the experiments on this kind of speaker variability, manifested as the frequency deformations of the spectrum envelope that occur in speaker independent ASR tasks, which are known to have their origin in the instantaneous vocal tract and articulatory shapes.
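As a minimal illustration of the state observation model just described (a sketch with invented parameter values, not the paper's implementation), the log-likelihood of a feature vector under a diagonal-covariance GMM state pdf can be computed as:

```python
import math

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of a feature vector x under a diagonal-covariance GMM.

    In an HMM, each state owns one such mixture; different components can
    absorb the different vocal-tract shapes seen across training speakers.
    """
    comp_logliks = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)  # component weight
        for xd, md, vd in zip(x, mu, var):
            # per-dimension univariate Gaussian log-density
            ll += -0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
        comp_logliks.append(ll)
    # log-sum-exp over mixture components for numerical stability
    m = max(comp_logliks)
    return m + math.log(sum(math.exp(ll - m) for ll in comp_logliks))

# Example: a 2-component mixture, loosely standing in for two
# vocal-tract "shapes" captured by the same state (values invented)
score = gmm_loglik([0.0, 0.5],
                   weights=[0.6, 0.4],
                   means=[[0.0, 0.0], [1.0, 1.0]],
                   variances=[[1.0, 1.0], [1.0, 1.0]])
```

This is the mechanism whose cost the text points out: every vocal-tract shape to be covered needs its own components and enough training examples to estimate them.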
Some methods have appeared previously to compensate for speaker frequency variability, such as Vocal Tract Length Normalization (VTLN) [1, 2] and Maximum Likelihood Linear Regression (MLLR) [3], which reduce the mismatch between data and model; however, those methods compensate the mismatch given previous utterances and transcriptions or extra speaker dependent training data. The model framework proposed here, referred to from here on as the augMented stAte space acousTic dEcoder/modEl (MATE), consists of an expansion of the VTLN methods that provides local transformations to be locally optimized, simultaneously with the decoding of the state sequence in an expanded search trellis. Both the training and the testing of MATE are speaker independent, since the model is expected to capture part of the speaker variability by means of the expanded state space and the inter-transformation transitions.

The first approaches to this paradigm were envisioned in [4] and then followed by [5, 6] in a more general approach. Those methods were intended to normalize the speech signal so that it is better accepted by the model. The model presented in this paper is an evolution of them: the transformation is embedded into the model, allowing a more general formulation and derivation of the model parameter estimation expressions, as will be shown in Sections 3 and 3.2. The transformations of MATE described in this article are a valid generalization of [5] for both sources of variability, time and frequency, but, as the effect of the local temporal warping is less noticeable unless a stressed or pathological speech corpus is tested, the experiments in this article are oriented to showing speaker independent ASR improvements through frequency transformations in the new MATE framework.

The paper is organized as follows. Section 2 reviews the existing techniques used for speaker mismatch reduction.
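The expanded search trellis can be pictured as a Viterbi pass over pairs (HMM state, warp factor) instead of states alone. The following sketch is an invented simplification, not the authors' decoder: the emission scorer `log_emit`, the neighbour-only warp moves and the `warp_penalty` term are hypothetical stand-ins for the inter-transformation transitions mentioned above.

```python
def mate_viterbi(n_frames, n_states, n_warps, log_emit, log_trans,
                 warp_penalty=1.0):
    """Best path score over an augmented state space (state, warp factor).

    log_emit(t, j, k): log-likelihood of frame t in state j under warp k
    log_trans[i][j]:   HMM transition log-probability from state i to j
    Warp changes are limited to neighbouring factors (|k' - k| <= 1) and
    penalised, so the warp evolves locally along the utterance.
    """
    NEG = float("-inf")
    # Initialise: every (state, warp) pair is a possible start (simplification)
    delta = [[log_emit(0, j, k) for k in range(n_warps)] for j in range(n_states)]
    for t in range(1, n_frames):
        new = [[NEG] * n_warps for _ in range(n_states)]
        for j in range(n_states):
            for k in range(n_warps):
                best = NEG
                for i in range(n_states):
                    if log_trans[i][j] == NEG:
                        continue
                    for kp in (k - 1, k, k + 1):  # local warp transition
                        if 0 <= kp < n_warps:
                            cand = delta[i][kp] + log_trans[i][j]
                            if kp != k:
                                cand -= warp_penalty
                            if cand > best:
                                best = cand
                new[j][k] = best + log_emit(t, j, k)
        delta = new
    return max(max(row) for row in delta)
```

Note that all warp hypotheses for a given state can score warped versions of the same observation against the same state pdf, which is consistent with the abstract's remark that the parameter count stays comparable to a standard HMM.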
Section 3 presents the model formulation and the procedure for estimating the model parameters using the EM algorithm. Section 4 includes the results of an experimental study of the new models. Finally, discussion and conclusions are presented in Section 5.

2. Speaker mismatch reduction model based methods

The basic HMM provides a simple, but under certain conditions effective, mechanism for modeling speaker variability which consists

INTERSPEECH 2006 - ICSLP, September 17-21, Pittsburgh, Pennsylvania