Implementation of VTLN for Statistical Speech Synthesis

Lakshmi Saheer 1,2, John Dines 1, Philip N. Garner 1, Hui Liang 1,2

1 Idiap Research Institute, Martigny, Switzerland
2 École Polytechnique Fédérale de Lausanne, Switzerland

lsaheer@idiap.ch, dines@idiap.ch, pgarner@idiap.ch, hliang@idiap.ch

Abstract

Vocal tract length normalization (VTLN) is an important feature normalization technique that can be used to perform speaker adaptation when very little adaptation data is available. It was shown earlier that VTLN can be applied to statistical speech synthesis and gives additive improvements to CMLLR. This paper presents an EM optimization for estimating more accurate warping factors. The EM formulation makes it possible to embed the feature normalization in the HMM training, which allows the warping factors to be estimated more efficiently and enables the use of multiple (appropriate) warping factors for different state clusters of the same speaker.

Index Terms: Vocal tract length normalization, Expectation Maximization Optimization, HMM Synthesis, Adaptation

1. Introduction

The hidden Markov model (HMM) is a popular technique in automatic speech recognition (ASR). Speaker independent (SI) models are built by estimating the HMM parameters from data collected from a large number of speakers. Model adaptation techniques apply linear transformations to the means and variances of an HMM to match the characteristics of the speech of a given speaker. The same techniques can be used to remove the inter-speaker variability in the training data; the resulting speaker adaptively trained (SAT) models outperform SI models in ASR. Feature adaptation, on the other hand, transforms the feature vectors rather than the model parameters. The effects of model adaptation can be accomplished to some extent using feature adaptation techniques (also widely known as speaker normalization techniques).
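The feature-adaptation idea above can be sketched as an affine map applied to each feature frame, as in CMLLR-style feature-space transforms. This is an illustrative sketch only; the function name is hypothetical and the estimation of the transform from adaptation data is not shown.

```python
import numpy as np

def adapt_features(X, A, b):
    """Apply a CMLLR-style affine feature-space transform.

    X : (T, d) array of feature frames
    A : (d, d) transform matrix, b : (d,) bias vector
    Each frame x is mapped to A @ x + b. (Estimating A and b
    from adaptation data is not shown here.)
    """
    return X @ A.T + b
```

With A set to the identity and b to zero the features pass through unchanged, which is the usual sanity check for such a transform.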
The main advantage of speaker normalization is that the number of parameters to be estimated from the adaptation data is generally smaller than for standard model-based adaptation techniques. Hence, adaptation can be carried out with very little adaptation data.

Recently, HMMs have been shown to be capable of performing TTS as well, and with care can produce synthetic speech of a quality comparable to unit selection. This in turn brings the possibilities of adaptation to TTS [1]. A stored average voice can be transformed to sound like the voice represented by the transform for a given speaker. Such transforms are typically linear transforms similar to the ones used in ASR. Speaker normalization techniques can also be used in TTS to generate adapted speech using very little adaptation data, of the order of a few minutes.

Vocal tract length normalization (VTLN) is inspired by the physical observation that the vocal tract length (VTL) varies across speakers, from around 18 cm in males to around 13 cm in females. Formant frequency positions are inversely proportional to VTL, and hence can vary by around 25%. Although implementation details differ, VTLN is generally characterized by a single parameter that warps the spectrum towards that of an average vocal tract, in much the same way that maximum likelihood linear regression (MLLR) transforms can warp towards an average voice.

An efficient implementation of VTLN for synthesis using expectation maximization (EM) with Brent's search optimization is presented in this paper. Optimal warping factors for synthesis are analyzed, and techniques to estimate similar warping factors from the model are examined. Problems with Jacobian normalization for VTLN warping factor estimation are briefly discussed, along with a technique that achieves the best performance for synthesis. This paper also investigates multi-class EM-VTLN estimation in the context of statistical synthesis.
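The single-parameter spectral warping described above can be illustrated with the first-order all-pass (bilinear) transform, whose phase response maps each normalized frequency to a warped one. The sketch below shows that mapping; the function name is an assumption, but the formula is the standard bilinear warping of frequency by a factor α.

```python
import numpy as np

def bilinear_warp(omega, alpha):
    """Warp a normalized frequency omega (radians, in [0, pi]) with a
    first-order all-pass (bilinear) transform.

    alpha is the single VTLN warping factor (-1 < alpha < 1);
    alpha = 0 gives the identity mapping, and the endpoints 0 and pi
    are fixed points of the warp.
    """
    return omega + 2.0 * np.arctan2(alpha * np.sin(omega),
                                    1.0 - alpha * np.cos(omega))
```

For positive α, frequencies in between the fixed points are pushed upwards, mimicking a shorter vocal tract; negative α pushes them downwards.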
The features used for statistical speech synthesis have very high dimensionality (of the order of 25 or 39) when compared to ASR features. There are some issues with VTLN estimation for higher order features, which were presented in earlier work [2] and are further investigated here.

2. VTLN

The main components involved in VTLN are a warping function, a warping factor and an optimization criterion. The all-pass transform approximates the most commonly used warping functions in VTLN [3, 4]. The bilinear transform based warping function has only a single variable, α, as the warping factor, which is representative of the ratio of the VTL of the speaker to the average VTL. The terms warping factor and 'α' refer to the same parameter and are used interchangeably throughout this paper.

A brute-force way of computing the warping factor for each speaker is the maximum likelihood (ML) based grid search technique. The ML optimization is given by [5]:

α̂_s = arg max_α p(x_{α_s} | Θ, w_s) p(α | Θ)    (1)

where x_{α_s} represents the features warped with the warping factor α_s for speaker s, Θ represents the model, and w_s represents the transcription corresponding to the data from which the features are extracted for speaker s. α̂_s represents the best warping factor for the same speaker, and p(α | Θ) is the prior probability of α for a given model.

Preliminary results using VTLN in statistical speech synthesis are presented in [2]. The bilinear transform based warping function is used in an ML optimization framework with a grid search technique. The all-pass transform based normalization is applied to the mel-generalized cepstral (MGCEP) features that are commonly used in statistical speech synthesis. It is shown that VTLN brings in some speaker characteristics and provides additive improvements to CMLLR, especially when there is a limited number of adaptation utterances. In [6], it is
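The ML grid search of Eq. (1) can be sketched as follows. The `log_likelihood` callback is a hypothetical stand-in for evaluating log p(x_α | Θ, w_s) on features warped by α (e.g. via forced alignment against the model), and a flat prior p(α | Θ) is assumed; the function name and grid limits are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def grid_search_alpha(log_likelihood, lo=-0.1, hi=0.1, steps=21):
    """Brute-force ML estimate of the warping factor (cf. Eq. (1)).

    log_likelihood(alpha) -- caller-supplied stand-in for
    log p(x_alpha | Theta, w_s); a flat prior over alpha is assumed,
    so the MAP estimate reduces to the ML estimate on the grid.
    """
    alphas = np.linspace(lo, hi, steps)
    scores = [log_likelihood(a) for a in alphas]
    return float(alphas[int(np.argmax(scores))])
```

The EM formulation presented in this paper replaces this exhaustive evaluation with an embedded optimization, avoiding a full decoding pass per grid point.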