Cross-language Synthesis with a Polyglot Synthesizer

Javier Latorre, Koji Iwano, Sadaoki Furui
Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan
{latorre,iwano,furui}@furui.cs.titech.ac.jp

Abstract

In this paper we examine the use of an HMM-based polyglot synthesizer for languages for which very limited or no speech data is available. In a previous study, we presented a system that combines monolingual corpora from several languages to create a polyglot synthesizer. With this synthesizer we can synthesize any of the languages included in the training data with the same output voice and speech quality. In this paper, we approximate the sounds of non-included languages by those available in the polyglot training data. Since the phonetic inventory of a polyglot synthesizer is wider than that of a monolingual one, the approximation of such non-included sounds becomes more accurate, and the perceptual intelligibility therefore increases. Moreover, the performance of a polyglot synthesizer can be further improved by adding a small amount of data from the target language.

1. Introduction

Developing a speech synthesizer for a new language is still a substantial task. In many cases it requires large investments that are nowadays profitable for only a dozen or so languages. One way to reduce the implementation cost is to reuse speech resources from other languages. Most proposals in this direction are based on a phone mapping, which approximates the sounds of the target language by those of a similar language for which a speech corpus is available, e.g. [1]. Another possible solution is to use a polyglot synthesizer [2]. The wider "palette" of sounds available in a polyglot synthesizer, compared with a monolingual one, can make it easier to find appropriate candidates for the sounds of the target language. In this way, the approximated sounds can be closer to the real ones, and the intelligibility of the synthesized speech can be increased.
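As a minimal illustration of the phone-mapping idea described above (a sketch, not code from the paper), a phone of the target language can be mapped to the most similar phone in the available inventory by comparing articulatory feature sets. The phone symbols and feature sets below are hypothetical examples; real systems use full IPA-based feature descriptions:

```python
# Sketch: map an unseen target-language phone onto the closest phone
# in a polyglot inventory by articulatory-feature similarity.
# Inventory and features below are hypothetical, for illustration only.

POLYGLOT_INVENTORY = {
    "s":  {"voiceless", "alveolar", "fricative"},
    "sh": {"voiceless", "postalveolar", "fricative"},
    "t":  {"voiceless", "alveolar", "plosive"},
    "th": {"voiceless", "dental", "fricative"},
}

def map_phone(target_features, inventory):
    """Return the inventory phone whose feature set has the largest
    Jaccard similarity with the target phone's feature set."""
    def jaccard(a, b):
        return len(a & b) / len(a | b)
    return max(inventory, key=lambda p: jaccard(target_features, inventory[p]))

# A hypothetical target-language phone: a voiceless dental fricative.
print(map_phone({"voiceless", "dental", "fricative"}, POLYGLOT_INVENTORY))
# -> "th"
```

The wider the inventory, the more likely an exact or near-exact feature match exists, which is the intuition behind preferring a polyglot inventory over a monolingual one.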
In [3], we proposed a new approach to polyglot synthesis that consists of training a language-independent HMM-based synthesizer with speech resources from several languages. For cross-language speech recognition, it has been shown that a multilingual recognizer built in this way can outperform even the best-matched language-dependent recognizer [4]. Moreover, if a small amount of data from the target language becomes available, it can be used to improve the performance of such an HMM-based polyglot synthesizer, either by adapting the polyglot synthesizer to the new language with that data [5], or by including the new data in the training of the polyglot synthesizer.

2. HMM-based polyglot speech synthesis

A polyglot synthesizer is a system that can generate speech in different languages with the same voice. The two main approaches so far have been a) to record a corpus from a polyglot speaker [2], or b) to define a phonetic mapping between the phones of the language to be synthesized and the phones available in the database [6]. In [3] we proposed a new approach that consists of combining monolingual corpora from several speakers in different languages to train a language-independent and speaker-independent HMM-based synthesizer. The central assumption of our approach is that the average voice created by mixing data from several speakers tends to be language independent and can therefore be considered a polyglot voice. Figure 1 shows the general schema of our system. Since our method requires no human polyglot talent, it can be extended to any number of languages. Furthermore, since no phone mapping is needed for the languages included in the mixture, the perceptual intelligibility when synthesizing these languages is higher, and the level of foreign accent lower, than with methods based on phone mapping. The problem of synthesizing speech from an average voice is that it usually sounds impersonal.
Moreover, there can be a lack of coherence in the resulting output voice, because not all the models are trained with data from the same speakers. To solve these two problems, we apply supervised Maximum Likelihood Linear Regression (MLLR) to adapt the average voice to the voice of a target speaker. Finally, we apply a synthesis algorithm [7] to the adapted HMMs to generate speech in any of the training languages, independently of the language spoken by the target speaker.

[Figure 1: General schema of an HMM-based polyglot synthesizer. Diagram: monolingual corpora (10 speakers each in languages L1 and L2) feed HMM training, yielding a speaker-independent (SI) polyglot HMM; supervised MLLR adaptation with data from speaker S in language L2 yields a speaker-dependent (SD) polyglot HMM; HMM-based synthesis then converts text in language L1 into synthetic speech.]

INTERSPEECH 2005, September 4-8, Lisbon, Portugal
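The MLLR adaptation step described above can be sketched as follows. Each Gaussian mean of the average-voice model is transformed affinely toward the target speaker, mu' = A·mu + b, where the transform (A, b) is estimated from the adaptation data. This sketch assumes a single global regression class and uses random toy values in place of real model means and an estimated transform:

```python
import numpy as np

# Sketch of the MLLR mean update used to adapt the average
# (speaker-independent) voice toward a target speaker's voice:
# each Gaussian mean mu becomes mu' = A @ mu + b.
# A single global regression class is assumed here; real systems
# tie separate transforms to classes of related Gaussians.

def mllr_adapt_means(means, A, b):
    """Apply one affine MLLR transform to a batch of mean vectors.

    means: (num_gaussians, dim) array of speaker-independent means
    A:     (dim, dim) rotation/scaling part of the transform
    b:     (dim,) bias part of the transform
    """
    return means @ A.T + b

rng = np.random.default_rng(0)
dim, n = 3, 4
means = rng.standard_normal((n, dim))            # toy SI means
A = np.eye(dim) + 0.1 * rng.standard_normal((dim, dim))
b = 0.5 * rng.standard_normal(dim)               # toy estimated transform

adapted = mllr_adapt_means(means, A, b)
assert adapted.shape == (n, dim)
```

With the identity transform (A = I, b = 0) the means are unchanged, which is the degenerate case of having no adaptation data; as adaptation data accumulates, the estimated transform moves the whole average-voice model toward the target speaker while keeping the models' relative structure, which is why a single voice identity is imposed across all languages.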