928 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 3, MARCH 2007

Gaussian Mixture Clustering and Language Adaptation for the Development of a New Language Speech Recognition System

Nikos Chatzichrisafis, Vassilios Diakoloukas, Vassilios Digalakis, and Costas Harizakis

Abstract—The porting of a speech recognition system to a new language is usually a time-consuming and expensive process since it requires collecting, transcribing, and processing a large amount of language-specific training sentences. This work presents techniques for improved cross-language transfer of speech recognition systems to new target languages. Such techniques are particularly useful for target languages where minimal amounts of training data are available. We describe a novel method to produce a language-independent system by combining acoustic models from a number of source languages. This intermediate language-independent acoustic model is used to bootstrap a target-language system by applying language adaptation. For our experiments, we use acoustic models of seven source languages to develop a target Greek acoustic model. We show that our technique significantly outperforms a system trained from scratch when less than 8 h of read speech is available.

Index Terms—Clustering methods, languages, speech recognition.

I. INTRODUCTION

DEVELOPING acoustic models for a new language requires large amounts of speech samples that need to be collected, transcribed, and processed to efficiently train the parameters of the acoustic model. Such speech databases have been created for major languages, including a variety of speaking conditions and tasks. For new languages, the collection, transcription, and processing of such amounts of training data accounts for the largest portion of the time needed to develop the new acoustic model and represents an important cost factor.
Furthermore, some of the European and Asian languages, for which well-trained speech recognizers already exist, represent only a small portion of the hundreds of languages worldwide. The need for the rapid development of speech recognition applications could emerge for many of these languages at any time, based on the continuously varying economic and political situation.

To alleviate the development burden for new acoustic models, several techniques have been proposed in the literature. All of these techniques are applied in three phases, which are presented below.

In the first phase, cross-language phone mappings that identify similar speech sounds across languages have to be obtained. In [1], [2], and [3], this is accomplished using a knowledge-based methodology which relies only on acoustic-phonetic categorizations. These categorizations are based on the articulatory representations of the phonemes across languages, which have been defined for several languages worldwide by several organizations, as in [4] and [5]. An alternative, automatic approach is proposed in [1] based on confusion matrices. The automatic approach allows subphonetic mappings as well; however, it requires some data from the target language.

Manuscript received December 29, 2005; revised July 9, 2006. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Dilek Hakkani-Tur.
N. Chatzichrisafis was with the Department of Electronics and Computer Engineering, Technical University of Crete, 73100 Chania, Greece. He is now with the Translation Department, Geneva University, Geneva 1211, Switzerland (e-mail: Nikos.Chatzichrisafis@vozZup.com).
V. Diakoloukas, V. Digalakis, and C. Harizakis are with the Department of Electronics and Computer Engineering, Technical University of Crete, 73100 Chania, Greece (e-mail: vas@telecom.tuc.gr; vdiak@telecom.tuc.gr; harizak@speech.gr).
Digital Object Identifier 10.1109/TASL.2006.885259
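The automatic, confusion-matrix-based mapping of the first phase can be sketched as follows. This is an illustrative simplification, not the paper's implementation: it assumes a confusion matrix has already been obtained by decoding target-language speech with a source-language recognizer, and the phone names and counts below are hypothetical. Each target phone is simply mapped to the source phone it is most often confused with.

```python
# Illustrative sketch of confusion-matrix phone mapping (phase 1).
# counts[t][s] = how often target phone t was recognized as source
# phone s when a source-language recognizer decoded target speech.

def map_phones(confusion, target_phones, source_phones):
    """Map each target phone to its most-confused source phone."""
    mapping = {}
    for i, t in enumerate(target_phones):
        row = confusion[i]
        best = max(range(len(source_phones)), key=lambda j: row[j])
        mapping[t] = source_phones[best]
    return mapping

# Hypothetical counts for three Greek phones decoded by an English recognizer.
target = ["gamma", "delta", "theta"]
source = ["g", "dh", "th"]
counts = [
    [42, 3, 1],   # "gamma" mostly recognized as English /g/
    [2, 37, 8],   # "delta" mostly recognized as English /dh/
    [1, 6, 51],   # "theta" mostly recognized as English /th/
]

print(map_phones(counts, target, source))
# → {'gamma': 'g', 'delta': 'dh', 'theta': 'th'}
```

In practice, as the text notes, this approach extends to subphonetic units and requires at least some target-language data to populate the confusion matrix, whereas the knowledge-based mapping does not.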
In the second phase, a language-independent (LI) acoustic model is constructed using resources such as speech data and acoustic model parameters from several source languages, as well as the cross-language phone or subphone mappings obtained in the previous phase. Different strategies have been proposed for the construction of the language-independent model. In [1], each phone or subphone model in the inventory of the target language is constructed from the most similar phone or subphone model among the source languages. In [2], the language-independent phone models for each of the IPA symbols were trained using the corresponding training data from the source languages. A similar technique is described in [3], although the training process of the models differs. Two other approaches are investigated in [3]. In the first, each language-specific phoneme is trained with data from its own language; thus, the trained system consists of a large number of phone models. The only multilingual component applied is a global linear discriminant analysis (LDA) matrix, which is applied to reduce the size of the feature vectors of all source languages [6]. Alternatively, the Gaussian components of the mixtures are trained with data from all languages and shared across the language phone models for each common IPA symbol, but the mixture weights remain specific to each language model. In all of the above approaches, the training data of the source languages must be available.

In the last phase, the final target-language acoustic model is constructed using the language-independent model to bootstrap a training process, when sufficient amounts of target-language

1558-7916/$25.00 © 2006 IEEE
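The shared-component strategy attributed to [3] above can be illustrated with a minimal sketch. This is an assumption-laden toy example, not the cited system: features are reduced to scalars, and the component parameters and weights below are invented. The key point it demonstrates is that all languages evaluate the same pool of Gaussians, while each language applies its own mixture weights.

```python
import math

def gaussian(x, mean, var):
    """Scalar Gaussian density (toy stand-in for multivariate components)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

# Shared component pool (mean, variance), notionally trained on pooled
# multilingual data for one common IPA symbol. Values are hypothetical.
components = [(0.0, 1.0), (2.0, 0.5), (5.0, 2.0)]

# Language-specific mixture weights over the SAME shared components.
weights = {
    "english": [0.6, 0.3, 0.1],
    "greek":   [0.2, 0.5, 0.3],
}

def likelihood(x, language):
    """Mixture likelihood: shared Gaussians, language-specific weights."""
    return sum(w * gaussian(x, m, v)
               for w, (m, v) in zip(weights[language], components))

# The same observation scores differently under each language's weights,
# even though the Gaussian components themselves are shared.
print(likelihood(1.8, "english"))
print(likelihood(1.8, "greek"))
```

Sharing the component pool keeps the multilingual system compact, while the per-language weights preserve each language's phone-model identity, which matches the trade-off the text describes.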