928 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 3, MARCH 2007
Gaussian Mixture Clustering and Language
Adaptation for the Development of a New
Language Speech Recognition System
Nikos Chatzichrisafis, Vassilios Diakoloukas, Vassilios Digalakis, and Costas Harizakis
Abstract—The porting of a speech recognition system to a new
language is usually a time-consuming and expensive process since
it requires collecting, transcribing, and processing a large amount
of language-specific training sentences. This work presents tech-
niques for improved cross-language transfer of speech recognition
systems to new target languages. Such techniques are particularly
useful for target languages where minimal amounts of training
data are available. We describe a novel method to produce a
language-independent system by combining acoustic models from
a number of source languages. This intermediate language-inde-
pendent acoustic model is used to bootstrap a target-language
system by applying language adaptation. For our experiments, we
use acoustic models of seven source languages to develop a target
Greek acoustic model. We show that our technique significantly
outperforms a system trained from scratch when less than 8 h of
read speech is available.
Index Terms—Clustering methods, languages, speech recogni-
tion.
I. INTRODUCTION
DEVELOPING acoustic models for a new language requires
large amounts of speech samples that need to
be collected, transcribed, and processed to efficiently train
the parameters of the acoustic model. Such speech databases
have been created for major languages, including a variety
of speaking conditions and tasks. For new languages, the
collection, transcription, and processing of such amounts of
training data accounts for the largest portion of the time needed
to develop the new acoustic model and represents an important
cost factor.
Furthermore, some of the European and Asian languages, for
which well-trained speech recognizers already exist, represent
only a small portion of the hundreds of languages worldwide.
The need for rapid development of speech recognition applications
could emerge for many of these languages at any time, driven by
continuously changing economic and political conditions.
Manuscript received December 29, 2005; revised July 9, 2006. The associate
editor coordinating the review of this manuscript and approving it for publication
was Dr. Dilek Hakkani-Tür.
N. Chatzichrisafis was with the Department of Electronics and Computer
Engineering, Technical University of Crete, 73100 Chania, Greece. He is now
with the Translation Department, Geneva University, Geneva 1211, Switzerland
(e-mail: Nikos.Chatzichrisafis@vozZup.com).
V. Diakoloukas, V. Digalakis, and C. Harizakis are with the Department
of Electronics and Computer Engineering, Technical University of Crete,
73100 Chania, Greece (e-mail: vas@telecom.tuc.gr; vdiak@telecom.tuc.gr;
harizak@speech.gr).
Digital Object Identifier 10.1109/TASL.2006.885259
To alleviate the development burden for new acoustic models,
several techniques have been proposed in the literature. All of
these techniques proceed in three phases, which are presented
below.
In the first phase, cross-language phone mappings that iden-
tify similar speech sounds across languages have to be obtained.
In [1], [2], and [3], this is accomplished using a knowledge-based
methodology that relies only on acoustic-phonetic categorizations.
These categorizations are based on the articulatory representations
of the phonemes across languages, which have been
defined for many languages worldwide by various organizations,
as in [4] and [5]. An alternative, automatic approach is
proposed in [1] based on confusion matrices. The automatic
approach allows subphonetic mappings as well; however, it
requires some data from the target language.
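The automatic, confusion-matrix approach of [1] can be illustrated with a minimal sketch. The idea, as described above, is that a small amount of target-language speech is decoded with a source-language recognizer, confusion counts between target and source phones are accumulated, and each target phone is then mapped to the source phone it is most often confused with. The helper name, the toy phone inventories, and the confusion counts below are purely illustrative assumptions, not the paper's actual inventories or data.

```python
import numpy as np

def map_phones(confusion, source_phones, target_phones):
    """Map each target-language phone to the source-language phone it is
    most often confused with. `confusion[i, j]` holds the count of times
    source phone j was hypothesized when target phone i was spoken
    (hypothetical helper; counts would come from decoding a small amount
    of target-language speech with the source-language recognizer)."""
    mapping = {}
    for i, tgt in enumerate(target_phones):
        # Pick the most frequently confused source phone for this row.
        mapping[tgt] = source_phones[int(np.argmax(confusion[i]))]
    return mapping

# Toy confusion counts: rows = target phones, columns = source phones.
source_phones = ["p", "t", "k"]
target_phones = ["p'", "t'"]
confusion = np.array([[12, 3, 1],
                      [2, 15, 4]])
print(map_phones(confusion, source_phones, target_phones))
# {"p'": 'p', "t'": 't'}
```

In practice, as noted above, this approach can also operate at the subphonetic level, at the cost of requiring some target-language data to populate the confusion counts.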
In the second phase, a language-independent (LI) acoustic
model is constructed using resources such as speech data and
acoustic model parameters from several source languages, as
well as the cross-language phone or subphone mappings ob-
tained in the previous phase. Several strategies have been
proposed for constructing the language-independent model.
In [1], each phone or subphone model in the
inventory of the target language is constructed from the most
similar phone or subphone model among the source languages.
In [2], the language-independent phone models for each of the
IPA symbols were trained using the corresponding training data
from the source languages. A similar technique is described in
[3], although the training process of the models differs. Two
other approaches are investigated in [3]. In the first, each
language-specific phoneme is trained with data from its own
language; thus, the trained system consists of a large number of
phone models. The only multilingual component is a global
linear discriminant analysis (LDA) matrix, applied to reduce the
dimensionality of the feature vectors of all source languages [6].
Alternatively, the Gaussian components of the mixtures are
trained with data from all languages and shared across
the language phone models for each common IPA symbol, but
the mixture weights remain specific for each language model.
In all of the above approaches, the training data of the source
languages must be available.
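The second approach of [3], shared Gaussians with language-specific mixture weights, can be sketched as follows: a single pool of Gaussian components for a common IPA symbol is trained on pooled multilingual data, while each language keeps its own mixture weights over that shared pool. All parameter values below are toy numbers chosen for illustration, not trained model parameters.

```python
import numpy as np

def gmm_loglik(x, means, variances, weights):
    """Log-likelihood of observation x under a diagonal-covariance GMM,
    computed stably in the log domain."""
    d = len(x)
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    return float(np.logaddexp.reduce(np.log(weights) + log_norm + log_exp))

# Shared Gaussian pool for one IPA symbol, trained on pooled
# multilingual data (toy parameters).
means = np.array([[0.0, 0.0], [2.0, 2.0], [-1.0, 1.0]])
variances = np.ones((3, 2))

# Language-specific mixture weights over the same shared Gaussians
# (illustrative language codes: German "de", Greek "el").
weights = {"de": np.array([0.6, 0.3, 0.1]),
           "el": np.array([0.2, 0.5, 0.3])}

x = np.array([1.8, 2.1])  # observation near the second shared Gaussian
scores = {lang: gmm_loglik(x, means, variances, w) for lang, w in weights.items()}
```

Because the Gaussian parameters are shared, only the small weight vectors differ per language; for the observation above, the "el" model scores higher simply because it places more weight on the nearby shared component.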
In the last phase, the final target-language acoustic model is
constructed using the language-independent model to bootstrap
a training process, when sufficient amounts of target-language
1558-7916/$25.00 © 2006 IEEE