CROSSLINGUAL ACOUSTIC MODEL DEVELOPMENT FOR AUTOMATIC SPEECH RECOGNITION

Frank Diehl, Asunción Moreno, and Enric Monte

TALP Research Center
Universitat Politècnica de Catalunya (UPC)
Jordi Girona 1-3, 08034 Barcelona, Spain
{frank,asuncion,enric}@gps.tsc.upc.edu

ABSTRACT

In this work we discuss the development of two crosslingual acoustic model sets for automatic speech recognition (ASR). The starting point is a set of multilingual Spanish-English-German hidden Markov models (HMMs). The target languages are Slovenian and French. In the course of the discussion, the problem of defining a multilingual phoneme set and the associated dictionary mapping is considered, and a method is described to circumvent the related problems. The impact of the acoustic source models on the performance of the target systems is analyzed in detail. Several crosslingually defined target systems are built and compared to their monolingual counterparts. It is shown that crosslingually built acoustic models clearly outperform purely monolingual models if only a limited amount of target data is available.

Index Terms: crosslingual, acoustic modelling

1. INTRODUCTION

Enterprises engaged in ASR are usually faced with the question of globalizing their products. This concerns not only big international companies but also smaller businesses. Companies operating telephone assistant systems, as well as automobile manufacturers, demand from their suppliers system components that can be used worldwide. This may mean monolingual operability for multiple languages, but also multilingual usability for multilingual markets or applications.

As state-of-the-art ASR technology relies greatly on the availability of adequate language resources, big efforts were undertaken to construct and distribute publicly available speech and text databases.
Although these efforts were highly successful in terms of covered languages and environmental conditions, companies are still faced with the problem of unavailable training data and the inflexible handling of new languages. A typical scenario is the demand to extend an ASR system to a minority language which is not yet covered by available databases. Alternatively, a speech database in the target language is available but does not match the environmental or dialectal conditions of the target application.

In this work we address the issue of porting an ASR system from one language to another. We examine two target languages, Slovenian and French, and assume that a limited amount of speech material in these target languages is available. The acoustic models of a multilingual Spanish-English-German system serve as a starting point. The chosen application scenario consists of a typical medium-scale task: recognizing a list of so-called phonetically rich words and application words. For the experiments, tied-mixture HMMs are used, which also reflects the idea of a medium-scale or even embedded application.

This work was supported by the CICYT under contract TIC2006-13694-C03-01/TCM and contract TIN2005-08852.

2. BASIC CONCEPTS

With few exceptions [1], recent work on crosslingual acoustic modelling assumes the availability of a certain, though limited, amount of speech material in the target language. Under the additional presumption that speech material and some well-formed acoustic models of one or more source languages are available, three main research lines for crosslingual modelling can be identified. They are:

• Feature compensation
• Model combination
• Model adaptation

In feature compensation the focus lies directly on the acoustic data. The main idea is to transform speech material from a source language to the feature space of the target language [2], [3].
As a result, the sparse target language speech material is augmented, broadening the database for the subsequent HMM training. As feature compensation acts on the feature stream prior to acoustic model definition and training, we name it a pre-processing technique.

The approach of model combination is quite contrary to feature compensation. Instead of building dedicated acoustic models for the target language, acoustic models of several source languages are chosen. That is, multiple source language ASR systems are run in parallel, each configured to recognize the target language. In a post-processing step the hypotheses of all systems are then combined, and the task is to extract the best from each outcome. For the post-processing, ROVER [3] or discriminative model combination (DMC) [4] was explored.

Model adaptation may be seen as an intermediate technique, located between feature compensation and model combination. Differences in the acoustics between languages are seen as an acoustic mismatch problem similar to that of speaker adaptation. Thus, instead of acting directly on the acoustic data (as in the case of feature compensation), classical model adaptation techniques are applied to port the acoustic models of the source language to the target language [5], [6]. In contrast to model combination, only one source model set is used. This model set might be that of a dedicated source language or, preferably, a multilingual model set based on several source languages.

In addition to the acoustic mismatch, crosslingual problems also exhibit a structural mismatch. Because of the different phoneme sets and the different phonotactics of the involved languages, a language-specific definition of the acoustic model set is needed. To overcome this problem an adaptation of the model set by so-called poly-