CROSS-LANGUAGE MAPPING FOR SMALL-VOCABULARY ASR IN UNDER-RESOURCED LANGUAGES: INVESTIGATING THE IMPACT OF SOURCE LANGUAGE CHOICE

Anjana Vakil and Alexis Palmer
University of Saarland
Department of Computational Linguistics and Phonetics
Saarbrücken, Germany

ABSTRACT

For small-vocabulary applications, a mapped pronunciation lexicon can enable speech recognition in a target under-resourced language using an out-of-the-box recognition engine for a high-resource source language. Existing algorithms for cross-language phoneme mapping enable the fully automatic creation of such lexicons using just a few minutes of audio, making speech-driven applications in any language feasible. What such methods have not considered is whether careful selection of the source language based on the linguistic properties of the target language can improve recognition accuracy; this paper reports on a preliminary exploration of this question. Results from a first case study seem to indicate that phonetic similarity between target and source language does not significantly impact accuracy, underscoring the language-independence of such techniques.

Index Terms: under-resourced languages, speech recognition, lexicon building, phoneme mapping

1. INTRODUCTION

In recent years it has been demonstrated that speech recognition interfaces can be extremely beneficial for applications in the developing world, particularly in communities where literacy rates are low or where PCs and internet connections are not always available [1, 2, 3]. Typically, the languages spoken in such communities are under-resourced, such that the large audio corpora generally needed to train or adapt recognition engines are unavailable.
However, in the absence of a recognition engine trained for the target under-resourced language (URL), an existing recognizer for a completely unrelated high-resource language (HRL), such as English, can be used to perform small-vocabulary recognition tasks in the URL. All that is needed is a pronunciation lexicon mapping each term in the target vocabulary to one or more sequences of phonemes in the HRL, i.e. phonemes which the recognizer can model.

While the mapped pronunciations could be hand-written by an expert linguist familiar with the two languages, algorithms such as the "Salaam" method [3, 4, 5] can create these pronunciations automatically from just a few minutes of data, and have been shown to yield higher recognition accuracy than hand-coded pronunciations [3, 4]. The automatic technique also has the advantage of not depending on any expert knowledge of the source or target language, or of the relationship between them.

However, it is conceivable that recognition accuracy for a given target URL will vary depending on the source HRL used, as the source/target combination determines the degree to which the sound systems of the two languages differ, and thus the difficulty of the pronunciation mapping task. More specifically, we expect that by selecting a source language that maximizes the overlap between its phoneme inventory and that of the target (under-resourced) language, we can reduce the difficulty of phoneme mapping and thereby find better pronunciation sequences for the target terms, which should lead to increased accuracy in the recognition task.

We have begun to test this hypothesis by comparing recognition results for pronunciations generated for words in a target URL (Yoruba) using the Salaam method with two different source HRL recognizers (English and French).
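To make the two notions above concrete, the following is a minimal illustrative sketch (not the authors' implementation, and not real Yoruba/English data): a mapped lexicon pairs each target-language (URL) term with candidate phoneme sequences from the source-language (HRL) recognizer's inventory, and a crude set-overlap score serves as a stand-in for the phonetic similarity the hypothesis refers to. All phoneme symbols and inventories here are hypothetical examples.

```python
# Toy mapped lexicon: URL term -> one or more candidate HRL phoneme
# sequences (ARPAbet-like symbols, purely illustrative).
mapped_lexicon = {
    "bata": [["B", "AA", "T", "AA"], ["B", "AH", "T", "AH"]],
    "omi":  [["OW", "M", "IY"]],
}

def inventory_overlap(target_phonemes, source_phonemes):
    """Fraction of the target phoneme inventory covered by the source
    inventory: a rough proxy for the source/target phonetic similarity
    whose effect on recognition accuracy the paper investigates."""
    target = set(target_phonemes)
    return len(target & set(source_phonemes)) / len(target)

# Hypothetical mini-inventories, chosen only to show the comparison.
target_inv = {"b", "t", "m", "a", "o", "i", "gb", "kp"}
source_a = {"b", "t", "m", "a", "o", "i", "e", "u"}  # larger overlap
source_b = {"b", "t", "m", "e", "u", "r", "s", "z"}  # smaller overlap

print(inventory_overlap(target_inv, source_a))  # 0.75
print(inventory_overlap(target_inv, source_b))  # 0.375
```

Under the hypothesis stated above, a source language scoring like `source_a` would be the better choice; the paper's case study asks whether this difference actually matters in practice.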
The aim of this paper is to present our experiment and findings, and to discuss their implications for language-mapping techniques for language-independent small-vocabulary ASR.

2. BACKGROUND AND RELATED WORK

Many commercial speech recognition systems offer high-level Application Programming Interfaces (APIs) that make adding voice recognition capabilities to an application as simple as specifying (in text) the words or phrases to be recognized; this requires very little general technical expertise, and virtually no knowledge of the inner workings of the recognition engine. If the target language is supported by the system (Microsoft's Speech Platform, for example, currently supports recognition and synthesis for 26 languages/dialects [6]), this makes it very easy for small-scale software developers (i.e. individuals or small organizations without much funding) to create new speech-driven applications.