Kullback-Leibler divergence-based ASR training data selection

Evandro Gouvêa
European Media Laboratory GmbH, Heidelberg, Germany
evandro.gouvea@alumni.cmu.edu

Marelie H. Davel
Multilingual Speech Technologies, North-West University, Vanderbijlpark, South Africa
marelie.davel@gmail.com

Abstract

Data preparation and selection affect systems across a wide range of complexities. A system built for a resource-rich language may be so large as to include words borrowed from other languages. A system built for a resource-scarce language may be affected by how carefully the training data is selected and produced. Accuracy is affected by the presence of enough samples of qualitatively relevant information. We propose a method using the Kullback-Leibler divergence to solve two problems related to data preparation: the ordering of alternate pronunciations in a lexicon, and the selection of transcription data. In both cases, we want to guarantee that a particular distribution of n-grams is achieved. In the case of lexicon design, we want to ascertain that phones will be present often enough. In the case of training data selection for resource-scarce languages, we want to make sure that some n-grams are better represented than others. Our proposed technique yields encouraging results.

Index Terms: acoustic model training, lexical model, maximum entropy, Kullback-Leibler divergence, training data selection

1. Introduction

Data selection affects the accuracy of automatic speech recognition (ASR) systems of varying degrees of complexity. An ASR system’s accuracy is affected by how well its knowledge sources can model the unseen test data. The knowledge sources, i.e. the acoustic, pronunciation, and language models, are trained from data that, if not carefully selected, will not have enough samples of important “events”. This lack of relevant data, in turn, results in models that perform poorly.

In the case of acoustic models, data selection affects large and small systems.
Designers of large vocabulary continuous speech recognition (LVCSR) systems in resource-rich languages normally use all available data. However, triphone occurrences differ sharply between text corpora [1]. Therefore, training models with all available data may result in acoustic models that are far from representative of a domain poorly represented in the training data.

Designers of ASR systems in resource-scarce languages, on the other hand, need to be careful about how to pick the training data from the small amount available. As Barnard [2] points out: “The optimal data distribution (for ASR training) is not exactly the same as the natural data distribution of phones / triphones etc. This makes intuitive sense: highly frequent units do not add much to accuracy after a certain point, and very rare units have little impact on test scores – so it is the middle range that needs to be boosted.”

Consider now the design of the pronunciation lexicon, from now on referred to simply as the dictionary. In a scenario where the dictionary is built for a target language but includes words borrowed from other languages, it makes sense, from a human point of view, for the pronunciations that use the phone set of the target language to appear first. But if the dictionary also includes a pronunciation using a phone set from the original language (in case the system has to handle speakers who know how to pronounce the word in the original language), then phones from languages other than the target language will never appear in the first alternate pronunciation in the dictionary.

ASR system trainers start by linearly assigning audio frames to Hidden Markov Model (HMM) states. This initial assignment is used by the system to estimate non-flat (non-uniform) HMMs, which, through iteration, become increasingly more specific and accurate.
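The linear (flat-start) frame-to-state assignment described above can be sketched as follows. This is an illustrative toy function of our own, not taken from any specific trainer; the function name and its even-split policy are assumptions.

```python
def linear_segmentation(num_frames, num_states):
    """Flat-start initialization sketch: divide an utterance's frames
    evenly among the HMM states, in order, before any trained model
    exists to produce a better alignment."""
    return [min(i * num_states // num_frames, num_states - 1)
            for i in range(num_frames)]

# 10 frames spread linearly over a 3-state HMM
print(linear_segmentation(10, 3))  # [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Iterating Baum-Welch (or Viterbi) training on top of such an assignment is what gradually replaces this uniform guess with sharper, data-driven alignments.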
For the initial segmentation, however, the trainer has no information that would allow it to choose one particular pronunciation of a word rather than another when a word has multiple pronunciations. Trainers normally choose the first alternate for this initial segmentation. If a phone never appears in the first alternate pronunciation of any word, as in the scenario above, the trainer will never see it, and the model for that phone will not be initialized. This is a seemingly simple but practical problem that normally causes the trainer to break in subsequent steps.

Data selection can also be applied to the language model. Using all available data is not always the best approach [3], as relevant information changes from one domain to another. In this work, we do not examine data selection for language models; we report on selection for acoustic model training and dictionary organization. The theoretical background for our method is presented in Section 2. In Section 3 we report on our work on automatically reorganizing a dictionary. In Section 4 we present our work on data selection for acoustic model training. We conclude in Section 5.

2. Background and Previous Work

We tackle two problems that, although seemingly different, can both be described as selecting data conforming to a predefined phone distribution or, more generally, n-gram distribution. The first problem concerns the organization of a dictionary containing words with multiple pronunciations. The initial step of the trainer uses the first of these alternates, referred to as the first pronunciation in the remainder of this paper. Our goal is to reorder the alternate pronunciations associated with each word so that every phoneme in the dictionary appears at least once in a first pronunciation. Consider the phone distribution PF of the phones appearing in the first pronunciations only. Our goal is then to reorganize the dictionary, i.e.
reorder the alternate pronunciations, so that less frequent phones in PF are favored. In other words, we want PF to approach the uniform distribution.

Copyright 2011 ISCA. INTERSPEECH 2011, 28-31 August 2011, Florence, Italy.
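As an illustration of this criterion, the following Python sketch greedily promotes, for each word, the alternate pronunciation that most reduces the KL divergence D(PF || U) between the first-pronunciation phone distribution PF and the uniform distribution U. This is our own toy construction under stated assumptions, not the authors' published algorithm; the function names and the miniature dictionary are hypothetical.

```python
import math
from collections import Counter

def kl_to_uniform(counts, phone_set):
    """D(P_F || U): KL divergence of the first-pronunciation phone
    distribution from the uniform distribution over phone_set.
    Minimizing it maximizes the entropy of P_F, boosting rare phones."""
    total = sum(counts.values())
    u = 1.0 / len(phone_set)
    d = 0.0
    for ph in phone_set:
        p = counts.get(ph, 0) / total
        if p > 0:  # 0 * log 0 is taken as 0 by convention
            d += p * math.log(p / u)
    return d

def reorder_lexicon(lexicon, phone_set):
    """Greedy pass over the dictionary: for each word, promote the
    alternate whose use as first pronunciation yields the lowest
    D(P_F || U).  lexicon maps word -> list of pronunciations,
    each pronunciation being a list of phones."""
    counts = Counter()
    for prons in lexicon.values():
        counts.update(prons[0])          # current P_F counts
    for word, prons in lexicon.items():
        best_i, best_d = 0, float("inf")
        for i, cand in enumerate(prons):
            trial = counts.copy()
            trial.subtract(prons[0])     # drop the current first alternate
            trial.update(cand)           # and try this candidate instead
            d = kl_to_uniform(trial, phone_set)
            if d < best_d:
                best_i, best_d = i, d
        if best_i != 0:                  # promote the winning alternate
            counts.subtract(prons[0])
            counts.update(prons[best_i])
            prons.insert(0, prons.pop(best_i))
    return lexicon

# Hypothetical toy dictionary: the phone "Z" occurs only in a second
# alternate of "garage", so a trainer using first pronunciations
# would never initialize a model for it.
lexicon = {
    "garage": [["g", "ax", "r", "aa", "jh"], ["g", "ax", "r", "aa", "Z"]],
    "jam": [["jh", "ae", "m"]],
}
phone_set = {ph for prons in lexicon.values() for p in prons for ph in p}
reorder_lexicon(lexicon, phone_set)
print(lexicon["garage"][0])  # the "Z" variant is promoted to first
```

After the pass, every phone in the toy dictionary appears in some first pronunciation, which is exactly the coverage property the trainer's initialization step requires.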