STOCHASTIC PRONUNCIATION MODELLING FROM HAND-LABELLED PHONETIC CORPORA

M. Riley, W. Byrne, M. Finke, S. Khudanpur, A. Ljolje, J. McDonough, H. Nock, M. Saraclar, C. Wooters, G. Zavaliagkos

AT&T Labs – Research, Florham Park, NJ, USA
Johns Hopkins University, Baltimore, MD, USA
Carnegie-Mellon University, Pittsburgh, PA, USA
Cambridge University Engineering Department, Cambridge, UK
U.S. Department of Defense, Fort Meade, MD, USA
BBN, Cambridge, MA, USA

ABSTRACT

In the early '90s, the availability of the TIMIT read-speech phonetically transcribed corpus led to work at AT&T on the automatic inference of pronunciation variation. This work, briefly summarized here, used stochastic decision trees trained on phonetic and linguistic features, and was applied to the DARPA North American Business News read-speech ASR task.

More recently, the ICSI spontaneous-speech phonetically transcribed corpus was collected at the behest of the 1996 and 1997 LVCSR Summer Workshops held at Johns Hopkins University. A 1997 workshop (WS97) group focused on pronunciation inference from this corpus for application to the DoD Switchboard spontaneous telephone speech ASR task. We describe several approaches taken there. These include (i) one analogous to the AT&T approach, (ii) one, inspired by work at WS96 and CMU, that involved adding pronunciation variants of a sequence of one or more words ('multiwords') in the corpus (with corpus-derived probabilities) into the ASR lexicon, and (iii) a hybrid approach in which a decision-tree model was used to automatically phonetically transcribe a much larger speech corpus than ICSI, after which the multiword approach was used to construct an ASR recognition pronunciation lexicon.

1. INTRODUCTION

Most speech recognition systems rely on pronouncing dictionaries that contain few alternate pronunciations for most words. In natural speech, however, words seldom adhere to their citation forms.
The failure of ASR systems to capture this important source of variability is potentially a significant source of recognition errors, particularly in spontaneous, conversational speech. We report methods used to address this issue, applied to read speech at AT&T [9] and to spontaneous speech at and after WS97, the Fifth LVCSR Summer Workshop, held at Johns Hopkins University, Baltimore, in July–August 1997 [2].

As a first step towards alleviating this common limitation of pronouncing dictionaries, we identify a systematic way of generating alternate pronunciations of words by using phonetically labelled portions of the TIMIT [5] and Switchboard [6] corpora. One viewpoint we explore is that pronunciation variability may be modelled by a statistical mapping from canonical pronunciations (baseforms) to symbolic surface forms, and we use decision trees to capture this mapping. A second way we exploit the hand transcriptions is by enhancing the dictionary with frequently seen pronunciations. While the former has the potential to generalize to unseen words and pronunciations, the latter is more conservative and hence potentially more robust.

As many researchers have observed, simply adding several alternate pronunciations to the dictionary increases the confusability of words to the extent that the gains from having them are often more than nullified. We address this problem in two ways. First, we assign costs to alternate pronunciations so that, e.g., if a frequent pronunciation of "cause" and an infrequent pronunciation of "because" are identical, a penalty is incurred for attributing the pronunciation to "because" rather than "cause." Second, we account for context effects so that, e.g., "to" is allowed the pronunciation [ax], which is a frequent pronunciation of "a," only if "to" is preceded by "going," as in [g aa n ax].

Our pronunciation modelling efforts may be divided into two broad categories.
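The first of the two remedies, assigning costs to alternate pronunciations, can be sketched as converting per-word variant counts into negative log probabilities, so that a rare variant of one word is penalized relative to an identical, frequent variant of another. The following minimal Python sketch uses entirely hypothetical counts for illustration; the paper's actual cost estimation is not specified here.

```python
import math

# Hypothetical counts of surface pronunciations observed for each word
# in a hand-labelled corpus (all words, phones, and counts illustrative).
pron_counts = {
    "cause":   {("k", "ah", "z"): 90, ("k", "ao", "z"): 10},
    "because": {("b", "iy", "k", "ah", "z"): 80,
                ("b", "ax", "k", "ah", "z"): 15,
                ("k", "ah", "z"): 5},
}

def pronunciation_costs(counts):
    """Map each (word, pronunciation) pair to its negative log relative
    frequency, so infrequent variants carry a larger cost."""
    costs = {}
    for word, variants in counts.items():
        total = sum(variants.values())
        for pron, n in variants.items():
            costs[(word, pron)] = -math.log(n / total)
    return costs

costs = pronunciation_costs(pron_counts)
# The shared surface form [k ah z] is cheap for "cause" (frequent variant)
# but expensive for "because" (infrequent variant):
assert costs[("cause", ("k", "ah", "z"))] < costs[("because", ("k", "ah", "z"))]
```

Under this scheme, a decoder hypothesizing [k ah z] pays a small penalty to read it as "cause" and a large one to read it as "because", which is the desired disambiguation.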
In our tree-based dictionary expansion experiments, we apply decision-tree pronunciation models to entries in our baseform dictionary to obtain alternate pronunciations, which are then used in testing. In our explicit dictionary expansion experiments, we apply the decision-tree pronunciation models first to the training corpus, and perform a forced alignment with the acoustic models to "choose" amongst the alternatives. The dictionary is then explicitly augmented with novel pronunciations that occur sufficiently often. The tree-based expansion implicitly adds many more new pronunciations than the explicit expansion; however, it does not attempt to model any cross-word coarticulation. The explicit expansion does so by allowing as dictionary entries a select set (cf. [4]) of multiwords: word pairs and triples.

We demonstrate in Sections 2 and 3 that the tree-based method gives a reduction in word error rate (WER) for the read-speech North American Business (NAB) News task, while both methods give reductions for the conversational telephone speech Switchboard task over baseline systems using only a citation-form dictionary. Further, we show in Sections 4 and 5 that these reductions persist when the baseline systems are improved by coarticulation-sensitive acoustic modelling and improved language modelling.
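The final step of the explicit expansion, keeping only those forced-alignment pronunciation choices that occur sufficiently often, can be sketched as a simple frequency threshold over (word, pronunciation) pairs. All entries and the threshold below are hypothetical; the paper's actual selection criterion is only described as "sufficiently often".

```python
from collections import Counter

# Hypothetical (entry, chosen_pronunciation) pairs produced by forced
# alignment of tree-generated alternatives against the acoustics.
# "going_to" stands in for a multiword dictionary entry.
aligned = [
    ("going_to", ("g", "aa", "n", "ax")),
    ("going_to", ("g", "aa", "n", "ax")),
    ("going_to", ("g", "ow", "ih", "ng", "t", "uw")),
    ("and", ("ae", "n", "d")),
    ("and", ("ax", "n")),
    ("and", ("ax", "n")),
    ("and", ("ax", "n")),
]

def expand_dictionary(pairs, min_count=2):
    """Keep only pronunciations seen at least `min_count` times,
    mirroring the 'sufficiently often' criterion for augmenting
    the recognition dictionary with novel variants."""
    freq = Counter(pairs)
    lexicon = {}
    for (entry, pron), n in freq.items():
        if n >= min_count:
            lexicon.setdefault(entry, []).append(pron)
    return lexicon

lexicon = expand_dictionary(aligned)
```

With the toy data above, the single occurrences of [g ow ih ng t uw] and [ae n d] fall below the threshold and are excluded, while the repeated reduced forms survive into the expanded dictionary.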