ITERATIVE FILTERING OF PHONETIC TRANSCRIPTIONS OF PROPER NOUNS

Antoine Laurent †§, Teva Merlin, Sylvain Meignier, Yannick Estève, Paul Deléglise

LIUM (Computer Science Research Center – Université du Maine) – Le Mans, France
§ Spécinov – Trélazé, France
first.last@lium.univ-lemans.fr, a.laurent@specinov.fr

ABSTRACT

This paper presents an approach to enhancing automatic phonetic transcription of proper nouns, using an iterative filter to retain only the most relevant part of a large set of phonetic variants obtained by combining rule-based generation with extraction from actual audio signals. Using this technique, we were able to reduce the error rate affecting proper nouns during automatic speech transcription of the ESTER corpus of French broadcast news. The role of the filtering is to ensure that the new phonetic variants of proper nouns do not induce new errors in the transcription of the rest of the words.

Index Terms: Speech recognition, Phonetic transcription, Proper nouns

1. INTRODUCTION

This work focuses on an approach to enhancing automatic phonetic transcription of proper nouns. Proper nouns constitute a special case when it comes to phonetic transcription (at least in French, the language used for this study). Indeed, there is much less predictability in how proper nouns may be pronounced than for regular words. This is partly because, in French, pronunciation rules are much less normalized for proper nouns than for other categories of words: a given sequence of letters is not guaranteed to be pronounced the same way in two different proper nouns. The lack of predictability also stems from the wide array of origins proper nouns can have: the more foreign the origin, the less predictable the pronunciation, with variations covering the whole range from the correct pronunciation in the original language to a Frenchified interpretation of the spelling.
The high variability induced by this low predictability is a source of difficulty for automatic speech recognition (ASR) systems when they have to deal with proper nouns. For an ASR system, being confronted with a proper noun pronounced using a phonetic variant very remote from any variant present in its dictionary is similar to encountering an unknown word, if the language model cannot compensate for the acoustic gap. Such errors can have a strong impact on the word error rate (WER): according to [1], the recognition error on an out-of-vocabulary word propagates through the language model to the surrounding words, causing a WER of about 50 % within a window of 5 words to the left and to the right (again, in French). This highlights that the influence of the quality of the phonetic dictionary of proper nouns extends beyond the recognition of the proper nouns themselves. This is particularly true for applications where proper nouns are frequently encountered, such as transcription of broadcast news. However, aside from its potential impact on WER, accurate recognition of proper nouns can also be very important, independently of the frequency of their occurrence, in other contexts such as automatic indexing of multimedia documents or transcription of meetings.

Two common approaches to automatic phonetic transcription have been proposed in the literature: the rule-based approach [2], and statistical approaches, including classification trees [3] and HMM-decoding-based methods [4, 5]. For the specific case of proper nouns, a study on dynamic generation of plausible distortions of canonical forms of proper nouns was presented in [6].

We propose a method to build a dictionary of phonetic transcriptions of proper nouns by using an iterative filter to retain the most relevant part of a large set of phonetic variants, obtained by combining rule-based generation with extraction from actual audio signals.
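The selection loop behind this method can be sketched as follows. This is a toy sketch, not the paper's implementation: the pruning criteria (a variant must actually be used and must not cause confusion with other words) paraphrase the filter described in this paper, while `Stats`, `iterative_filter`, and `toy_score` are invented names, and `toy_score` merely stands in for re-aligning the annotated training audio with the current dictionary.

```python
from collections import namedtuple

# Per-variant statistics from the (hypothetical) scoring step: how often the
# variant was actually used when aligning the training audio, and how often
# it triggered confusion with other words.
Stats = namedtuple("Stats", ["uses", "confusions"])

def iterative_filter(variants, score, max_iters=10):
    """Iteratively prune a proper-noun pronunciation dictionary (sketch).

    variants: dict mapping each proper noun to a set of candidate
    pronunciations. score: callable returning a {(word, variant): Stats}
    table for the current dictionary.
    """
    for _ in range(max_iters):
        stats = score(variants)
        pruned = {word: {v for v in vs
                         if stats[(word, v)].uses > 0            # drop never-used variants
                         and stats[(word, v)].confusions == 0}   # drop confusion-prone ones
                  for word, vs in variants.items()}
        if pruned == variants:  # fixed point reached: nothing left to remove
            break
        variants = pruned
    return variants

# Invented toy statistics for a single noun with three candidate variants.
def toy_score(variants):
    return {("dupont", "d y p o~"): Stats(uses=12, confusions=0),    # kept
            ("dupont", "d u p o n t"): Stats(uses=0, confusions=0),  # never used
            ("dupont", "d y p o~ t"): Stats(uses=3, confusions=5)}   # confusable

filtered = iterative_filter(
    {"dupont": {"d y p o~", "d u p o n t", "d y p o~ t"}}, toy_score)
print(filtered)  # {'dupont': {'d y p o~'}}
```

Re-scoring after each pruning pass matters because removing one variant changes how the aligner distributes occurrences over the remaining ones; iterating until a fixed point captures that interaction.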
Rule-based generation of phonetic transcriptions is used to ensure that the most "common-sense" pronunciation variants are taken into account. It is combined with automatic extraction of phonetic variants from manually-annotated audio signals to enrich the set of transcriptions with the less predictable variants which actual people use. The iterative filter is then applied in order to reduce noise by invalidating the variants that are deemed irrelevant because they are too rarely used, as well as those found to be too prone to generating confusion with other words.

The intermediate (before filtering) and final sets of phonetic transcriptions were evaluated in terms of Word Error Rate (WER) and Proper Noun Error Rate (PNER), computed over the corpus of French broadcast news from the ESTER evaluation campaign [7].

First, we will present the advantages and drawbacks of the generation and extraction methods. Next, we will explain how we combine them with the iterative filtering. Finally, our results will be presented and commented on.

2. RULE-BASED GENERATION OF PHONETIC TRANSCRIPTIONS

A rule-based phonetic transcription system relies exclusively on the spelling of words to generate the possible corresponding chains of phones. It offers the advantage of providing phonetic variants even for words for which no speech signal is available. In the case of proper nouns, it serves to generate the most "common-sense" variants, i.e. the ones which people would use when they have no prior knowledge of the pronunciation of a particular proper noun.

The rule-based generator we used was LIA PHON [2]. During the ARC B3 evaluation campaign of French automatic phonetizers, 99.3 % of the phonetic transcriptions generated by LIA PHON were correct.
However, [2] reveals that transcription errors were not distributed evenly among the various classes of words: erroneous transcriptions of proper nouns represented 25.6 % of the errors even though proper nouns only represented 5.8 % of the test corpus, reflecting poorer performance by LIA PHON on this class of words.

978-1-4244-2354-5/09/$25.00 ©2009 IEEE    ICASSP 2009
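To make the rule-based idea concrete, a minimal grapheme-to-phoneme converter can be sketched with ordered, longest-match-first rewrite rules. This is purely illustrative: the rule list and phone symbols below are invented for the sketch and are in no way LIA PHON's actual rule set or phone inventory.

```python
# Ordered rewrite rules (grapheme sequence -> phone), longest match first.
# Invented toy rules for a handful of French spellings; not LIA PHON's rules.
RULES = [("eau", "o"), ("ou", "u"), ("ch", "S"), ("in", "e~"), ("an", "a~"),
         ("e", "@"), ("a", "a"), ("i", "i"), ("l", "l"), ("r", "R"),
         ("m", "m"), ("n", "n"), ("t", "t"), ("s", "s"), ("d", "d"),
         ("u", "y")]

def phonetize(word):
    """Greedy left-to-right, longest-match grapheme-to-phoneme conversion."""
    word = word.lower()
    phones, i = [], 0
    while i < len(word):
        for grapheme, phone in sorted(RULES, key=lambda r: -len(r[0])):
            if word.startswith(grapheme, i):
                phones.append(phone)
                i += len(grapheme)
                break
        else:
            i += 1  # no rule matched: skip the letter (e.g. a silent letter)
    return " ".join(phones)

print(phonetize("chateau"))  # S a t o
print(phonetize("marin"))    # m a R e~
```

The sketch also illustrates why such systems struggle with proper nouns: the rules encode the normalized spelling-to-sound conventions of the language, so a noun of foreign origin pronounced outside those conventions can only receive a "Frenchified" reading.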