ITERATIVE FILTERING OF PHONETIC TRANSCRIPTIONS OF PROPER NOUNS
Antoine Laurent†§, Teva Merlin†, Sylvain Meignier†, Yannick Estève†, Paul Deléglise†
† LIUM (Computer Science Research Center – Université du Maine) – Le Mans, France
§ Spécinov – Trélazé, France
first.last@lium.univ-lemans.fr, a.laurent@specinov.fr
ABSTRACT
This paper focuses on an approach to enhancing automatic phonetic
transcription of proper nouns by using an iterative filter to retain only
the most relevant part of a large set of phonetic variants, obtained by
combining rule-based generation with extraction from actual audio
signals. Using this technique, we were able to reduce the error rate
affecting proper nouns during automatic speech transcription of the
ESTER corpus of French broadcast news. The role of the filtering
was to ensure that the new phonetic variants of proper nouns would
not induce new errors in the transcription of the rest of the words.
Index Terms— Speech recognition, Phonetic transcription,
Proper nouns
1. INTRODUCTION
This work focuses on an approach to enhancing automatic phonetic
transcription of proper nouns.
Proper nouns constitute a special case when it comes to phonetic
transcription (at least in French, which was the language used for this
study). Indeed, there is much less predictability in how proper nouns
may be pronounced than for regular words. This is partly due to the
fact that, in French, pronunciation rules are much less normalized for
proper nouns than for other categories of words: a given sequence
of letters is not guaranteed to be pronounced the same way in two
different proper nouns.
The lack of predictability also finds its roots in the wide array
of origins proper nouns can be from: the more foreign the origin,
the less predictable the pronunciation, with variations covering the
whole range from the correct pronunciation in the original language
to a Frenchified interpretation of the spelling.
The high variability induced by this low predictability is a source
of difficulty for automatic speech recognition (ASR) systems when
they have to deal with proper nouns. For an ASR system, being
confronted with a proper noun pronounced using a phonetic variant
very remote from any variant present in its dictionary is a situation
similar to encountering an unknown word, if the language model
cannot compensate for the acoustic gap. Such errors can have a
strong impact on the word error rate (WER): according to [1], the
recognition error on an out-of-vocabulary word propagates through
the language model to the surrounding words, causing a WER of
about 50 % within a window of 5 words to the left and to the right
(again, in French). This highlights that the influence of the quality
of the phonetic dictionary of proper nouns extends farther than just
the recognition of proper nouns themselves. This is particularly true in
the case of applications where proper nouns are frequently encoun-
tered, such as transcription of broadcast news. However, aside from
its potential impact on WER, accurate recognition of proper nouns
can also be very important, independently of the frequency of their occurrence, in other contexts such as automatic indexing of multimedia documents or transcription of meetings.
Two common approaches to the problem of automatic phonetic
transcription were proposed in the literature: the rule-based approach [2] and the statistical approach, including classification trees [3] and HMM-decoding-based methods [4, 5]. For the specific
case of proper nouns, a study on dynamic generation of plausible
distortions of canonical forms of proper nouns was proposed in [6].
We propose a method to build a dictionary of phonetic transcrip-
tions of proper nouns by using an iterative filter to retain the most
relevant part of a large set of phonetic variants, obtained by combin-
ing rule-based generation with extraction from actual audio signals.
Rule-based generation of phonetic transcriptions is used to ensure
that the most “common-sense” pronunciation variants are taken into
account. It is combined with automatic extraction of phonetic vari-
ants from manually-annotated audio signals to enrich the set of tran-
scriptions with those less predictable variants which actual people
use. The iterative filter is then applied in order to reduce noise by in-
validating the variants that are deemed irrelevant because too rarely
used, and the ones that are found to be too prone to generate confu-
sion with other words.
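As a rough illustration of this filtering idea, the sketch below (hypothetical Python, not the authors' implementation) drops candidate variants that are too rarely aligned against audio, or that are homophonous with regular vocabulary words. In the actual system each pass would re-run recognition to recompute usage before the next iteration; here the counts are fixed, so the loop converges after one pass. All function names, data structures, and thresholds are assumptions.

```python
# Hypothetical sketch of iterative filtering of phonetic variants.
# A variant is invalidated if it is rarely used in forced alignments
# against audio, or if it collides with a pronunciation of another word.

def filter_variants(candidates, usage_counts, other_words_phones, min_count=2):
    """candidates: dict mapping proper noun -> set of phone strings.
    usage_counts: dict mapping (noun, variant) -> number of times the
    variant was actually aligned against audio.
    other_words_phones: set of phone strings used by the rest of the
    vocabulary (potential sources of confusion)."""
    current = {noun: set(vs) for noun, vs in candidates.items()}
    changed = True
    while changed:  # iterate until no variant is invalidated
        changed = False
        for noun, variants in current.items():
            kept = set()
            for v in variants:
                rarely_used = usage_counts.get((noun, v), 0) < min_count
                confusable = v in other_words_phones  # homophone of a regular word
                if rarely_used or confusable:
                    changed = True  # variant invalidated on this pass
                else:
                    kept.add(v)
            current[noun] = kept
    return current
```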
The intermediate (before filtering) and final sets of phonetic
transcriptions were evaluated in terms of Word Error Rate (WER)
and Proper Noun Error Rate (PNER), computed over the corpus of
French broadcast news from the ESTER evaluation campaign [7].
First, we will present advantages and drawbacks of the genera-
tion and extraction methods. Next, we will explain how we combine
them with the iterative filtering. Finally, our results will be presented
and commented on.
2. RULE-BASED GENERATION OF PHONETIC
TRANSCRIPTIONS
A rule-based phonetic transcription system relies exclusively on the
spelling of words to generate the possible corresponding chains of
phones. It offers the advantage of providing phonetic variants even
for words for which no speech signal is available. In the case of
proper nouns, it serves to generate the most “common-sense” variants, i.e. the ones that people would use when they have no prior knowledge of the pronunciation of a particular proper noun.
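To make the idea concrete, here is a toy grapheme-to-phone converter (deliberately simplistic, and in no way representative of LIA_PHON): ordered rewrite rules are applied greedily, longest grapheme first. The rule table and phone symbols are invented for this example.

```python
# Toy rule-based phonetizer: ordered grapheme-to-phone rewrite rules,
# applied left to right with longest-match-first. Rules and phone
# symbols are invented for illustration only.

RULES = [  # (grapheme, phone), longest graphemes first
    ("eau", "o"),
    ("ch", "S"),   # French "ch" usually sounds like /S/ (IPA ʃ)
    ("ou", "u"),
    ("a", "a"),
    ("b", "b"),
    ("c", "k"),
    ("e", "@"),
    ("l", "l"),
    ("t", "t"),
]

def phonetize(word):
    """Greedy left-to-right application of the rewrite rules."""
    word = word.lower()
    phones, i = [], 0
    while i < len(word):
        for grapheme, phone in RULES:
            if word.startswith(grapheme, i):
                phones.append(phone)
                i += len(grapheme)
                break
        else:
            i += 1  # silently skip letters with no rule (toy behavior)
    return " ".join(phones)
```

A regular word such as "chateau" comes out as expected, but the same rules phonetize the name "Bach" with the regular French /S/ value of "ch" rather than /k/, which is exactly the kind of failure that motivates enriching the dictionary with variants extracted from audio.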
The rule-based generator we used was LIA PHON [2]. Dur-
ing the ARC B3 evaluation campaign of French automatic phonetiz-
ers, 99.3 % of the phonetic transcriptions generated by LIA PHON
were correct. However, [2] reveals that transcription errors were not
distributed evenly among the various classes of words: erroneous
transcription of proper nouns represented 25.6 % of the errors even
though proper nouns only represented 5.8 % of the test corpus, re-
flecting poorer performance by LIA PHON on this class of words.
978-1-4244-2354-5/09/$25.00 ©2009 IEEE, ICASSP 2009