AUTOMATIC DETECTION AND CORRECTION OF SYNTAX-BASED PROSODY ANNOTATION ERRORS

Sandrine Brognaux 1, Thomas Drugman 2, Richard Beaufort 3
1 CENTAL and ICTEAM - Université catholique de Louvain, Belgium
2 TCTS Lab - Université de Mons, Belgium
3 Nuance Communications, Inc. *

ABSTRACT

Both unit-selection and HMM-based speech synthesis require large annotated speech corpora. To generate more natural speech, considering the prosodic nature of each phoneme of the corpus is crucial. Phonemes are generally assigned labels that should reflect their suprasegmental characteristics. These labels often result from an automatic syntactic analysis, without checking the acoustic realization of the phoneme in the corpus. This leads to numerous errors, because syntax and prosody do not always coincide. This paper proposes a method that uses acoustic information to reduce the number of labeling errors. It is applicable as a post-process to any syntax-driven prosody labeling: acoustic features are used to check the syntax-based labels and to suggest potential modifications. The proposed technique has the advantage of not requiring a manually prosody-labeled corpus. An evaluation on a French corpus shows that more than 75% of the errors detected by the method are actual errors which must be corrected.

Index Terms: Prosody, Speech Synthesis, Annotation, Corpus

1. INTRODUCTION

Large speech corpora play a key role in both unit-selection and HMM-based speech synthesis. These corpora have to be annotated to provide information about the nature of each unit, usually phonemes or diphones. This information includes the position of the unit in the current sentence, word or syllable, but also the part of speech of the carrier word, the structure of the carrier syllable, etc. The suprasegmental realization of the unit (duration, fundamental frequency and energy) also has to be considered. Several strategies have been proposed for this purpose.
The most straightforward way is to label each unit with its exact values of fundamental frequency (F0), duration and energy. This, however, implies that precise values must also be predicted for each phoneme of any new sentence to synthesize. This prediction is particularly difficult to make, as it should correspond to a natural realization and be close to existing units in the database. Metrics must also be developed to compute a distance between the predicted values and the values available in the database, taking into account the relative relevance of each acoustic parameter, which is especially challenging.

To alleviate this problem, various annotation schemes have been proposed. In [1], it is proposed to automatically cluster similar units. These clusters can be accessed with a decision tree based on linguistic and acoustic criteria. Metrics to define the distance between target and database units are easily computed as the distance between a unit and the centroid of its node. At synthesis time, however, some acoustic values still need to be predicted to browse the tree.

To avoid that prediction, the use of symbolic information such as 'tones', i.e. prosodic labels, was introduced by Campbell [2]. Various acoustic values are gathered into the same tone, which should reflect a general prosodic realization. This reduces the number of possible values to predict, while smoothing acoustic variations that might be related to a similar prosodic function. The tones are usually predicted on the basis of the syntactic structure of the sentence, as described below. In that respect, several symbolic labeling schemes can be exploited: Mertens' tones [3] indicate phrase boundaries, while ToBI [4] also assigns labels for prominence. Phrases can be defined as syntactic groups which could be isolated by means of pauses.

* The study was carried out while Richard Beaufort was still working at the CENTAL (Université catholique de Louvain, Belgium).
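As an illustration of the target-cost metric discussed earlier in this section, here is a minimal sketch (not from the paper) of a weighted distance over exact acoustic values. The feature names and weight values are hypothetical assumptions standing in for the "relative relevance of each acoustic parameter".

```python
import math

# Hypothetical weights expressing the relative relevance of each acoustic
# parameter (illustrative values only, not taken from any cited system).
WEIGHTS = {"f0_hz": 0.5, "duration_ms": 0.3, "energy_db": 0.2}

def prosodic_distance(target, candidate, weights=WEIGHTS):
    """Weighted Euclidean distance between two acoustic feature dicts."""
    return math.sqrt(sum(
        w * (target[k] - candidate[k]) ** 2 for k, w in weights.items()
    ))

# Example: select the database unit closest to the predicted target.
target = {"f0_hz": 180.0, "duration_ms": 95.0, "energy_db": 62.0}
database = [
    {"f0_hz": 175.0, "duration_ms": 100.0, "energy_db": 60.0},
    {"f0_hz": 220.0, "duration_ms": 70.0, "energy_db": 66.0},
]
best = min(database, key=lambda u: prosodic_distance(target, u))
```

The sketch makes the paper's point concrete: the weights must be chosen by hand (or tuned), and prediction quality hinges on producing target values close to what the database actually contains.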
In contrast to [2], other speech synthesizers, like the LiONs system presented in [5], do not rely on any prosodic label. The latter approach makes use of purely linguistic context only, such as the position of the word in the sentence and its part of speech. It avoids the errors made by tone prediction and assumes that syntactic parameters are sufficient to characterize the prosodic realization.

One of the goals of both [2] and [5] is to identify the location of phrase-end boundaries, because they are often associated with major prosodic movements. To determine their position, punctuation is very helpful. However, phrases are not always delimited by punctuation marks. Most systems then rely on a heuristic segmentation called "chinks & chunks" [6] to identify phrases. It is based on a very simple rule which roughly assumes that there should be a boundary between a group of content words (chunks) and a group of function words (chinks). Here is an example of the application of this rule:

[There are several important changes][in the way][the quantifier rules][will work][for the remainder][of the course].

In this example, the last syllable of changes, way, rules, work, remainder and course will be assigned a specific prosodic label which should correspond to a specific prosodic realization. The problem is that applying this rule without any verification of the acoustic realization in the corpus is prone to errors. A first issue is that the simplistic principle of the "chinks & chunks" method does not allow it to take long-distance dependencies, semantic information or specific prosodic rules into account. Among these rules is "stress collision" [7], i.e. the fact that two consecutive stresses can only occur if they are divided by a boundary of higher rank. This boundary is usually marked by a melodic movement and syllabic lengthening and, possibly, the insertion of a pause between the two stresses. The "chinks &
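The "chinks & chunks" rule described above can be sketched as follows. The hard-coded function-word list is an illustrative assumption (a real system would use a POS tagger or lexicon); with it, the segmentation reproduces the bracketing of the example sentence.

```python
# Sketch of the "chinks & chunks" heuristic [6]: a phrase boundary is
# placed wherever a group of function words (chink) follows a group of
# content words (chunk). Function-word list is illustrative only.
FUNCTION_WORDS = {"there", "are", "in", "the", "will", "for", "of"}

def is_chink(word):
    return word.lower() in FUNCTION_WORDS

def chink_chunk_phrases(words):
    """Split a word list into phrases, each consisting of a (possibly
    empty) run of function words followed by a run of content words."""
    phrases, current, in_chunk = [], [], False
    for w in words:
        if is_chink(w) and in_chunk:
            # A chink right after a chunk opens a new phrase.
            phrases.append(current)
            current, in_chunk = [], False
        current.append(w)
        if not is_chink(w):
            in_chunk = True
    if current:
        phrases.append(current)
    return phrases

sentence = ("There are several important changes in the way the "
            "quantifier rules will work for the remainder of the course")
print(chink_chunk_phrases(sentence.split()))
```

Note that the heuristic looks only one word at a time, which is exactly why it cannot capture the long-distance dependencies or prosodic rules (such as stress collision) mentioned above.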