AUTOMATIC PROSODIC LABELING OF 6 LANGUAGES

Halewijn Vereecken 1, Jean-Pierre Martens 1, Cynthia Grover 2, Justin Fackrell 2 and Bert Van Coile 1,2

1 ELIS, University of Ghent, Sint-Pietersnieuwstraat 41, B-9000 Ghent, Belgium
2 Lernout & Hauspie Speech Products NV, Sint-Krispijnstraat 7, B-8900 Ieper, Belgium

ABSTRACT

This contribution describes a method for the automatic prosodic labeling of multi-lingual speech data. The prosodic labels are word boundary strength and word prominence. The speech signal and its orthographic representation are first transformed to feature vectors comprising acoustic and linguistic features such as pitch, duration, energy, part-of-speech, punctuation, word frequency and stress. Next, the feature vectors are mapped to prosodic labels via a cascade of multi-layer perceptrons. Experiments on 6 different languages demonstrate that combining acoustic with linguistic features yields better performance than is obtainable with acoustic features alone.

1. INTRODUCTION

It is well known that high-quality speech synthesis can only be achieved by incorporating accurate prosodic models to detect prosodic phrase structure, to identify phrasal prominence and to determine phoneme durations. The ultimate goal of a prosody module is to improve the naturalness and, to a lesser extent, the intelligibility of synthesized speech. The prosodic models are often derived from large speech databases which are labeled at both a phonetic and a prosodic level. As manual labeling suffers from some major drawbacks, we aim to use automatically labeled databases for that purpose.

In this paper we deal with the automatic prosodic labeling of multi-lingual speech data. The automatic phonetic segmentation and labeling (annotation) is dealt with elsewhere [7, 8].

The prosodic events we are concerned with are prosodic phrasing and phrasal prominence.
Prosodic phrasing refers to the grouping or separating of words within a sequence of spoken words, and phrasal prominence refers to the relative importance of the words in a prosodic phrase. According to the findings of Portele et al. [4], and of many others before them (see [4] for an overview), both phrasing and prominence are gradual phenomena. The disjuncture or coherence between two words is expressed by means of a prosodic boundary strength (PBS) between 0 and 3: 0 refers to ordinary word boundaries, and the values 1, 2 and 3 refer to weak, intermediate and strong boundaries respectively [2]. Phrasal prominence is labeled by assigning to each word a prominence (PROM) value between 0 and 9, with 0 being used for words which are not at all prominent and 9 for the most prominent words.

In the next section we review two successful approaches to automatic prosodic labeling that have been reported in the literature. Our system, described in section 3, was inspired by these efforts. Basically, the speech signal and its orthography are mapped to a series of acoustic and linguistic features, which are then mapped to prosodic labels using multi-layer perceptrons (MLPs). The acoustic features include pitch, duration and energy on various levels; the linguistic ones are part-of-speech labels, punctuation, word frequency, etc. In section 4, we demonstrate that the linguistic prosodic features are to some extent complementary to the acoustic ones, especially for word prominence. We also show that the prosodic labeling performance is better when the phonetic annotation is done manually, but that the degradation incurred by using an automatic annotation remains sufficiently small.

2. SOME EXISTING SYSTEMS

Often, automatic prosodic labeling is viewed as a standard recognition problem involving first feature extraction and then classification.
The feature vector extraction maps the speech signal and its orthography to a time sequence of feature vectors that are ideally good discriminators of prosodic classes. The goal of the classification component is to map the sequence of feature vectors to a sequence of prosodic labels. If some kind of language model describing acceptable prosodic label sequences is included, an optimization technique like Viterbi decoding is used for finding the most likely prosodic label sequence. This idea is elaborated thoroughly by Wightman and Ostendorf [9]. In their system, intonational labeling is performed at the syllable level, with each syllable being marked as either prominent, carrying a boundary tone, both prominent and carrying a boundary tone, or neither prominent nor carrying a boundary tone. In addition, word boundaries are labeled with a 7-level break index (break index labeling). In essence, feature vectors are mapped to posterior probabilities via decision trees, and these are combined with a Markov model of the prosodic label language. The feature vectors in [9] comprise continuously-valued duration, pitch and energy measures, and some categorical features such as a flag indicating whether or not the word was followed by a breath.

The success of the above approach was further demonstrated in the framework of the German VERBMOBIL project (see e.g. [1]). The aim there was to study different reference labels (syntactic-prosodic labels obtained automatically during text generation, hand-marked syntactic-prosodic labels, or the more perceptual prosodic labels), different feature vectors, different classes to distinguish (e.g. combinations of boundary labels, combinations of accent labels, and combinations of boundary and accent labels), different classifiers (MLPs, Gaussian distribution classifiers, polynomial classifiers), as well as different language models (e.g. a 5-gram language model of the orthographic word chain separated by boundary labels).
Each feature vector was composed of a large number of acoustic features (duration, pitch, energy) and a few simple linguistic features such as a flag indicating whether or not a syllable carries primary lexical stress. Syntactic/semantic features, if used at all, were mostly used in combination with the output of the classifiers.
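
To make the decoding step mentioned above more concrete, the following minimal Python sketch shows how per-unit posterior probabilities can be combined with a first-order Markov model of the label sequence by Viterbi decoding. It is an illustration under stated assumptions, not the implementation of [9] or of the VERBMOBIL systems: the two-label inventory ("N" for no boundary, "B" for boundary), the function name viterbi and all probability values are hypothetical, and the posteriors are used directly as local scores without any normalization by label priors.

```python
import math


def viterbi(posteriors, transitions, initial):
    """Return the most likely label sequence under a first-order Markov model.

    posteriors:  list of dicts, one per unit, label -> P(label | features)
    transitions: dict of dicts, previous label -> label -> P(label | previous)
    initial:     dict, label -> P(label at the first position)
    """
    labels = list(initial)

    def log(p):
        # Work in the log domain; a zero probability becomes -infinity.
        return math.log(p) if p > 0 else float("-inf")

    # best[t][y]: best log score of any path that ends in label y at position t
    best = [{y: log(initial[y]) + log(posteriors[0][y]) for y in labels}]
    back = []  # back[t - 1][y]: best predecessor of label y at position t
    for t in range(1, len(posteriors)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda x: best[-1][x] + log(transitions[x][y]))
            scores[y] = (best[-1][prev] + log(transitions[prev][y])
                         + log(posteriors[t][y]))
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)

    # Trace the best path backwards from the best final label.
    y = max(labels, key=lambda x: best[-1][x])
    path = [y]
    for pointers in reversed(back):
        y = pointers[y]
        path.append(y)
    return path[::-1]


# Toy data (hypothetical): classifier posteriors for four word boundaries.
posteriors = [
    {"N": 0.90, "B": 0.10},
    {"N": 0.20, "B": 0.80},
    {"N": 0.45, "B": 0.55},
    {"N": 0.80, "B": 0.20},
]
# Toy label model (hypothetical): two boundaries in a row are unlikely.
transitions = {
    "N": {"N": 0.70, "B": 0.30},
    "B": {"N": 0.95, "B": 0.05},
}
initial = {"N": 0.8, "B": 0.2}

path = viterbi(posteriors, transitions, initial)
print(path)  # ['N', 'B', 'N', 'N']
```

In this toy example, choosing the most probable label at each position independently would give N, B, B, N. The label model penalizes two consecutive boundaries, so the decoded sequence keeps only the boundary with the stronger evidence.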