ACOUSTIC-SYNTACTIC MAXIMUM ENTROPY MODEL FOR AUTOMATIC PROSODY LABELING

Vivek Rangarajan, Shrikanth Narayanan
Speech Analysis and Interpretation Laboratory
University of Southern California
Viterbi School of Electrical Engineering
vrangara@usc.edu, shri@sipi.usc.edu

Srinivas Bangalore
AT&T Research Labs
180 Park Avenue
Florham Park, NJ 07932, U.S.A.
srini@research.att.com

ABSTRACT

In this paper we describe an automatic prosody labeling framework that exploits both language and speech information and is intended for incorporating prosody within a speech-to-speech translation framework. We propose a maximum entropy syntactic-prosodic model that achieves an accuracy of 85.22% and 91.54% for pitch accent and boundary tone labeling, respectively, on the Boston University Radio News corpus. We model the acoustic-prosodic stream with two different models, one a maximum entropy model and the other a traditional HMM. Finally, we couple the syntactic-prosodic and acoustic-prosodic components to achieve a pitch accent and boundary tone classification accuracy of 86.01% and 93.09%, respectively.

1. INTRODUCTION

Prosody refers to the rhythm and intonation patterns of spoken language that convey meaningful information beyond the orthographic transcription. In this sense, these patterns are also referred to as suprasegmentals [1], which convey both linguistic and paralinguistic information such as the emphasis, intent, attitude and emotion of a speaker. Acoustic correlates of duration, intensity and pitch, such as syllable duration, short-time energy and fundamental frequency (f0), are perceived to confer prosodic prominence or stress in English. However, these prosodic cues cannot be quantified in an absolute manner and are highly relative to individual speaker style, gender, dialect and other phonological factors. The difficulty in reliably characterizing the suprasegmental information present in the speech signal has resulted in prosodic labeling standards like ToBI for American English [2].
Automatic recognition and identification of prosodic events is vital in text-to-speech (TTS) synthesis [3], speech understanding [4], speech recognition [5] and speech-to-speech translation [6, 7] applications. While automatic prosody labeling has been actively pursued over the last several years (see Section 2), one source of renewed interest has come from recent spoken language translation applications. The work described in this paper is motivated by the desire to incorporate prosody within a speech-to-speech translation framework. Typically, state-of-the-art speech translation systems have a source language recognizer followed by a translator. The translated text is then synthesized in the target language with prosody predicted from text. In this process, the prosodic information present in the source signal is lost during translation. However, with reliable prosody labeling in the source language, the prosody can be transferred to the target language (e.g., English-to-Spanish) and the predicted prosody can be used by a TTS system in synthesizing speech with appropriate prosody. A prerequisite for such applications is accurate prosody labeling, the topic of the present work.

In this paper, we describe the first phase of our work, which entails building an automatic prosody labeler for the source language (English in our case). We use the Boston University (BU) Radio Speech Corpus [8], one of several publicly available speech corpora with manual ToBI annotations intended for experiments in automatic prosody labeling. We condition prosody not only on word strings and their parts-of-speech but also on richer syntactic information encapsulated in the form of Supertags [9]. We propose a maximum entropy modeling framework for the syntactic features. We model the acoustic-prosodic stream with two different models, a maximum entropy model and a more traditional hidden Markov model (HMM).
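As a rough illustration of the maximum entropy idea mentioned above, the following Python sketch computes label posteriors from a log-linear model over word, part-of-speech and supertag features. The feature names, the supertag placeholder, the two-way label set and all weights are assumptions made up for this example; they are not taken from the paper or estimated from the BU corpus, and a real model would learn its weights from annotated data.

```python
import math

# Hypothetical weights lambda[(feature, label)]; illustrative values only,
# not estimated from any corpus.
WEIGHTS = {
    ("pos=NN", "accent"): 1.2,
    ("pos=DT", "none"): 1.5,
    ("word=the", "none"): 0.8,
    ("supertag=PLACEHOLDER_NOUN_TAG", "accent"): 0.4,
}
LABELS = ("accent", "none")  # assumed binary pitch accent labels


def maxent_posterior(features, weights=WEIGHTS, labels=LABELS):
    """Return P(label | features) under a log-linear (maximum entropy) model.

    Each active binary feature contributes its weight to the score of the
    label it is paired with; scores are exponentiated and normalized.
    """
    scores = {
        label: sum(weights.get((feat, label), 0.0) for feat in features)
        for label in labels
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    exp_scores = {label: math.exp(s - top) for label, s in scores.items()}
    z = sum(exp_scores.values())
    return {label: v / z for label, v in exp_scores.items()}


noun = maxent_posterior(["word=news", "pos=NN", "supertag=PLACEHOLDER_NOUN_TAG"])
det = maxent_posterior(["word=the", "pos=DT"])
print(max(noun, key=noun.get), max(det, key=det.get))
```

With these made-up weights the noun is scored as more likely accented and the determiner as more likely unaccented; features with no stored weight (such as `word=news`) simply contribute zero to every label.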
In an automatic prosody labeling task, one is essentially trying to predict the correct prosody label sequence for a given utterance, and a maximum entropy model offers an elegant solution to this learning problem. The framework is also robust in the selection of discriminative features for the classification problem. Given a word sequence $W = \{w_1, \cdots, w_n\}$ and a set of acoustic-prosodic features $A = \{o_1, \cdots, o_T\}$, the best prosodic label sequence $L^* = \{l_1, l_2, \cdots, l_n\}$ is obtained as follows:

\begin{align}
L^* &= \arg\max_{L} P(L \mid A, W) \tag{1} \\
    &= \arg\max_{L} P(L \mid W) \cdot P(A \mid L, W) \tag{2} \\
    &\approx \arg\max_{L} P(L \mid \Phi(W)) \cdot P(A \mid L, W) \tag{3}
\end{align}

where $\Phi(W)$ is the syntactic feature encoding of the word sequence $W$. Equation (2) follows from Bayes' rule, since the normalizing term $P(A \mid W)$ does not depend on $L$. The first term in Equation (3) corresponds to the probability obtained through our maximum entropy syntactic model. The second term in Equation (3) corresponds to the probability of the acoustic data stream, which is assumed to depend only on the prosodic label sequence and is obtained through an HMM.

The paper is organized as follows. In Section 2 we describe related work in automatic prosody labeling, followed by a description of the data used in our experiments in Section 3. We present prosody prediction results from off-the-shelf synthesizers in Section 4. Section 5 details our proposed maximum entropy syntactic-prosodic model for prosody labeling. In Section 6, we describe our acoustic-prosodic model, and we conclude in Section 7 with directions for future work.

2. RELATED WORK

Automatic prosody labeling has been an active research topic for over a decade. Wightman and Ostendorf [4] developed a decision-
74 1424408733/06/$20.00 ©2006 IEEE SLT 2006