ACOUSTIC-SYNTACTIC MAXIMUM ENTROPY MODEL FOR AUTOMATIC PROSODY
LABELING
Vivek Rangarajan, Shrikanth Narayanan
Speech Analysis and Interpretation Laboratory
University of Southern California
Viterbi School of Electrical Engineering
vrangara@usc.edu, shri@sipi.usc.edu
Srinivas Bangalore
AT&T Research Labs
180 Park Avenue
Florham Park, NJ 07932, U.S.A.
srini@research.att.com
ABSTRACT
In this paper we describe an automatic prosody labeling framework
that exploits both language and speech information, with the aim of
incorporating prosody within a speech-to-speech translation framework.
We propose a maximum entropy syntactic-prosodic model that achieves
accuracies of 85.22% and 91.54% for pitch accent and boundary tone
labeling, respectively, on the Boston University Radio News corpus.
We model the acoustic-prosodic stream
with two different models, one a maximum entropy model and the
other a traditional HMM. We finally couple the syntactic-prosodic
and acoustic-prosodic components to achieve a pitch accent and
boundary tone classification accuracy of 86.01% and 93.09% re-
spectively.
1. INTRODUCTION
Prosody refers to the rhythm and intonation patterns of spoken
language that convey meaningful information beyond the orthographic
transcription. These patterns are also referred to as suprasegmentals [1],
and they convey both linguistic and paralinguistic information such as the
emphasis, intent, attitude and emotion of a speaker. Acoustic correlates of
duration, intensity and pitch, such as syllable duration, short-time energy
and fundamental frequency (f0), are perceived to confer prosodic prominence
or stress in English. However, these prosodic cues cannot be quantified in
an absolute manner and are highly relative to individual speaker style,
gender, dialect and other phonological factors. The difficulty of reliably
characterizing the suprasegmental information present in the speech signal
has resulted in prosodic labeling standards such as ToBI for
American English [2].
Automatic recognition and identification of prosodic events is
vital in text-to-speech (TTS) synthesis [3], speech understanding
[4], speech recognition [5] and speech-to-speech translation [6, 7]
applications. While automatic prosody labeling has been actively
pursued over the last several years (see Sec. 2), one source of re-
newed interest has come from recent spoken language translation
applications. The work described in this paper is motivated by the
desire to incorporate prosody within a speech-to-speech transla-
tion framework. Typically, state-of-the-art speech translation sys-
tems have a source language recognizer followed by a translator.
The translated text is then synthesized in the target language with
prosody predicted from text. In this process, the prosodic informa-
tion present in the source signal is lost during translation. How-
ever, with reliable prosody labeling in the source language, the
prosody can be transferred to the target language (e.g., English-to-Spanish),
and the predicted prosody can be used by a TTS system to synthesize
speech with appropriate prosody. A prerequisite for such applications
is accurate prosody labeling, the topic of the present work.
In this paper, we describe the first phase of our work that en-
tails building an automatic prosody labeler for the source language
(English in our case). We use the Boston University (BU) Radio
Speech Corpus [8], one of several publicly available speech cor-
pora with manual ToBI annotations intended for experiments in au-
tomatic prosody labeling. We condition prosody not only on word
strings and their parts-of-speech but also on richer syntactic infor-
mation encapsulated in the form of Supertags [9]. We propose a
maximum entropy modeling framework for the syntactic features.
We model the acoustic-prosodic stream with two different models,
a maximum entropy model and a more traditional hidden Markov
model (HMM). In an automatic prosody labeling task, one is es-
sentially trying to predict the correct prosody label sequence for a
given utterance and a maximum entropy model offers an elegant
solution to this learning problem. The framework is also robust in
the selection of discriminative features for the classification problem.
So, given a word sequence $W = \{w_1, \cdots, w_n\}$ and a set of
acoustic-prosodic features $A = \{o_1, \cdots, o_T\}$, the best prosodic
label sequence $L^* = \{l_1, l_2, \cdots, l_n\}$ is obtained as follows,
\begin{align}
L^* &= \arg\max_{L} P(L \mid A, W) \tag{1} \\
    &= \arg\max_{L} P(L \mid W) \cdot P(A \mid L, W) \tag{2} \\
    &\approx \arg\max_{L} P(L \mid \Phi(W)) \cdot P(A \mid L, W) \tag{3}
\end{align}
where Φ(W) is the syntactic feature encoding of the word sequence W.
Equation (2) follows from Equation (1) by Bayes' rule, since the
normalizing term P(A|W) does not depend on L. The first term in
Equation (3) corresponds to the probability obtained through our
maximum entropy syntactic model. The second term in Equation (3)
corresponds to the probability of the acoustic data stream, which is
assumed to depend only on the prosodic label sequence and is obtained
through an HMM.
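
The coupling in Equation (3) can be illustrated with a minimal sketch. It is not the implementation used in this work: it assumes, purely for illustration, that labels are chosen independently for each word (so no sequence model is involved), that the label set is the hypothetical pair "accent"/"none", and that the syntactic posteriors and acoustic likelihoods are already available as the made-up numbers shown. The function name `couple` is likewise hypothetical.

```python
import math

# Hypothetical binary label set (assumption, not the paper's inventory).
LABELS = ("accent", "none")


def couple(syntactic_posteriors, acoustic_likelihoods):
    """Per word, pick the label maximizing log P(l | Phi(w)) + log P(a | l).

    Simplifying assumption: each word is decoded independently, so this is
    a word-level analogue of Equation (3), not a sequence decoder.
    """
    best = []
    for post, lik in zip(syntactic_posteriors, acoustic_likelihoods):
        # Work in the log domain so the product becomes a sum.
        scores = {l: math.log(post[l]) + math.log(lik[l]) for l in LABELS}
        best.append(max(scores, key=scores.get))
    return best


# Illustrative values only for a three-word utterance.
syntactic = [
    {"accent": 0.7, "none": 0.3},
    {"accent": 0.4, "none": 0.6},
    {"accent": 0.2, "none": 0.8},
]
acoustic = [
    {"accent": 0.5, "none": 0.5},
    {"accent": 0.9, "none": 0.1},
    {"accent": 0.3, "none": 0.7},
]

labels = couple(syntactic, acoustic)
print(labels)
```

For the second word, the syntactic term alone prefers "none" (0.6 against 0.4), but strong acoustic evidence reverses the decision once the two terms are combined, which is the motivation for coupling the two streams.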
The paper is organized as follows. In section 2 we describe
related work in automatic prosody labeling followed by a descrip-
tion of the data used in our experiments in section 3. We present
prosody prediction results from off-the-shelf synthesizers in sec-
tion 4. Section 5 details our proposed maximum entropy syntactic-
prosodic model for prosody labeling. In section 6, we describe our
acoustic-prosodic model and conclude in section 7 with directions
for future work.
2. RELATED WORK
Automatic prosody labeling has been an active research topic for
over a decade. Wightman and Ostendorf [4] developed a decision-
74 1424408733/06/$20.00 ©2006 IEEE SLT 2006