PROSODIC PROMINENCE DETECTION IN SPEECH
Fabio Tamburini
CILTA/DEIS - University of Bologna - Italy
f.tamburini@cilta.unibo.it
ABSTRACT
This paper presents work in progress on the automatic
detection of prosodic prominence in continuous speech.
Prosodic prominence involves two different phonetic
features: pitch accents, connected with fundamental
frequency (F0) movements and overall syllable energy,
and stress, which exhibits a strong correlation with
syllable nuclei duration and high-frequency emphasis. By
measuring these acoustic parameters, it is possible to build
an automatic system capable of identifying prominent
syllables, reaching an agreement with human-tagged data
comparable to the inter-human agreement reported
in the literature. These results were achieved without using
any information apart from acoustic parameters.
1 INTRODUCTION
The study of prosodic phenomena in speech is a central
topic in language investigation. Speakers tend to focus the
listener's attention on the most important parts of the
message, marking them by means of such phenomena. As
outlined in Beckman & Venditti [4], a precise
identification of such phenomena helps to disambiguate
the meaning of some utterances. It is also a fundamental
step for the automatic recognition of spontaneous speech,
and enhances the fluency and adequacy of automatic
speech-generation systems. Moreover, the construction of
large annotated language resources, such as prosodically
tagged speech corpora, is of increasing interest both for
research purposes and for language teaching.
One of the most important prosodic features is
prominence: a word or part of a word made prominent is
perceived as standing out from its environment [23]. A
better understanding of how prominence is physically
accomplished is a basic step in the construction of tools
capable of automatically identifying such phenomena.
This paper presents work in progress on the
construction of a system for the automatic detection of
prosodic prominence features in speech using only
acoustic/phonetic parameters and cues.
Following Beckman's [3] phonological view, further
developed by Bagshaw [1, 2], syllables that are perceived
as prominent either contain a pitch accent or are somehow
"stressed". On the acoustic/phonetic side, the
accomplishment of such features has to be strictly
correlated with acoustic parameters. In addition to the works
already cited, many studies [15, 16, 17] suggest
that some of the main acoustic correlates of
prominence are pitch movements (strictly connected with
fundamental frequency - F0), overall syllable energy,
syllable duration and spectral emphasis.
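As a rough illustration of this two-component view, the following Python sketch treats a syllable as prominent when it appears either pitch-accented or stressed. The class and function names, the averaging of the two cues in each component, and the threshold values are all assumptions introduced for illustration; they are not the detection rule developed in this paper.

```python
from dataclasses import dataclass


@dataclass
class SyllableCues:
    """Hypothetical per-syllable cues, each assumed already normalised."""
    f0_movement: float        # pitch movement (pitch-accent cue)
    overall_energy: float     # overall syllable energy (pitch-accent cue)
    nucleus_duration: float   # syllable-nucleus duration (stress cue)
    spectral_emphasis: float  # high-frequency emphasis (stress cue)


def is_prominent(cues, accent_threshold=1.0, stress_threshold=1.0):
    """Flag a syllable as prominent if it looks pitch-accented OR stressed.

    Averaging the two cues of each component and comparing the result
    with a fixed threshold is an illustrative assumption only.
    """
    accent_score = (cues.f0_movement + cues.overall_energy) / 2.0
    stress_score = (cues.nucleus_duration + cues.spectral_emphasis) / 2.0
    return accent_score > accent_threshold or stress_score > stress_threshold


# Made-up cue values for three syllables.
accented = SyllableCues(1.8, 1.4, 0.1, 0.0)
stressed = SyllableCues(0.0, 0.2, 1.6, 1.5)
plain = SyllableCues(0.1, 0.0, -0.3, 0.2)
```

The logical "or" mirrors the phonological statement above; how the cues within each component should actually be weighted is left open here.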
The work presented here is divided into two separate
steps: the first step involves the automatic identification of
syllable-nuclei boundaries to reliably measure the duration
feature, while the second one concerns the identification of
prominent syllables by means of acoustic measurements.
This paper will report on the first experiments conducted
on the whole system.
The data set used in these experiments is a subset of the
DARPA/TIMIT acoustic-phonetic continuous speech
corpus, consisting of thousands of transcribed, phone-
segmented and aligned sentences of American English. In
this study the TIMIT annotations are used only for
measuring system performance, not for prominence
detection.
Several studies in this field have aimed at building
automatic systems capable of reliably identifying
either one acoustic correlate of prominence [5, 7] or a
complete set of prosodic parameters [2, 6, 24]. The latter
studies, which aim at a complete prosody identification
system, rely on additional phonetic
information such as phone labelling and/or utterance
transcriptions.
Despite the quantity and quality of studies on this
topic, it seems that the automatic and reliable detection of
prosodic prominence, without relying on phonetic
information, is still an open question.
2 THE ACOUSTIC PARAMETERS
In the following subsections, each acoustic parameter
involved in this study is considered. All acoustic
parameters must be normalised to some extent to compensate
for the natural variation among different speakers. The specific
normalisation procedures applied to each parameter will
be described.
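One common way to remove speaker-dependent offsets and ranges is z-score normalisation, sketched below in Python. It is given only as a generic illustration of what normalising a parameter can mean; it is not necessarily the procedure adopted for any parameter in this study. The function name and the example durations are hypothetical.

```python
from statistics import mean, pstdev


def zscore_normalise(values):
    """Rescale one speaker's measurements to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # All values identical: no variation left to express.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up nucleus durations (ms) for a slow and a fast speaker who
# produce the same relative pattern at different absolute rates.
slow_speaker = [80.0, 120.0, 100.0]
fast_speaker = [40.0, 60.0, 50.0]

slow_norm = zscore_normalise(slow_speaker)
fast_norm = zscore_normalise(fast_speaker)
```

After normalisation the two speakers yield the same values, so a single decision threshold can be applied to both.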
2.1 Duration
The linguistic theories of prosodic prominence listed
above tend to consider syllable duration as one of the
fundamental acoustic parameters for detecting syllable
stress. Unfortunately, the automatic segmentation of an
utterance into syllables is a complex task; a survey of
syllable segmentation algorithms can be found in [9]. None of
these methods seems to perform well when applied to
continuous speech. For these reasons, an alternative
duration measure for prosodic prominence detection
should be introduced.
One possible measure seems to be the duration of the
syllable nucleus. Considering some utterances taken from
the TIMIT corpus and comparing the duration of the
syllable nucleus with the duration of the entire syllable,
with respect to prominence, and approximating the
0-7803-7946-2/03/$17.00 ©2003 IEEE. ISSPA 2003