IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 11, NO. 4, JULY 2003 321
Prosodic and Accentual Information
for Automatic Speech Recognition
Diego H. Milone, Student Member, IEEE, and Antonio J. Rubio, Senior Member, IEEE
Abstract—Various aspects relating to the human production
and perception of speech have gradually been incorporated into
automatic speech recognition systems. Nevertheless, the prosodic
features of speech have not yet been used explicitly in the
recognition process itself. This study presents an analysis
of prosody’s three most important parameters, namely energy,
fundamental frequency and duration, together with a method for
incorporating this information into automatic speech recognition.
On the basis of a preliminary analysis, a design is proposed
for a prosodic feature classifier in which these parameters are
associated with orthographic accentuation. Prosodic-accentual
features are incorporated into a hidden Markov model recognizer;
their theoretical formulation and experimental setup are then
presented. Several experiments were conducted to show how the
method performs with a Spanish continuous-speech database.
Applying this approach to other subsets of the database, we obtained
a relative word recognition error reduction of 28.91%.
Index Terms—Accentuation, continuous speech recognition, lan-
guage models, prosody.
I. INTRODUCTION
CONSIDERABLE progress has been made in automatic
speech recognition (ASR) technology over the last 20
years. The incorporation of hidden Markov models (HMM)
in ASR in the 1980s led to very high levels of performance,
mainly thanks to the way this technique enables the time vari-
ability of speech to be modeled. Research using various HMM
paradigms has resulted in the incorporation of a number of
features that attempt to model human perception. Research has
been carried out in the modeling of speech as it relates to the
recognition of phonemes, isolated words, connected words and
continuous speech (CSR) [52]. In the last few years, techniques
such as context dependent phoneme modeling (triphones) and
language modeling have been incorporated [23], [47].
In recent years, acoustic models have evolved from vector
quantization-based models to continuous observation density
hidden Markov models (CHMM). In vector quantization sys-
tems, acoustic features are modeled as chains composed of a fi-
nite set of discrete elements from the vector quantizer. This gave
rise to discrete HMM [51]. However, in CHMM it is possible to
use continuous observation densities instead of vector quantization,
thus taking advantage of modeling selected speech features through
Gaussian mixtures [31]. In the field of acoustic modeling, several
speech parameterization techniques may be applied, and important
advances have been made, such as linear predictive coding [50],
cepstral coefficients, and mel-cepstral coefficients with delta and
acceleration coefficients [18].

Manuscript received June 2, 2000; revised January 20, 2003. This work was
supported by the National University of Entre Ríos (UNER-PID#6036&6062),
the National University of Litoral (UNL-FOMEC#542), and the University of
Granada.
D. H. Milone is with the Faculty of Engineering, UNER, and the Faculty of
Engineering and Hydric Sciences, UNL, Cybernetics Laboratory, CP 3101, Oro
Verde, Argentina (e-mail: d.milone@ieee.org).
A. J. Rubio is with the Department of Electronics and Computer Technology,
UGR, Spain (e-mail: rubio@ugr.es).
Digital Object Identifier 10.1109/TSA.2003.814368
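In a CHMM, each state emits feature vectors according to a continuous density that is typically a mixture of Gaussians, rather than a discrete codebook symbol. As an illustration only (not the recognizer described in this paper), the following sketch evaluates such an emission log-density, assuming diagonal covariances and hypothetical parameter values:

```python
import numpy as np

def log_gmm_density(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance
    Gaussian mixture; uses the log-sum-exp trick for numerical stability."""
    x = np.asarray(x, dtype=float)
    d = x.size
    # Per-component terms: log w_m + log N(x; mu_m, diag(var_m))
    log_comp = (
        np.log(weights)
        - 0.5 * d * np.log(2.0 * np.pi)
        - 0.5 * np.sum(np.log(variances), axis=1)
        - 0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    )
    m = np.max(log_comp)
    return m + np.log(np.sum(np.exp(log_comp - m)))

# Hypothetical two-component mixture over three-dimensional features
w = np.array([0.6, 0.4])
mu = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
var = np.ones((2, 3))
ll = log_gmm_density([0.5, 0.5, 0.5], w, mu, var)
```

In a full recognizer each HMM state holds its own mixture parameters, estimated by Baum-Welch training; a semi-continuous (tied-mixture) system would instead share one Gaussian codebook across all states and estimate only the per-state weights.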
Computational optimization and its associated algorithms constitute
another field in which considerable progress has been made.
Semi-continuous HMM (SCHMM), also termed tied-mixture models,
are an example of techniques aimed at computational efficiency.
Other innovations worth mentioning are those related to speaker
adaptation of estimated parameters and to robust speech
recognition [25].
Neural networks constitute another important technique that
has been successfully applied in some aspects of ASR. In this
area, the pioneering work on self-organizing maps [26] and
time-delay neural networks [66] should be cited.
Although exhaustive research has been carried out in the
field of ASR, computers are still far from attaining the recog-
nition capabilities of human beings [32]. One of the fields in
which meaningful improvements have not yet been made is the
incorporation of prosodic features into the recognition process.
In contrast, we find that prosody is given fundamental impor-
tance in text-to-speech (TTS) systems [55]. These analyses
and proposed models provide important data about the natural
way in which human beings use prosody in spoken discourse.
Basically, in the case of TTS systems, prosody gives the nat-
uralness sought in the synthesized speech [61]. Furthermore,
some very interesting experiments have studied human speech
recognition abilities in different prosodic conditions [21] (also
[6] for infants, [27] for spontaneous speech and [35] for di-
alogue/monologue). A typical situation, encountered daily, is
the difficulty in recognizing speech affected by regional ac-
cents [2]. This has been studied in the context of ASR in [22].
Speaker identification is another case in which the use of prosodic
information in spoken language can be observed; see, for
example, [56].
In addition, it is important to note that prosodic modifica-
tions in the utterance evidently induce a considerable mod-
ification of other parameters that are explicitly modeled in
current recognizers. For example, it can clearly be seen that
the spectral characteristics of vowels are modified to a sig-
nificant degree when the intonation changes. Furthermore, the
duration of phonemes (mainly vowels) undergoes a notable
variation depending on the semantic, syntactic and even ortho-
graphic characteristics transmitted in spontaneous speech [4],
[13]. Considerable improvements in recognition performance
can be made by simply taking into account the speaking rate
1063-6676/03$17.00 © 2003 IEEE