IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 11, NO. 4, JULY 2003 321
Prosodic and Accentual Information
for Automatic Speech Recognition
Diego H. Milone, Student Member, IEEE, and Antonio J. Rubio, Senior Member, IEEE
Abstract—Various aspects relating to the human production
and perception of speech have gradually been incorporated into
automatic speech recognition systems. Nevertheless, the prosodic
features of speech have not yet been used explicitly in the
recognition process itself. This study presents an analysis
of prosody’s three most important parameters, namely energy,
fundamental frequency and duration, together with a method for
incorporating this information into automatic speech recognition.
On the basis of a preliminary analysis, a design is proposed
for a prosodic feature classifier in which these parameters are
associated with orthographic accentuation. Prosodic-accentual
features are incorporated into a hidden Markov model recognizer;
their theoretical formulation and experimental setup are then
presented. Several experiments were conducted to show how the
method performs with a Spanish continuous-speech database.
Applying this approach to other subsets of the database, we obtained
a relative word recognition error reduction of 28.91%.
Index Terms—Accentuation, continuous speech recognition, lan-
guage models, prosody.
I. INTRODUCTION
CONSIDERABLE progress has been made in automatic
speech recognition (ASR) technology over the last 20
years. The incorporation of hidden Markov models (HMM)
in ASR in the 1980s led to very high levels of performance,
mainly thanks to the way this technique enables the time vari-
ability of speech to be modeled. Research using various HMM
paradigms has resulted in the incorporation of a number of
features that attempt to model human perception. Research has
been carried out in the modeling of speech as it relates to the
recognition of phonemes, isolated words, connected words and
continuous speech (CSR) [52]. In the last few years, techniques
such as context dependent phoneme modeling (triphones) and
language modeling have been incorporated [23], [47].
In recent years, acoustic models have evolved from vector
quantization-based models to continuous observation density
hidden Markov models (CHMM). In vector quantization sys-
tems, acoustic features are modeled as chains composed of a fi-
nite set of discrete elements from the vector quantizer. This gave
rise to discrete HMM [51]. However, in CHMM it is possible to
use continuous observation densities instead of vector quantization,
thus taking advantage of modeling selected speech features through
Gaussian mixtures [31]. In the field of acoustic modeling, several
speech parameterization techniques may be applied, and important
advances have been made, such as linear predictive coding [50],
cepstral coefficients, and mel-cepstral coefficients with delta and
acceleration coefficients [18].

Manuscript received June 2, 2000; revised January 20, 2003. This work was
supported by the National University of Entre Ríos (UNER-PID#6036&6062),
the National University of Litoral (UNL-FOMEC#542), and the University of
Granada.
D. H. Milone is with the Faculty of Engineering, UNER, and the Faculty of
Engineering and Hydric Sciences, UNL, Cybernetics Laboratory, CP 3101, Oro
Verde, Argentina (e-mail: d.milone@ieee.org).
A. J. Rubio is with the Department of Electronics and Computer Technology,
UGR, Spain (e-mail: rubio@ugr.es).
Digital Object Identifier 10.1109/TSA.2003.814368
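In a CHMM, each state emits feature vectors according to a continuous density that is typically a mixture of Gaussians, rather than a discrete codebook symbol. As an illustration only (not the recognizer described in this paper), the following sketch evaluates such an emission log-density, assuming diagonal covariances and hypothetical parameter values:

```python
import numpy as np

def log_gmm_density(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance
    Gaussian mixture; uses the log-sum-exp trick for numerical stability."""
    x = np.asarray(x, dtype=float)
    d = x.size
    # Per-component terms: log w_m + log N(x; mu_m, diag(var_m))
    log_comp = (
        np.log(weights)
        - 0.5 * d * np.log(2.0 * np.pi)
        - 0.5 * np.sum(np.log(variances), axis=1)
        - 0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    )
    m = np.max(log_comp)
    return m + np.log(np.sum(np.exp(log_comp - m)))

# Hypothetical two-component mixture over three-dimensional features
w = np.array([0.6, 0.4])
mu = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
var = np.ones((2, 3))
ll = log_gmm_density([0.5, 0.5, 0.5], w, mu, var)
```

In a full recognizer each HMM state holds its own mixture parameters, estimated by Baum-Welch training; a semi-continuous (tied-mixture) system would instead share one Gaussian codebook across all states and estimate only the per-state weights.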
Computational optimization and its associated algorithms constitute
another field in which considerable progress has been made.
Semi-continuous HMM (SCHMM), also termed tied-mixture models,
are an example of techniques aimed at computational efficiency.
Other innovations worth mentioning are those related to speaker
adaptation of estimated parameters and to robust speech
recognition [25].
Neural networks constitute another important technique that
has been successfully applied in some aspects of ASR. In this
area, the pioneering work on self-organizing maps [26] and
time-delay neural networks [66] should be cited.
Although exhaustive research has been carried out in the
field of ASR, computers are still far from attaining the recog-
nition capabilities of human beings [32]. One of the fields in
which meaningful improvements have not yet been made is the
incorporation of prosodic features into the recognition process.
In contrast, we find that prosody is given fundamental impor-
tance in text-to-speech (TTS) systems [55]. These analyses
and proposed models provide important data about the natural
way in which human beings use prosody in spoken discourse.
Basically, in the case of TTS systems, prosody gives the nat-
uralness sought in the synthesized speech [61]. Furthermore,
some very interesting experiments have studied human speech
recognition abilities in different prosodic conditions [21] (also
[6] for infants, [27] for spontaneous speech and [35] for di-
alogue/monologue). A typical situation, encountered daily, is
the difficulty in recognizing speech affected by regional ac-
cents [2]. This has been studied in the context of ASR in [22].
Speaker identification is another case in which the use of prosodic
information in spoken language can be observed; see, for
example, [56].
In addition, it is important to note that prosodic modifica-
tions in the utterance evidently induce a considerable mod-
ification of other parameters that are explicitly modeled in
current recognizers. For example, it can clearly be seen that
the spectral characteristics of vowels are modified to a sig-
nificant degree when the intonation changes. Furthermore, the
duration of phonemes (mainly vowels) undergoes a notable
variation depending on the semantic, syntactic and even ortho-
graphic characteristics transmitted in spontaneous speech [4],
[13]. Considerable improvements in recognition performance
can be made by simply taking into account the speaking rate
1063-6676/03$17.00 © 2003 IEEE