IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 6, NO. 3, MAY 1998

HMM-Based Stressed Speech Modeling with Application to Improved Synthesis and Recognition of Isolated Speech Under Stress

Sahar E. Bou-Ghazale and John H. L. Hansen, Senior Member, IEEE

Abstract: In this study, a novel approach is proposed for modeling speech parameter variations between neutral and stressed conditions and is employed in a technique for stressed speech synthesis and recognition. The proposed method consists of modeling the variations in pitch contour, voiced speech duration, and average spectral structure using hidden Markov models (HMM’s). While HMM’s have traditionally been used for recognition applications, here they are employed to statistically model characteristics needed for generating pitch contour and spectral perturbation contour patterns to modify the speaking style of isolated neutral words. The proposed HMM’s are both speaker- and word-independent, but unique to each speaking style. While the modeling scheme is applicable to a variety of stress and emotional speaking styles, the evaluations presented in this study focus on angry speech, the Lombard effect, and loud speech, and cover three areas. First, formal subjective listener evaluations of the modified speech confirm the HMM’s ability to capture the parameter variations under stressed conditions. Second, an objective evaluation using a separately formulated stress classifier is employed to assess the presence of stress imparted on the synthetic speech. Finally, the stressed speech is also used for training and shown to measurably improve the performance of an HMM-based stressed speech recognizer.

Index Terms: Lombard effect, robust speech recognition, speech synthesis, speech under stress.

I. INTRODUCTION

In this study, we consider the problem of speech under stress, with applications to stress modification for speech synthesis and improved training for robust speech recognition.
Stress in this context refers to environmental, emotional, or workload stress. Stress has been shown to alter the normal behavior of human speech production and the resulting speech feature characteristics. The variability introduced by a speaker under stress causes speech recognition systems trained with neutral speech tokens to fail [1]–[4]. Hence, available speech recognition systems are not robust in actual stressful environments such as fighter cockpits, where a pilot is subjected to a number of stress factors such as G-force (gravity), environmental stress due to background noise (Lombard effect [5]),¹ workload stress resulting from task requirements of operating in a cockpit, and emotional stress such as fear. In such environments, a speaker may experience a mixture of emotions or stress conditions rather than a single emotion. Therefore, it is important from the standpoint of voice communication and speech algorithm development to characterize the effects of each condition in order to understand the combined stress effect on speech characteristics. In addition, the same speaker may be subjected to different levels of stress, from mild to extreme, which may affect the variability of speech characteristics.

Manuscript received December 17, 1996; revised June 11, 1997. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Douglas D. O’Shaughnessy.
S. E. Bou-Ghazale was with the Robust Speech Processing Laboratory, Duke University, Durham, NC 27708-0291 USA. She is now with the Personal Computing Division of Rockwell Semiconductor Systems, Newport Beach, CA 92660 USA.
J. H. L. Hansen is with the Robust Speech Processing Laboratory, Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708-0291 USA (e-mail: jhlh@ee.duke.edu).
Publisher Item Identifier S 1063-6676(98)02896-X.
¹ The Lombard effect results when speakers attempt to modify their speech production system in order to increase communication quality while speaking in a noisy environment.
1063–6676/98$10.00 © 1998 IEEE

It should also be noted that each person responds differently to a given stressful condition, and therefore, it is necessary to account for speaker variability under stress.

In this paper, we study the effects of individual stressful conditions on speech characteristics as opposed to a mixture of conditions. While a variety of speech-under-stress conditions are possible, the stress conditions of interest in our study are angry, loud, and the Lombard effect. Although it is equally feasible to model the speech variations introduced by a particular speaker under stress, here the variations across a number of speakers are modeled. Our modeling is intended to represent general characteristics of speech under stress and not variations particular to an individual speaker. This would allow us to develop a general method of stress perturbation, which could be applied to modify the speaking style of any new input synthesis speaker in a way that would convince a majority of listeners that the modified speech is under stress. Therefore, this study develops a novel technique for pitch contour, duration, and spectral contour modeling using hidden Markov models (HMM’s) for the purpose of stressed speech synthesis with application to stressed speech recognition. The HMM perturbation models are word-independent, assuming the word consists of any number of unvoiced regions and one voiced region.

The advantages of modeling the parameter variations using HMM’s are as follows.
1) The models can characterize the stressed data and can also reproduce unlimited observation sequences with the same statistical properties as the training data (due to the regenerative property of HMM’s).