IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 8, NO. 4, JULY 2000 429

A Comparative Study of Traditional and Newly Proposed Features for Recognition of Speech Under Stress

Sahar E. Bou-Ghazale, Member, IEEE, and John H. L. Hansen, Senior Member, IEEE

Abstract—It is well known that the performance of speech recognition algorithms degrades in adverse environments where a speaker is under stress, emotion, or Lombard effect. This study evaluates the effectiveness of traditional features in recognition of speech under stress and formulates new features which are shown to improve stressed speech recognition. The focus is on formulating robust features that are less dependent on the speaking conditions, rather than on applying compensation or adaptation techniques. The stressed speaking styles considered are simulated angry and loud, Lombard effect speech, and noisy actual stressed speech from the SUSAS database, which is available on CD-ROM through the NATO IST/TG-01 research group and LDC.¹ In addition, this study investigates the immunity of the linear prediction power spectrum and the fast Fourier transform (FFT) power spectrum to the presence of stress. Our results show that, unlike the FFT's immunity to noise, the linear prediction power spectrum is more immune than the FFT to stress as well as to the combination of a noisy and stressful environment. Finally, the effects of various parameter processing steps, such as fixed versus variable preemphasis, liftering, and fixed versus cepstral mean normalization, are studied. Two alternative frequency partitioning methods are proposed and compared with traditional mel-frequency cepstral coefficient (MFCC) features for stressed speech recognition. It is shown that the alternative filterbank frequency partitions are more effective for recognition of speech under both simulated and actual stressed conditions.

Index Terms—Linear prediction, Lombard effect, speech recognition, speech under stress.

Manuscript received November 3, 1997; revised June 21, 1999. This work was supported by a grant from the U.S. Air Force Research Laboratory, Rome, NY. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Wu Chou.
S. E. Bou-Ghazale was with the Robust Speech Processing Laboratory, Center for Spoken Language Research, University of Colorado, Boulder, CO 80309-0258 USA. She is now with the Network Access Division, Conexant Systems, Inc., Newport Beach, CA 92658-8902 USA.
J. H. L. Hansen is with the Robust Speech Processing Laboratory, Center for Spoken Language Research, University of Colorado, Boulder, CO 80309-0258 USA (e-mail: jhlh@cslr.colorado.edu; http://cslr.colorado.edu/rspl/).
Publisher Item Identifier S 1063-6676(00)05331-1.
¹ http://morph.ldc.upenn.edu/Catalog/LDC99S78.html.

I. INTRODUCTION

It is well known that the performance of speech recognition systems degrades in the presence of stress [2], [4]–[6], [8], [20]. Stress in this context refers to speech produced under environmental, emotional, or workload stress. The stress conditions considered in this study include simulated angry and loud, Lombard effect conditions, and actual stressed speech, all obtained from the SUSAS (Speech Under Simulated and Actual Stress) database [9]. The stress condition referred to as Lombard effect results when a speaker attempts to modify his or her speech production system while speaking in a noisy environment [13], [20].

Fig. 1. Types of distortion which can be addressed for robust speech recognition.

To improve the performance of speech recognition algorithms under stress, a number of methods have been considered. These fall into three general areas: 1) robust features, 2) stress equalization methods, and 3) model adjustment or training methods. Fig. 1 shows a general speech recognition scenario which considers a variety of speech/speaker distortions, and the three general approaches to robust speech recognition.
For this scenario, we assume that a speaker is exposed to some adverse environment, where ambient noise is present and a stress-inducing task is required (or the speaker is experiencing emotional stress). The adverse environment could be a noisy automobile where cellular communication is used, a high-stress noisy helicopter or aircraft cockpit, or another environment where hands-free operation is needed. Since the user task could be demanding, the speaker is required to divert a measured level of cognitive processing, leaving formulation of speech for recognition as a secondary task.

1063–6676/00$10.00 © 2000 IEEE

Some speech recognition studies have adapted the recognizer to the input stressed speech during training [14], or compensated for the effect of stress during the recognition testing phase (e.g., formant location and bandwidth stress equalization [6], [7], [21]; whole-word cepstral compensation [2]; slope-dependent weighting [15]; formant shifting [17]; source-generator based codebook stress compensation [16],