IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 8, NO. 4, JULY 2000 429
A Comparative Study of Traditional and
Newly Proposed Features for Recognition of
Speech Under Stress
Sahar E. Bou-Ghazale, Member, IEEE, and John H. L. Hansen, Senior Member, IEEE
Abstract—It is well known that the performance of speech
recognition algorithms degrades in the presence of adverse environments
where a speaker is under stress, emotion, or Lombard
effect. This study evaluates the effectiveness of traditional features
in recognition of speech under stress and formulates new features
which are shown to improve stressed speech recognition. The
focus is on formulating robust features which are less dependent
on the speaking conditions rather than applying compensation or
adaptation techniques. The stressed speaking styles considered
are simulated angry and loud, Lombard effect speech, and noisy
actual stressed speech from the SUSAS database which is available
on CD-ROM through the NATO IST/TG-01 research group and
LDC (footnote 1). In addition, this study investigates the immunity of linear
prediction power spectrum and fast Fourier transform power
spectrum to the presence of stress. Our results show that, in contrast
to the fast Fourier transform’s (FFT) immunity to noise, the linear
prediction power spectrum is more immune than the FFT to stress
as well as to a combination of a noisy and stressful environment.
Finally, the effects of various parameter processing methods, such as fixed
versus variable preemphasis, liftering, and fixed versus cepstral
mean normalization, are studied. Two alternative frequency
partitioning methods are proposed and compared with traditional
mel-frequency cepstral coefficients (MFCC) features for stressed
speech recognition. It is shown that the alternate filterbank
frequency partitions are more effective for recognition of speech
under both simulated and actual stressed conditions.
Index Terms—Linear prediction, Lombard effect, speech recognition,
speech under stress.
Manuscript received November 3, 1997; revised June 21, 1999. This work
was supported by a grant from the U.S. Air Force Research Laboratory, Rome,
NY. The associate editor coordinating the review of this manuscript and
approving it for publication was Dr. Wu Chou.
S. E. Bou-Ghazale was with Robust Speech Processing Laboratory, Center for
Spoken Language Research, University of Colorado, Boulder, CO 80309-0258
USA. She is now with Network Access Division, Conexant Systems, Inc.,
Newport Beach, CA 92658-8902 USA.
J. H. L. Hansen is with Robust Speech Processing Laboratory, Center for
Spoken Language Research, University of Colorado, Boulder, CO 80309-0258
USA (e-mail: jhlh@cslr.colorado.edu; http://cslr.colorado.edu/rspl/).
Publisher Item Identifier S 1063-6676(00)05331-1.
1 http://morph.ldc.upenn.edu/Catalog/LDC99S78.html.
Fig. 1. Types of distortion which can be addressed for robust speech
recognition.
I. INTRODUCTION
It is well known that the performance of speech recognition
systems degrades in the presence of stress [2], [4]–[6],
[8], [20]. Stress in this context refers to speech produced
under environmental, emotional, or workload stress. The stress
conditions considered in this study include simulated angry and
loud, Lombard effect conditions, and actual stressed speech,
all obtained from the SUSAS (Speech Under Simulated and
Actual Stress) [9] database. The stress condition referred to
as Lombard effect results when a speaker attempts to modify
his or her speech production system while speaking in a
noisy environment [13], [20]. To improve the performance
of speech recognition algorithms under stress, a number of
methods have been considered. These fall into three general
areas: 1) robust features, 2) stress equalization methods,
and 3) model adjustment or training methods. Fig. 1 shows
a general speech recognition scenario which considers a
variety of speech/speaker distortions, and the three general
approaches to robust speech recognition. For this scenario, we
assume that a speaker is exposed to some adverse environment,
where ambient noise is present and a stress-inducing task is
required (or the speaker is experiencing emotional stress).
The adverse environment could be a noisy automobile where
cellular communication is used, high-stress noisy helicopter
or aircraft cockpits, or other environments where hands-free
operation is needed. Since the user task could be demanding,
the speaker is required to divert a measured level of cognitive
processing, leaving formulation of speech for recognition as a
secondary task. Some speech recognition studies have adapted
the recognizer to the input stressed speech during training
[14], or compensated for the effect of stress during the recognition
testing phase (e.g., formant location and bandwidth stress
equalization [6], [7], [21]; whole-word cepstral compensation
[2]; slope-dependent weighting [15]; formant shifting [17];
source-generator based codebook stress compensation [16],