SHOUT DETECTION IN NOISE

Jouni Pohjalainen¹, Paavo Alku¹, Tomi Kinnunen²

¹ Aalto University, Department of Signal Processing and Acoustics, Espoo, Finland
² University of Eastern Finland, School of Computing, Joensuu, Finland
ABSTRACT
For the task of detecting shouted speech in a noisy environment,
this paper introduces a system based on mel frequency cepstral co-
efficient (MFCC) feature extraction, unsupervised frame dropping
and Gaussian mixture model (GMM) classification. The evaluation
material consists of phonemically identical speech and shouting as
well as environmental noise of varying levels. The performance of
the shout detection system is analyzed by varying the MFCC fea-
ture extraction with respect to 1) the feature vector length and 2) the
spectrum estimation method. As for feature vector length, the best
performance is obtained using 30 MFCC coefficients, which is more
than what is conventionally used. In spectrum estimation, a scheme
that combines a linear prediction spectrum envelope with spectral
fine structure outperforms the conventional FFT.
Index Terms— shout detection
1. INTRODUCTION
Recently, several audio surveillance systems have been proposed to
detect abnormal or potentially alarming sounds in specific acoustic
environments. Examples include the detection of non-neutral speech
and banging in elevators [1], the detection of shouts in trains [2] and
the detection of screams, gunshots and explosions in urban or mili-
tary environments [3].
It can be argued that shouting is quite a generic acoustic indicator
of a potentially hazardous situation in an environment typically
characterized by normal speaking voices and non-vocal environmen-
tal sounds. Shouting in such an environment is typically associated
with some degree of urgency. Hence, reliable detection of shouted
speech in noisy environments is an essential research topic in the
area of audio surveillance technology. The present paper addresses
this topic by proposing a system with which the performance of
several shout detection techniques can be compared.
Previous studies have examined the detection of shouted speech
[2] [4] or screams [5] [6] [3] apart from environmental noise, often
also including normal speech as test material [2] [4] [3]. Unlike
previous approaches, the present study uses exactly the same
textual material for both shouted speech and normal speech. It can
be argued that this scenario is more challenging, because when the
shouts and normal speech share the same phonemic content, phone-
mic differences between the two classes cannot aid the detection. In
some previous studies, the robustness of scream detection with re-
spect to decreasing signal-to-noise ratio (SNR) has been examined
[5] [6] [3] and the performance has been found to degrade steeply
when the SNR is close to 0 dB. This degradation has sometimes
been tackled by training the shout/scream models with data that
already contains the expected type and amount of noise corruption
[2] [6], but this calls for a complete retraining whenever the noise
environment changes and, as noted in [6], increases the number of
false alarms. [Footnote: The work was supported by Academy of
Finland projects 127345 and 132129.] In practice, the distance
between the microphone and the
person shouting determines the SNR, and it is desirable that the per-
formance is independent of whether the person shouting is close to
or further away from the microphone. Clearly, there is a demand for
techniques that improve the noise robustness of shout and scream
detection. Towards this end, the present study trains the system on
clean (not noisy) vocal data and investigates the degradation of per-
formance as the SNR decreases. This is done using two different
realistic noise types: factory noise and large crowd babble.
The proposed shout detection system is based on two well-
known audio recognition techniques: feature extraction based on
mel frequency cepstral coefficients (MFCC) and classification using
Gaussian mixture models (GMMs), both of which have been popular
in previous audio surveillance systems, e.g. [1] [2] [3]. These are
complemented with several techniques to improve the robustness
of shout detection in adverse noise conditions. In particular, the
conventional Fourier-based spectrum estimation in the MFCC com-
putation is replaced with new methods that combine linear predictive
spectral envelope modeling with spectral fine structure, i.e., the fun-
damental frequency (F0) and its harmonics. Hence, the importance
of F0 and its integer multiples can be increased in the MFCC feature
extraction, a goal that is justified by the fact that shouting in speech
communication correlates with the use of high pitch [7]. In addition,
the number of MFCC coefficients is varied with different spectral
modeling techniques in order to better capture shout-discriminating
characteristics. The proposed system utilizes an unsupervised time
series segmentation method for energy-based frame dropping prior
to GMM training and detection.
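As an illustration of the kind of unsupervised, energy-based frame dropping referred to above, the following sketch clusters frame log-energies into two groups with a simple 2-means rule and keeps only the louder cluster. The function name and the clustering rule are illustrative assumptions, not the authors' exact segmentation method.

```python
import numpy as np

def drop_low_energy_frames(frames, n_iter=20):
    """Unsupervised frame dropping: cluster frame log-energies into
    two groups with a simple 2-means procedure and keep only the
    frames assigned to the higher-energy cluster."""
    frames = np.asarray(frames, dtype=float)
    log_e = np.log(np.sum(frames ** 2, axis=1) + 1e-12)
    lo, hi = log_e.min(), log_e.max()   # initialize cluster centers
    if lo == hi:                        # all frames equally loud: keep all
        return frames
    for _ in range(n_iter):
        high = np.abs(log_e - hi) < np.abs(log_e - lo)
        lo, hi = log_e[~high].mean(), log_e[high].mean()
    return frames[high]
```

Dropping low-energy (silence or background-only) frames before GMM training keeps the models focused on vocal activity rather than on the ambient noise floor.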
2. SHOUT DETECTION SYSTEM
2.1. MFCC feature extraction
The input to the system is sampled at 16 kHz and pre-emphasized
with H_p(z) = 1 - 0.97 z^{-1}. The signal is processed in Hamming-
windowed frames of 25 ms with a 10 ms interval between consecutive
frames. Fig. 1 shows the complete chain of MFCC computation
used in the present work.
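The front end described above can be sketched as follows; `preemphasize_and_frame` is a hypothetical helper name, and the handling of the first sample in the pre-emphasis is one common convention among several.

```python
import numpy as np

def preemphasize_and_frame(x, fs=16000, frame_ms=25, shift_ms=10, coef=0.97):
    """Pre-emphasis H_p(z) = 1 - 0.97 z^(-1) followed by framing into
    Hamming-windowed frames of 25 ms taken every 10 ms."""
    x = np.asarray(x, dtype=float)
    y = np.append(x[0], x[1:] - coef * x[:-1])  # first-order pre-emphasis
    flen = fs * frame_ms // 1000                # 400 samples at 16 kHz
    shift = fs * shift_ms // 1000               # 160 samples at 16 kHz
    win = np.hamming(flen)
    n_frames = 1 + (len(y) - flen) // shift     # assumes len(y) >= flen
    return np.stack([win * y[i * shift : i * shift + flen]
                     for i in range(n_frames)])
```

At 16 kHz, a 0.1 s signal thus yields eight overlapping 400-sample frames.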
The feature extraction uses MFCCs as a representation for the
short-time magnitude spectrum [8]. Different methods are evaluated
for the estimation of the magnitude spectrum. The fast Fourier trans-
form (FFT) is the conventional spectrum estimation method for the
MFCC computation. Recently, the present authors have investigated
the use of different forms of linear predictive models in the MFCC
feature extraction for automatic speech recognition and speaker ver-
ification in adverse conditions. In particular, weighted linear predic-
tion (WLP) and its variants have led to improved robustness in these
applications, e.g. [9] [10] [11]. The explanation of LP and WLP is
978-1-4577-0539-7/11/$26.00 ©2011 IEEE, ICASSP 2011
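To make the contrast between the two spectrum estimation approaches concrete, the sketch below computes a frame's magnitude spectrum either with the conventional FFT or with a smooth all-pole LP envelope (autocorrelation method, Levinson-Durbin recursion). The function names, model order and FFT length are illustrative choices; the WLP variants and the combination of the envelope with spectral fine structure studied in the paper are not reproduced here. The mel filterbank and DCT stages of the MFCC chain would follow identically in both cases.

```python
import numpy as np

def lp_coefficients(frame, order):
    """All-pole model via the autocorrelation method and the
    Levinson-Durbin recursion; returns (a, prediction_error)."""
    frame = np.asarray(frame, dtype=float)
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:len(frame) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err  # reflection coefficient
        a[1:i + 1] = a[1:i + 1] + k * a[:i][::-1]   # order-update of predictor
        err *= 1.0 - k * k
    return a, err

def magnitude_spectrum(frame, method='fft', order=20, nfft=512):
    """Conventional FFT magnitude spectrum, or a smooth LP envelope."""
    if method == 'fft':
        return np.abs(np.fft.rfft(frame, nfft))
    a, err = lp_coefficients(frame, order)
    return np.sqrt(max(err, 1e-12)) / np.abs(np.fft.rfft(a, nfft))
```

The FFT branch retains harmonic fine structure, while the LP branch yields only the smooth envelope; the schemes evaluated in the paper aim to control how much of the fine structure enters the MFCC computation.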