SHOUT DETECTION IN NOISE

Jouni Pohjalainen 1, Paavo Alku 1, Tomi Kinnunen 2
1 Aalto University, Department of Signal Processing and Acoustics, Espoo, Finland
2 University of Eastern Finland, School of Computing, Joensuu, Finland

ABSTRACT

For the task of detecting shouted speech in a noisy environment, this paper introduces a system based on mel frequency cepstral coefficient (MFCC) feature extraction, unsupervised frame dropping and Gaussian mixture model (GMM) classification. The evaluation material consists of phonemically identical speech and shouting as well as environmental noise of varying levels. The performance of the shout detection system is analyzed by varying the MFCC feature extraction with respect to 1) the feature vector length and 2) the spectrum estimation method. As for feature vector length, the best performance is obtained using 30 MFCC coefficients, which is more than what is conventionally used. In spectrum estimation, a scheme that combines a linear prediction spectrum envelope with spectral fine structure outperforms the conventional FFT.

Index Terms: shout detection

1. INTRODUCTION

Recently, several audio surveillance systems have been proposed to detect abnormal or potentially alarming sounds in specific acoustic environments. Examples include the detection of non-neutral speech and banging in an elevator [1], the detection of shouts in a train [2] and the detection of screams, gunshots and explosions in urban or military environments [3].

It can be argued that shouting is a quite generic acoustic indicator of a potentially hazardous situation in an environment typically characterized by normal speaking voices and non-vocal environmental sounds. Shouting in such an environment is typically associated with some degree of urgency. Hence, reliable detection of shouted speech in noisy environments is an essential research topic in the area of audio surveillance technology.
This topic is addressed in the present paper by proposing a system with which the performance of several shout detection techniques can be compared.

Previous studies have examined the detection of shouted speech [2, 4] or screams [3, 5, 6] apart from environmental noise, often also including normal speech as test material [2, 3, 4]. Differently from previous approaches, the present study uses exactly the same textual material for both shouted speech and normal speech. It can be argued that this scenario is more challenging, because when the shouts and normal speech share the same phonemic content, phonemic differences between the two classes cannot aid the detection. In some previous studies, the robustness of scream detection with respect to a decreasing signal-to-noise ratio (SNR) has been examined [3, 5, 6], and the performance has been found to degrade steeply when the SNR is close to 0 dB. This degradation has sometimes been tackled by training the shout/scream models with data that already contains the expected type and amount of noise corruption [2, 6], but this calls for a complete retraining whenever the noise environment changes and, as noted in [6], increases the number of false alarms. In practice, the distance between the microphone and the person shouting determines the SNR, and it is desirable that the performance is independent of whether the person shouting is close to or further away from the microphone. Clearly, there is a demand for techniques that improve the noise robustness of shout and scream detection. Towards this end, the present study trains the system on clean (not noisy) vocal data and investigates the degradation of performance as the SNR decreases. This is done using two different realistic noise types: factory noise and large crowd babble.

(This work was supported by Academy of Finland projects 127345 and 132129.)
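To make the evaluation condition concrete, corrupting clean speech at a target SNR amounts to scaling the noise so that the speech-to-noise power ratio matches the desired value. The following is a minimal sketch of such mixing; the function name `mix_at_snr` and the list-of-floats signal representation are illustrative choices, not from the paper.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that adding it to `speech` yields the target SNR.

    Minimal sketch: assumes equal-length lists of float samples and
    nonzero power in both signals.
    """
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Choose gain g so that 10*log10(p_speech / (g^2 * p_noise)) == snr_db
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

In a surveillance-style evaluation like the one described above, the same clean shout/speech recordings would be mixed at several SNR levels (e.g. 20 dB down to 0 dB) while the models remain trained on clean data only.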
The proposed shout detection system is based on two well-known audio recognition techniques: feature extraction based on mel frequency cepstral coefficients (MFCC) and classification using Gaussian mixture models (GMMs), both of which have been popular in previous audio surveillance systems, e.g. [1, 2, 3]. These are complemented with several techniques to improve the robustness of shout detection in adverse noise conditions. In particular, the conventional Fourier-based spectrum estimation in the MFCC computation is replaced with new methods that combine linear predictive spectral envelope modeling with spectral fine structure, i.e., the fundamental frequency (F0) and its harmonics. Hence, the importance of F0 and its integer multiples can be increased in the MFCC feature extraction, a goal that is justified by the fact that shouting in speech communication correlates with the use of high pitch [7]. In addition, the number of MFCC coefficients is varied with different spectral modeling techniques in order to better capture shout-discriminating characteristics. The proposed system utilizes an unsupervised time series segmentation method for energy-based frame dropping prior to GMM training and detection.

2. SHOUT DETECTION SYSTEM

2.1. MFCC feature extraction

The input to the system is sampled at 16 kHz and pre-emphasized with Hp(z) = 1 - 0.97z^-1. The signal is processed in Hamming-windowed frames of 25 ms with a 10 ms interval between two frames. Fig. 1 shows the complete chain of MFCC computation used in the present work.

The feature extraction uses MFCCs as a representation of the short-time magnitude spectrum [8]. Different methods are evaluated for the estimation of the magnitude spectrum. The fast Fourier transform (FFT) is the conventional spectrum estimation method for the MFCC computation.
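The front-end steps stated above (pre-emphasis with Hp(z) = 1 - 0.97z^-1, then 25 ms Hamming-windowed frames at a 10 ms shift, 16 kHz sampling) can be sketched as follows. This is a generic illustration of those steps, not the authors' implementation; function names and the list-based signal type are assumptions.

```python
import math

def preemphasize(x, coeff=0.97):
    # First-order FIR filter H_p(z) = 1 - 0.97 z^-1
    return [x[0]] + [x[n] - coeff * x[n - 1] for n in range(1, len(x))]

def frame_signal(x, fs=16000, frame_ms=25, shift_ms=10):
    """Split a signal into Hamming-windowed frames.

    At 16 kHz: frame length 400 samples, frame shift 160 samples.
    """
    frame_len = fs * frame_ms // 1000
    shift = fs * shift_ms // 1000
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(x) - frame_len + 1, shift):
        frames.append([w * s
                       for w, s in zip(window, x[start:start + frame_len])])
    return frames
```

Each windowed frame would then pass through the spectrum estimator under study (FFT or an LP-based method), the mel filterbank, log compression and the discrete cosine transform to produce the MFCC vector.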
Recently, the present authors have investigated the use of different forms of linear predictive models in MFCC feature extraction for automatic speech recognition and speaker verification in adverse conditions. In particular, weighted linear prediction (WLP) and its variants have led to improved robustness in these applications, e.g. [9, 10, 11]. The explanation of LP and WLP is

978-1-4577-0539-7/11/$26.00 ©2011 IEEE, ICASSP 2011
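For reference, the conventional autocorrelation-method LP that WLP and its variants modify can be sketched as below. This is a generic textbook formulation (Levinson-Durbin recursion plus all-pole envelope evaluation), not the paper's WLP method; function names are illustrative.

```python
import math

def autocorr_lp(frame, order=12):
    """Conventional autocorrelation-method linear prediction.

    Returns coefficients a[0..order] (a[0] unused) of the predictor
    x[n] ~ sum_{k=1..order} a[k] * x[n-k], via Levinson-Durbin.
    """
    n = len(frame)
    r = [sum(frame[i] * frame[i + k] for i in range(n - k))
         for k in range(order + 1)]
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a

def lp_envelope(a, n_fft=512):
    # Magnitude of 1 / A(e^{jw}) with A(z) = 1 - sum_k a[k] z^-k,
    # evaluated on n_fft//2 + 1 uniformly spaced frequency bins
    env = []
    for m in range(n_fft // 2 + 1):
        w = 2.0 * math.pi * m / n_fft
        re = 1.0 - sum(a[k] * math.cos(w * k) for k in range(1, len(a)))
        im = sum(a[k] * math.sin(w * k) for k in range(1, len(a)))
        env.append(1.0 / math.sqrt(re * re + im * im))
    return env
```

WLP replaces the uniform error criterion of this baseline with a temporally weighted one (e.g. short-time-energy weighting), which is what yields the robustness gains reported in [9, 10, 11].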