Cascade classifiers trained on Gammatonegrams for reliably detecting Audio Events

Pasquale Foggia, Alessia Saggese, Nicola Strisciuglio, Mario Vento
Dept. of Computer Eng. and Electrical Eng. and Applied Mathematics
University of Salerno
Via Giovanni Paolo II, 132, Fisciano (SA), Italy
{pfoggia, asaggese, nstrisciuglio, mvento}@unisa.it

Abstract

In this paper we propose a novel method for the detection of events of interest through audio analysis. The system that we propose is based on the representation of the audio stream through a Gammatone image, which describes the time-frequency distribution of the energy of the signal; this representation is inspired by the functioning of the human auditory system. A pool of AdaBoost cascade classifiers, one for each class of events of interest, is involved in the event detection stage. The performance of the proposed system has been evaluated on a large data set of audio events for surveillance applications, and the achieved results, compared with two state-of-the-art approaches, confirm its effectiveness.

1. Introduction

Nowadays, intelligent surveillance systems are widely employed in contexts where proactive monitoring is required to support human operators. Currently used systems are mainly based on the automatic analysis of video streams from surveillance cameras for several purposes, for instance object tracking [7] or activity recognition [9]. However, there are cases in which video-based surveillance suffers from several problems. For instance, in poorly illuminated zones at night or in huge parking areas it is difficult and very expensive to achieve good coverage of the whole scene by means of cameras. In these kinds of situations, the analysis of the audio stream captured by a microphone can be useful to detect abnormal events and consequently raise an alarm to the human operator.
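The gammatonegram representation mentioned in the abstract can be sketched as a bank of ERB-spaced gammatone filters followed by per-frame log energies. The code below is only an illustration of the general technique, not the authors' implementation: the channel count (64), frame length (25 ms) and hop size (10 ms) are illustrative assumptions, and the filter design relies on SciPy's `gammatone` helper.

```python
import numpy as np
from scipy.signal import gammatone, lfilter

def erb_space(fmin, fmax, n):
    # Center frequencies equally spaced on the ERB-rate scale
    # (Glasberg & Moore), mimicking cochlear frequency resolution.
    lo = 21.4 * np.log10(4.37e-3 * fmin + 1)
    hi = 21.4 * np.log10(4.37e-3 * fmax + 1)
    return (10 ** (np.linspace(lo, hi, n) / 21.4) - 1) / 4.37e-3

def gammatonegram(x, fs=16000, n_channels=64, fmin=50, frame=400, hop=160):
    # Time-frequency image: one row per gammatone channel,
    # one column per analysis frame (log energy).
    freqs = erb_space(fmin, 0.9 * fs / 2, n_channels)
    n_frames = 1 + (len(x) - frame) // hop
    G = np.empty((n_channels, n_frames))
    for i, fc in enumerate(freqs):
        b, a = gammatone(fc, 'iir', fs=fs)  # 4th-order IIR gammatone filter
        y = lfilter(b, a, x)
        for t in range(n_frames):
            seg = y[t * hop : t * hop + frame]
            G[i, t] = np.log(np.sum(seg ** 2) + 1e-12)  # log frame energy
    return G

# Example: gammatonegram of one second of noise at 16 kHz
x = np.random.default_rng(0).standard_normal(16000)
G = gammatonegram(x)  # shape: (64 channels, 98 frames)
```

The resulting image can then be fed to the detection stage much like a visual feature map.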
There are many kinds of events that can be effectively detected by using audio sensors, such as gunshots, human shouting or crying, glass breaking and so on. Such events cannot easily be detected from the video stream alone, since their visual appearance can be hard to interpret: think, for instance, of a gunshot fired inside a crowd. Thus, automatic systems for abnormal audio event detection can provide a further source of information, used as an alternative to or in combination with video analytic systems. Their adoption is simplified by the fact that many commercial IP cameras are today equipped with microphones, so that an audio surveillance application can be an inexpensive add-on to existing appliances. Another potential application of audio analysis is the localization of sound sources [14, 13], in order to point a PTZ camera towards the area where the abnormal event is occurring.

One of the difficulties in audio event detection is that audio events can occur at different time scales: a gunshot, for instance, is an impulsive sound whose duration is much shorter than that of a sustained sound like a scream. A sound of interest can also be significantly altered by the presence of a strong background sound. Thus, it is not easy to design a system able to represent and recognize events having such different durations.

The interest of the scientific community in the field of audio surveillance has been growing in recent years, and a number of papers have been published. Gaussian Mixture Model (GMM) classifiers have been used to approach the problem in different ways. They have been employed to classify feature vectors based on Mel-Frequency Cepstral Coefficients (MFCC) [3] or on wavelet-based cepstral coefficients [15] in order to detect screams or gunshots, or to address the background sound modeling problem [6].
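As an illustration of the GMM-based approaches cited above, the sketch below fits one Gaussian mixture per event class on per-frame feature vectors and assigns a clip to the class with the highest total log-likelihood. It is a generic sketch, not the method of [3] or [15]: random Gaussian blobs stand in for real MFCC frames, and the 13 coefficients and 4 mixture components are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-ins for MFCC frame matrices (n_frames x n_coeffs); in a real
# system these would be extracted from labelled training clips.
scream_feats = rng.normal(loc=2.0, size=(500, 13))
gunshot_feats = rng.normal(loc=-2.0, size=(500, 13))

# One GMM per event class, fit on that class's training frames
models = {
    'scream': GaussianMixture(n_components=4, random_state=0).fit(scream_feats),
    'gunshot': GaussianMixture(n_components=4, random_state=0).fit(gunshot_feats),
}

def classify(clip_feats):
    # Assign the clip to the class whose GMM gives the highest
    # total log-likelihood over all its frames.
    return max(models, key=lambda c: models[c].score_samples(clip_feats).sum())

test_clip = rng.normal(loc=2.0, size=(50, 13))
print(classify(test_clip))  # → 'scream'
```

Summing per-frame log-likelihoods treats frames as independent, which is the usual simplifying assumption in these frame-based schemes.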
GMMs have also been employed in more complex architectures, such as a multi-stage classifier that separates abnormal events from the background and then classifies them [8], or in combination with Support Vector Machines (SVM) [12]. In [4], impulsive and sustained sounds are classified by means of two classifiers that work at different time scales, while in [2], the sound is treated as a sequence of symbols that represent spectral shapes and