Cascade classifiers trained on Gammatonegrams for reliably detecting Audio Events

Pasquale Foggia, Alessia Saggese, Nicola Strisciuglio, Mario Vento
Dept. of Computer Eng. and Electrical Eng. and Applied Mathematics
University of Salerno
Via Giovanni Paolo II, 132, Fisciano (SA), Italy
{pfoggia, asaggese, nstrisciuglio, mvento}@unisa.it

Abstract

In this paper we propose a novel method for the detection of events of interest through audio analysis. The system that we propose is based on the representation of the audio stream through a Gammatone image, which describes the time-frequency distribution of the energy of the signal; this representation is inspired by the functioning of the human auditory system. A pool of AdaBoost cascade classifiers, one for each class of events of interest, is involved in the event detection stage. The performance of the proposed system has been evaluated on a large data set of audio events for surveillance applications, and the achieved results, compared with two state-of-the-art approaches, confirm its effectiveness.

1. Introduction

Nowadays, intelligent surveillance systems are widely employed in contexts where proactive monitoring is required to support human operators. Currently used systems are mainly based on the automatic analysis of video streams from surveillance cameras for several purposes, for instance object tracking [7] or activity recognition [9]. However, there are cases in which video-based surveillance suffers from several problems. For instance, in poorly illuminated zones at night or in huge parking areas it is difficult and very expensive to achieve good coverage of the whole scene by means of cameras. In these kinds of situations, the analysis of the audio stream captured by a microphone can be useful to detect abnormal events and consequently raise an alarm to the human operator.
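The gammatonegram representation mentioned in the abstract can be sketched as a bank of ERB-spaced gammatone filters followed by per-frame log energies. The code below is only an illustration of the general technique, not the authors' implementation: the channel count (64), frame length (25 ms) and hop size (10 ms) are illustrative assumptions, and the filter design relies on SciPy's `gammatone` helper.

```python
import numpy as np
from scipy.signal import gammatone, lfilter

def erb_space(fmin, fmax, n):
    # Center frequencies equally spaced on the ERB-rate scale
    # (Glasberg & Moore), mimicking cochlear frequency resolution.
    lo = 21.4 * np.log10(4.37e-3 * fmin + 1)
    hi = 21.4 * np.log10(4.37e-3 * fmax + 1)
    return (10 ** (np.linspace(lo, hi, n) / 21.4) - 1) / 4.37e-3

def gammatonegram(x, fs=16000, n_channels=64, fmin=50, frame=400, hop=160):
    # Time-frequency image: one row per gammatone channel,
    # one column per analysis frame (log energy).
    freqs = erb_space(fmin, 0.9 * fs / 2, n_channels)
    n_frames = 1 + (len(x) - frame) // hop
    G = np.empty((n_channels, n_frames))
    for i, fc in enumerate(freqs):
        b, a = gammatone(fc, 'iir', fs=fs)  # 4th-order IIR gammatone filter
        y = lfilter(b, a, x)
        for t in range(n_frames):
            seg = y[t * hop : t * hop + frame]
            G[i, t] = np.log(np.sum(seg ** 2) + 1e-12)  # log frame energy
    return G

# Example: gammatonegram of one second of noise at 16 kHz
x = np.random.default_rng(0).standard_normal(16000)
G = gammatonegram(x)  # shape: (64 channels, 98 frames)
```

The resulting image can then be fed to the detection stage much like a visual feature map.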
There are many kinds of events that can be effectively detected by using audio sensors, such as gunshots, human shouting or crying, glass breaking and so on. Such events cannot easily be detected from the video stream alone, since their visual appearance can be hard to interpret: think, for instance, of a gunshot fired inside a crowd. Thus, automatic systems for abnormal audio event detection can provide a further source of information, used as an alternative to or in combination with video analytic systems. Their adoption is simplified by the fact that many commercial IP cameras are today equipped with microphones, so that an audio surveillance application can be an inexpensive add-on to existing appliances. Another potential application of audio analysis is the localization of sound sources [14, 13], in order to point a PTZ camera towards the area where the abnormal event is occurring.

One of the difficulties in audio event detection is that audio events can occur at different time scales: a gunshot, for instance, is an impulsive sound whose duration is much shorter than that of a sustained sound like a scream. A sound of interest can also be significantly altered by the presence of a strong background sound. Thus, it is not easy to design a system able to represent and recognize events having such different durations.

The interest of the scientific community in the field of audio surveillance has been growing in recent years, and a number of papers have been published. Gaussian Mixture Model (GMM) classifiers have been used to approach the problem in different ways. They have been employed to classify feature vectors based on Mel-Frequency Cepstral Coefficients (MFCC) [3] or on wavelet-based cepstral coefficients [15] in order to detect screams or gunshots, or to address the background sound modeling problem [6].
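As an illustration of the GMM-based approaches cited above, the sketch below fits one Gaussian mixture per event class on per-frame feature vectors and assigns a clip to the class with the highest total log-likelihood. It is a generic sketch, not the method of [3] or [15]: random Gaussian blobs stand in for real MFCC frames, and the 13 coefficients and 4 mixture components are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-ins for MFCC frame matrices (n_frames x n_coeffs); in a real
# system these would be extracted from labelled training clips.
scream_feats = rng.normal(loc=2.0, size=(500, 13))
gunshot_feats = rng.normal(loc=-2.0, size=(500, 13))

# One GMM per event class, fit on that class's training frames
models = {
    'scream': GaussianMixture(n_components=4, random_state=0).fit(scream_feats),
    'gunshot': GaussianMixture(n_components=4, random_state=0).fit(gunshot_feats),
}

def classify(clip_feats):
    # Assign the clip to the class whose GMM gives the highest
    # total log-likelihood over all its frames.
    return max(models, key=lambda c: models[c].score_samples(clip_feats).sum())

test_clip = rng.normal(loc=2.0, size=(50, 13))
print(classify(test_clip))  # → 'scream'
```

Summing per-frame log-likelihoods treats frames as independent, which is the usual simplifying assumption in these frame-based schemes.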
GMMs have also been employed in more complex architectures, such as a multi-stage classifier that separates abnormal events from the background and then classifies them [8], or in combination with Support Vector Machines (SVM) [12]. In [4], impulsive and sustained sounds are classified by means of two classifiers that work at different time scales, while in [2], the sound is treated as a sequence of symbols that represent spectral shapes and