Overlapping Sound Event Recognition using Local Spectrogram Features with the Generalised Hough Transform

Jonathan Dennis 1,2, Huy Dat Tran 1, Eng Siong Chng 2
1 Institute for Infocomm Research, A*STAR, 1 Fusionopolis Way, Singapore 138632
2 School of Computer Engineering, Nanyang Technological University, Singapore
{stujwd,hdtran}@i2r.a-star.edu.sg, ASESChng@ntu.edu.sg

Abstract

We present a novel approach for the recognition of overlapping sound events based on the Generalised Hough Transform (GHT), a technique commonly used for object recognition in the domain of image processing. Unlike our previous work on image-based sound event classification, where we focussed on global image features, here we extract local features from detected interest-points in the spectrogram. These form a robust representation of the local region, and when the information from all interest-points in the spectrogram is combined using the GHT, we can form hypotheses for the location of one or more overlapping sound events in the image. Our experiments show promising results, and demonstrate the ability of our approach to recognise overlapping sounds.

Index Terms: overlapping, sound events, recognition

1. Introduction

The spectrogram of an audio signal provides valuable visual information that, for example, has historically been used to analyse the phonetic structure of speech. A trained expert in "spectrogram reading" [1] is able to pick out the important structures and use these to recognise the underlying speech. Despite this, visual-based techniques for the automatic classification of speech have not been heavily researched, in part due to the complicated lexical structure. Sound events, on the other hand, are typically more sparse and distinct, making the visual information more tractable for automatic classification. In our previous works [2, 3], we developed an image feature framework for sound event classification.
These approaches perform well, particularly in noisy conditions, which vindicates the use of such image processing techniques for sound classification. Related works on the topic of image-based audio classification have focussed on both environmental sounds and music genre classification [4, 5, 6], as music spectrograms have similarities with texture images. Often these approaches simply extend frame-based techniques commonly used in speech processing, or combine block-based techniques with Support Vector Machines (SVM) to classify the whole spectrogram.

In this paper, we address the problem of recognising overlapping sound events, inspired by the visual information. According to the LogMax principle [7], when two sounds are added together each cell of the spectrogram belongs to one of the sounds, as the log-magnitude of the stronger sound will always dominate. Hence, we can still visually see overlapping sounds in the spectrogram. However, overlapping conditions are among the most challenging in audio classification, where performance degrades greatly. While current research focuses more on sound source separation, few works have considered the simultaneous recognition of overlapped sound signals. A related area is Computational Auditory Scene Analysis (CASA) [8], where the aim is to segment the spectrogram to form an Ideal Binary Mask (IBM). However, this segmentation is prone to errors that affect the subsequent classification. Among the available literature on speech separation, the Factorial Hidden Markov Model (FHMM) technique [7] is the most popular.

[Figure 1: Simple example of the Hough Transform for overlapping straight lines in noise. The result is two strong local maxima in the Hough accumulator indicating the hypotheses, as shown on the right hand side.]
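To make the Figure 1 example concrete, the following is a minimal sketch (not taken from the paper) of the classical Hough Transform for straight lines: each point votes for every (theta, rho) parameterisation of a line passing through it, so collinear points reinforce a single accumulator cell. Two overlapping lines in clutter produce two strong local maxima, exactly as the figure shows. All grid sizes and the toy data are illustrative choices.

```python
import numpy as np

def hough_lines(points, img_size, n_theta=180, n_rho=200):
    """Vote each (x, y) point into a (theta, rho) accumulator.

    Collinear points pile up in one cell, since they share the line
    parameters satisfying rho = x*cos(theta) + y*sin(theta).
    """
    diag = np.hypot(*img_size)                     # maximum possible |rho|
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    rhos = np.linspace(-diag, diag, n_rho)
    acc = np.zeros((n_theta, n_rho), dtype=int)
    for x, y in points:
        for ti, t in enumerate(thetas):
            rho = x * np.cos(t) + y * np.sin(t)
            ri = int(round((rho + diag) / (2 * diag) * (n_rho - 1)))
            acc[ti, ri] += 1                       # one vote per (theta, rho)
    return acc, thetas, rhos

# Two overlapping lines plus random clutter: the two strongest
# accumulator cells recover the two line hypotheses.
rng = np.random.default_rng(0)
line1 = [(x, 20) for x in range(50)]               # horizontal line y = 20
line2 = [(x, x) for x in range(50)]                # diagonal line y = x
noise = [tuple(p) for p in rng.integers(0, 50, size=(30, 2))]
acc, thetas, rhos = hough_lines(line1 + line2 + noise, (50, 50))
ti, ri = np.unravel_index(acc.argmax(), acc.shape)
print(np.degrees(thetas[ti]), rhos[ri])            # parameters of a detected line
```

The clutter points scatter their votes across the accumulator, while each 50-point line concentrates its votes in one cell, which is why the peaks remain separable in noise.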
Recently, a method based on cepstral processing and subband distributions has been applied to overlapping audio classification [9]; however, this method is applicable only to certain sound classes, not the general case of sound events.

Here, we propose a novel approach that combines the detection and classification of overlapping sound events into a single framework, using local spectrogram features with the Generalised Hough Transform (GHT) [10]. The fundamental idea of the GHT is to map consistent interpretations of a shape into local maxima in the Hough space, which are then separable. In object recognition, the GHT is realised through a sophisticated voting procedure [11]. In this work, we develop a method that first extracts local features surrounding interest-points in the spectrogram. These features, and their temporal locations, are used to create a model which is used for detecting each sound class. The GHT is used as a scoring mechanism, enabling us to detect multiple overlapping sound events in a single clip. Our experiments show promising results, and we discuss how further work could incorporate robustness to noise and distortion.

The paper is organised as follows. Section 2 first introduces the idea behind our approach. Section 3 then details the recognition algorithm for the proposed method. Section 4 describes our experiments and the results obtained. Finally, Section 5 concludes this work.

ISCA Archive http://www.isca-speech.org/archive — INTERSPEECH 2012, ISCA's 13th Annual Conference, Portland, OR, USA, September 9-13, 2012. DOI: 10.21437/Interspeech.2012-595
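As an illustration of the GHT voting idea described above, here is a hypothetical toy sketch, not the paper's actual feature extraction or model: a table stores, for each local feature, its temporal offset from the event's onset; at recognition time each detected interest-point casts votes for candidate onsets, and peaks in the accumulator become hypotheses. Because each overlapping event accumulates its own peak, two events can be detected in the same clip. The string feature labels and frame-level quantisation are simplifying assumptions for the sketch.

```python
import numpy as np
from collections import defaultdict

def train_ght_model(feature_ids, offsets):
    """Build a lookup table mapping each local feature to the offsets
    (in frames) from the interest-point back to the event's onset."""
    table = defaultdict(list)
    for fid, off in zip(feature_ids, offsets):
        table[fid].append(off)
    return table

def ght_vote(table, observed, n_frames):
    """Each observed (feature, time) interest-point votes for candidate
    onset positions; peaks in the accumulator are event hypotheses."""
    acc = np.zeros(n_frames)
    for fid, t in observed:
        for off in table.get(fid, []):
            onset = t - off
            if 0 <= onset < n_frames:
                acc[onset] += 1
    return acc

# Toy model: an event whose features occur 0, 3 and 7 frames after onset.
model = train_ght_model(["a", "b", "c"], [0, 3, 7])
# Two overlapping instances of the event, with onsets at frames 10 and 14.
observed = [("a", 10), ("b", 13), ("c", 17),
            ("a", 14), ("b", 17), ("c", 21)]
acc = ght_vote(model, observed, 40)
print(np.flatnonzero(acc == 3))   # → [10 14], one hypothesis per event
```

Note that feature "c" at frame 17 and feature "b" at frame 17 belong to different events, yet each still votes for the correct onset; this is how consistent interpretations separate in Hough space even when the events interleave.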