Frame-Level Speech/Music Discrimination using AdaBoost

Norman Casagrande, Douglas Eck, Balázs Kégl
University of Montreal, Department of Computer Science
CP 6128, Succ. Centre-Ville, Montreal, Quebec H3C 3J7, Canada
casagran@iro.umontreal.ca, eckdoug@iro.umontreal.ca, kegl@iro.umontreal.ca

ABSTRACT

In this paper we adapt an AdaBoost-based image processing algorithm to the task of predicting whether an audio signal contains speech or music. We derive a frame-level discriminator that is both fast and accurate. Using a simple FFT and no built-in prior knowledge of signal structure, we obtain an accuracy of 88% on frames sampled at 20ms intervals. When we smooth the output of the classifier with the output of the previous 40 frames, our accuracy rises to 93% on the Scheirer-Slaney (Scheirer and Slaney, 1997) database. To demonstrate the efficiency and effectiveness of the model, we have implemented it as a graphical real-time plugin to the popular Winamp audio player.

1 Introduction

The ability to automatically discriminate speech from music in an audio signal is useful in domains where a particular type of information is of interest, such as in automatic audio news transcription of a radio broadcast, where non-speech would presumably be discarded. Previous models have employed a mixture of simple features that capture certain temporal and spectral properties of the signal (Scheirer and Slaney, 1997; Saunders, 1996), including for example pitch, amplitude, zero-crossing rate, cepstral values, and line spectral frequencies (LSF).
More recently, other approaches have used the posterior probability of a frame being in a particular phoneme class (Williams and Ellis, 1999), HMMs that integrate posterior probability features based on entropy and "dynamism" (Ajmera et al., 2002), and a mixture of Gaussians on small frames (Ezzaidi and Rouat, 2002).

We have adapted a successful and robust approach for object detection (Viola and Jones, 2001) to this task. Our model works by exploiting regular geometric patterns in speech and non-speech audio spectrograms. These regularities are detectable visually, as demonstrated by the ability of trained observers to identify speech structure (e.g. vowel formant structure, consonant onsets) and musical structure (e.g. note onsets and harmonic pitch structure) through visual inspection of a spectrogram. We demonstrate in this paper that by exploiting geometric regularities in a two-dimensional representation of sound, we are able to obtain good accuracy (88%) for 20ms frame categorization with no built-in prior knowledge and at very low computational cost. When smoothing is employed over the 40 previous frames (800ms), our accuracy rises to 93%. This compares favorably with other models on the same dataset.

Despite being motivated by work in vision, this model is well suited to audio signal processing. Though it treats individual 20ms slices of music as having fixed geometry, it places no limitations on the geometry of entire songs. For example, it places no constraints on song length, nor does it require random access to the audio signal.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. © 2005 Queen Mary, University of London
In other words, this approach is causal and is able to process audio streams online and in real time.

2 The algorithm

In order to build a good binary discriminator, one must first find a set of salient features that separate the two classes with the largest possible margin. To detect objects in an image, Viola and Jones employed a set of simple Haar-like rectangle features (first proposed by Papageorgiou et al. (1998)), depicted in Figure 1. Each feature subtracts the sum of pixels in the white area from the sum of pixels in the black area. The areas can have different shapes and sizes, and can be placed at different x and y coordinates of the image. A discriminator using a single one of these features is called a weak learner because, used alone, it cannot achieve very good discrimination. However, when these features are combined in an additive model, the resulting classifier can perform very well. In their work on two-dimensional images, Viola and Jones showed that with enough features, it is possible to detect complex objects like faces.

2.1 AdaBoost

To additively combine the weak learners, we use the ADABOOST algorithm (Freund and Schapire, 1996),
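The rectangle-difference features above can be evaluated in constant time per feature using an integral image, the same trick Viola and Jones rely on for speed. The sketch below (function names and the two-rectangle layout are ours, not from the paper) shows one such feature — the sum over the black area minus the sum over the white area — computed on a spectrogram patch:

```python
import numpy as np

def integral_image(spec):
    # Cumulative sums over both axes: any rectangular sum can then
    # be recovered with at most four lookups.
    return spec.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, height, width):
    # Sum of spec[top:top+height, left:left+width] via the integral image.
    total = ii[top + height - 1, left + width - 1]
    if top > 0:
        total -= ii[top - 1, left + width - 1]
    if left > 0:
        total -= ii[top + height - 1, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

def two_rect_feature(ii, top, left, height, width):
    # Two-rectangle Haar-like feature: sum over the black (right) half
    # minus the sum over the white (left) half.
    white = rect_sum(ii, top, left, height, width)
    black = rect_sum(ii, top, left + width, height, width)
    return black - white
```

A weak learner then thresholds this single scalar; boosting reweights the training frames and combines many such thresholded features into the final additive classifier.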