A DYNAMIC PROGRAMMING APPROACH TO SPEECH/MUSIC DISCRIMINATION OF RADIO RECORDINGS

Aggelos Pikrakis, Theodoros Giannakopoulos and Sergios Theodoridis
Dept. of Informatics and Telecommunications, University of Athens, Greece
e-mail: {pikrakis, tyiannak, stheodor}@di.uoa.gr, URL: http://www.di.uoa.gr/dsp

ABSTRACT

This paper treats speech/music discrimination of radio recordings as a maximization task, where the solution is obtained by means of dynamic programming. The proposed method seeks the sequence of segments and respective class labels (i.e., speech/music) that maximizes the product of posterior class label probabilities, given the data within the segments. To this end, a Bayesian Network combiner is embedded as a posterior probability estimator. Tests have been performed on a large set of radio recordings spanning several music genres. The experiments show that the proposed scheme achieves an overall performance of 92.32%. Results are also reported on a genre basis, and a comparison with existing methods is given.

1. INTRODUCTION

Speech/music discrimination refers to the problem of segmenting an audio stream and labeling each segment as either speech or music. Since the first attempts in the mid-1990s, a number of speech/music discrimination systems have been proposed in various application fields.

In [1], a real-time technique for speech/music discrimination was proposed, focusing on the automatic monitoring of radio stations and using features related to the short-term energy and zero-crossing rate (ZCR). In [2], thirteen audio features were used to train different types of multidimensional classifiers, such as a Gaussian MAP estimator and a nearest-neighbor classifier. In [3], energy, ZCR and fundamental frequency were used as features for the analysis of on-line audiovisual data; segmentation/classification was achieved by means of a procedure based on heuristic rules.
A framework based on a combination of standard Hidden Markov Models (HMMs) and Multilayer Perceptrons (MLPs) was used in [4] for speech/music discrimination of broadcast news. An AdaBoost-based algorithm, applied on the spectrogram of the audio samples, was used in [5] for frame-level discrimination of speech and music. In [6], energy and ZCR were employed as features, and classification was achieved by means of a set of heuristic criteria in an attempt to exploit the nature of speech and music signals.

The majority of the previously described methods deal with speech/music discrimination in two separate steps: first, the audio signal is split into segments by detecting abrupt changes in the signal statistics; at a second step, the extracted segments are classified as speech or music by means of standard classification schemes. The work in [4] differs in the sense that the two tasks are performed jointly by means of a standard HMM, where, for each state, an MLP is used as an estimator of the continuous observation densities required by the HMM.

The method that we propose in this paper formulates speech/music discrimination as a maximization task. In other words, the method seeks the sequence of segments and respective class labels (i.e., speech/music) that maximizes the product of posterior (class label) probabilities, given the segment data. The required posterior probabilities are estimated by a Bayesian Network (BN) combiner, which is trained for this purpose. Since an exhaustive search for the solution is computationally unrealistic, we resort to dynamic programming to solve the maximization task.

Section 2 describes the feature extraction stage. Section 2.1 formulates speech/music discrimination as a maximization task and provides a dynamic programming solution. The BN combiner architecture and related issues are given in Section 2.2.
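To make the maximization concrete: searching over all segmentations and label sequences can be organized as a dynamic program over candidate segment end-points. The sketch below is an illustration of this general idea, not the authors' implementation; the `posterior(i, j, c)` function is a hypothetical stand-in for the trained BN combiner, and maximizing the sum of log-posteriors is equivalent to maximizing the product of posteriors.

```python
import math

def best_segmentation(T, posterior, max_len):
    """Find the segmentation of frames 0..T-1 and its class labels that
    maximize the sum of log-posteriors (i.e., the product of posteriors).

    posterior(i, j, c) is assumed to return P(class c | frames i..j),
    with c in {'speech', 'music'} -- a stand-in for the BN combiner.
    """
    # best[t] = (score, backpointer) for the optimal segmentation of frames 0..t-1
    best = [(-math.inf, None)] * (T + 1)
    best[0] = (0.0, None)
    for t in range(1, T + 1):
        for seg_len in range(1, min(max_len, t) + 1):
            s = t - seg_len  # candidate segment covers frames s..t-1
            for c in ('speech', 'music'):
                score = best[s][0] + math.log(posterior(s, t - 1, c))
                if score > best[t][0]:
                    best[t] = (score, (s, c))
    # Backtrack to recover (start_frame, end_frame, label) triples.
    segments, t = [], T
    while t > 0:
        s, c = best[t][1]
        segments.append((s, t - 1, c))
        t = s
    return list(reversed(segments))
```

With a toy posterior that scores a segment by the fraction of its frames belonging to each class, the optimum recovers the true per-frame labels, since any mixed segment is penalized under both labels.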
The datasets that we have used, the method's performance (both average and on a radio-genre basis), as well as a comparison with other approaches, are presented in Section 3.

2. FEATURE EXTRACTION

At a first step, the audio recording is broken into a sequence of non-overlapping short-term frames and five audio features are extracted per frame. At the end of this feature extraction stage, the audio recording is represented by a sequence $F$ of five-dimensional feature vectors, i.e., $F = \{O_1, O_2, \ldots, O_T\}$, where $T$ is the number of short-term frames. The specific choice of features was the result of extensive experimentation. It must be emphasized that this is not an optimal feature set in any sense, and other choices may also be applicable. If $\{x(0), x(1), \ldots, x(N-1)\}$ is the set of samples of a short-term frame, then the adopted features are given by:

1. Short-term Energy: This is a popular feature, defined by the equation $E = \frac{1}{N}\sum_{n=0}^{N-1} x^2(n)$.

2. Chroma-vector based features: The chroma vector has been widely used in various music information retrieval applications, e.g., [7]. It can be computed from the magnitude of the DFT of each short-term window, if the DFT coefficients are grouped into 12 bins, where each bin represents one of the 12 pitch classes of western-type music (semitone spacing). In this paper, two sequences of chroma vectors are extracted, using different mid-term window sizes. Each chroma sequence serves as the basis to compute a feature value over time, namely:

Chroma-based Feature 1: The audio stream is parsed with a non-overlapping mid-term window of length 100 msecs. For each window, the chroma vector is extracted and the standard deviation of its twelve coefficients is computed, yielding a one-dimensional feature over time. Our study revealed that the mean value of this feature is

©2007 EURASIP. 15th European Signal Processing Conference (EUSIPCO 2007), Poznan, Poland, September 3-7, 2007, copyright by EURASIP
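As an illustration of the two features above, the following sketch computes the short-term energy and a chroma vector by grouping DFT magnitudes into 12 pitch-class bins. It is not the paper's implementation: the naive DFT, the A0 = 27.5 Hz reference frequency, and the sampling rate in the usage below are assumptions made to keep the example self-contained.

```python
import cmath
import math

def short_term_energy(frame):
    """E = (1/N) * sum_{n=0}^{N-1} x^2(n), as in the paper."""
    return sum(x * x for x in frame) / len(frame)

def chroma_vector(frame, fs, f_ref=27.5):
    """Group DFT magnitudes into 12 pitch-class bins (semitone spacing).

    f_ref = 27.5 Hz (A0) is an assumed reference, so pitch class 0 is A.
    A naive O(N^2) DFT is used to avoid external dependencies.
    """
    N = len(frame)
    chroma = [0.0] * 12
    for k in range(1, N // 2):  # skip DC, keep positive frequencies
        f = k * fs / N
        mag = abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N)
                      for n in range(N)))
        pitch_class = round(12 * math.log2(f / f_ref)) % 12
        chroma[pitch_class] += mag
    return chroma

def chroma_std(chroma):
    """Standard deviation of the 12 chroma coefficients (Feature 1)."""
    m = sum(chroma) / 12
    return math.sqrt(sum((c - m) ** 2 for c in chroma) / 12)
```

For instance, a pure 440 Hz sine (the pitch A) concentrates its chroma energy in pitch class 0 under the A0 reference, and a full-cycle sine of unit amplitude has short-term energy 0.5.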