A DYNAMIC PROGRAMMING APPROACH TO SPEECH/MUSIC DISCRIMINATION
OF RADIO RECORDINGS
Aggelos Pikrakis, Theodoros Giannakopoulos and Sergios Theodoridis
Dept. of Informatics and Telecommunications, University of Athens, Greece
e-mail: {pikrakis, tyiannak, stheodor}@di.uoa.gr, URL: http://www.di.uoa.gr/dsp
ABSTRACT
This paper treats speech/music discrimination of radio
recordings as a maximization task, where the solution is ob-
tained by means of dynamic programming. The proposed
method seeks the sequence of segments and respective class
labels (i.e., speech/music) that maximize the product of pos-
terior class label probabilities, given the data within the
segments. To this end, a Bayesian Network combiner is embed-
ded as a posterior probability estimator. Tests have been per-
formed using a large set of radio recordings with several mu-
sic genres. The experiments show that the proposed scheme
leads to an overall performance of 92.32%. Experiments are
also reported on a genre basis and a comparison with exist-
ing methods is given.
1. INTRODUCTION
Speech/Music discrimination refers to the problem of seg-
menting an audio stream and labeling each segment as either
speech or music. Since the first attempts in the mid-1990s, a
number of speech/music discrimination systems have been
proposed in various application fields.
In [1], a real-time technique for speech/music discrim-
ination was proposed, focusing on the automatic monitor-
ing of radio stations, using features related to the short-term
energy and zero-crossing rate (ZCR). In [2], thirteen audio
features were used in order to train different types of mul-
tidimensional classifiers, such as a Gaussian MAP estima-
tor and a nearest neighbor classifier. In [3], energy, ZCR
and fundamental frequency were used as features in order
to achieve analysis of on-line audiovisual data. Segmen-
tation/classification was achieved by means of a procedure
based on heuristic rules. A framework based on a combina-
tion of standard Hidden Markov Models and Multilayer Per-
ceptrons (MLP) was used in [4] for speech/music discrim-
ination of broadcast news. An AdaBoost-based algorithm,
applied on the spectrogram of the audio samples, was used
in [5] for frame-level discrimination of speech and music. In
[6], energy and ZCR were employed as features and classifi-
cation was achieved by means of a set of heuristic criteria in
an attempt to exploit the nature of speech and music signals.
The majority of the previously described methods deal
with the problem of speech/music discrimination in two sep-
arate steps: first, the audio signal is split into segments by
detecting abrupt changes in the signal statistics and, in a
second step, the extracted segments are classified as speech or
music by using standard classification schemes. The work
in [4] differs in the sense that the two tasks are performed
jointly by means of a standard HMM, where, for each state,
a MLP is used as an estimator of the continuous observation
densities required by the HMM.
The method that we propose in this paper formulates
speech/music discrimination as a maximization task. In other
words, the method seeks the sequence of segments and the
respective class labels (i.e., speech/music) that maximizes
the product of posterior (class label) probabilities, given the
data within the segments. In order to estimate the required posterior
probabilities, a Bayesian Network (BN) Combiner is trained
and used. Since an exhaustive approach to this solution is
unrealistic, we resort to dynamic programming to solve this
maximization task.
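The maximization described above can be sketched with a standard segmentation-by-dynamic-programming recursion. The sketch below is illustrative, not the paper's exact algorithm: the function names (`best_segmentation`, `log_posterior`) and the segment-length bounds are assumptions, and working in the log domain simply turns the product of posteriors into a sum.

```python
import math

def best_segmentation(T, log_posterior, min_len=1, max_len=50):
    """Dynamic-programming search for the segment boundaries and class
    labels (0 = speech, 1 = music) that maximize the sum of log posterior
    probabilities over all segments (equivalently, their product).
    log_posterior(s, t, label) scores frames s..t-1 under the given label."""
    # best[t] = best achievable score for the first t frames
    best = [-math.inf] * (T + 1)
    best[0] = 0.0
    back = [None] * (T + 1)   # back[t] = (segment start, label) of the last segment
    for t in range(1, T + 1):
        for length in range(min_len, min(max_len, t) + 1):
            s = t - length
            for label in (0, 1):
                score = best[s] + log_posterior(s, t, label)
                if score > best[t]:
                    best[t] = score
                    back[t] = (s, label)
    # Backtrack to recover the (start, end, label) segment sequence
    segments, t = [], T
    while t > 0:
        s, label = back[t]
        segments.append((s, t, label))
        t = s
    return segments[::-1]
```

In the actual system the posterior estimates would come from the trained Bayesian Network combiner; here any callable returning log posteriors can be plugged in.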
Section 2 describes the feature extraction stage. Section
2.1 formulates speech/music discrimination as a maximiza-
tion task and provides a dynamic programming solution. The
BN combiner architecture and related issues are given in Sec-
tion 2.2. The datasets that we have used, the method’s perfor-
mance (both average and on a radio genre basis), as well as
a comparison with other approaches are presented in Section
3.
2. FEATURE EXTRACTION
As a first step, the audio recording is broken into a se-
quence of non-overlapping short-term frames and five au-
dio features are extracted per frame. At the end of this
feature extraction stage, the audio recording is represented
by a sequence F of five-dimensional feature vectors, i.e.,
F = {O_1, O_2, ..., O_T}, where T is the number of short-term
frames. The specific choice of features was the result of ex-
tensive experimentation. It must be emphasized that this is
not an optimal feature set in any sense, and other choices may
also be applicable. If {x(0), x(1),..., x(N − 1)} is the set of
samples of a short-term frame, then the adopted features are
given by:
1. Short-term Energy: This is a popular feature, defined
by the equation E = (1/N) ∑_{n=0}^{N-1} x^2(n).
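The framing step and the energy equation above can be sketched as follows; `frame_signal` and `short_term_energy` are illustrative names, not from the paper, and the choice to drop any trailing partial frame is an assumption.

```python
def frame_signal(samples, frame_len):
    """Split a signal into non-overlapping short-term frames
    (any trailing samples shorter than frame_len are dropped)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def short_term_energy(frame):
    """Short-term energy of one frame: E = (1/N) * sum_{n=0}^{N-1} x^2(n)."""
    return sum(x * x for x in frame) / len(frame)
```

For example, a frame alternating between +1 and -1 has energy exactly 1.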
2. Chroma-Vector based features: The chroma vector has
been widely used in various music information retrieval
applications, e.g., [7]. It can be computed from the mag-
nitude of the DFT of each short-term window, if the DFT
coefficients are grouped into 12 bins, where each bin rep-
resents one of the 12 pitch classes of western-type mu-
sic (semitone spacing). In this paper, two sequences of
chroma vectors are extracted, using different mid-term
window sizes. Each chroma sequence serves as the basis
to compute a feature value over time, namely:
Chroma-based Feature 1: The audio stream is parsed
with a non-overlapping mid-term window of length
100 ms. For each window, the chroma vector is extracted
and the standard deviation of its twelve coefficients is
computed, yielding a one-dimensional feature over time.
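A minimal sketch of this chroma computation is given below. The paper does not specify the exact bin-to-pitch-class mapping, so the standard semitone mapping relative to A4 = 440 Hz is assumed here, and the function names (`chroma_vector`, `chroma_feature_1`) are hypothetical. The direct DFT is written out for self-containment; an FFT would be used in practice.

```python
import cmath
import math
import statistics

def chroma_vector(frame, fs, f_ref=440.0):
    """Chroma vector from DFT magnitudes: each positive-frequency bin is
    mapped to one of the 12 pitch classes (semitone spacing, assumed
    relative to A4 = 440 Hz) and the magnitudes are accumulated per class."""
    N = len(frame)
    chroma = [0.0] * 12
    for k in range(1, N // 2):          # skip DC, keep positive frequencies
        f = k * fs / N                  # bin center frequency in Hz
        mag = abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N)
                      for n in range(N)))
        pitch_class = round(12 * math.log2(f / f_ref)) % 12
        chroma[pitch_class] += mag
    return chroma

def chroma_feature_1(frame, fs):
    """Feature 1: standard deviation of the 12 chroma coefficients."""
    return statistics.pstdev(chroma_vector(frame, fs))
```

A pure 440 Hz tone concentrates its magnitude in pitch class 0, so its chroma standard deviation is large, whereas noise spreads magnitude across all twelve classes and yields a small value; this contrast is what makes the feature useful for discrimination.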
Our study revealed that the mean value of this feature is
©2007 EURASIP 1226
15th European Signal Processing Conference (EUSIPCO 2007), Poznan, Poland, September 3-7, 2007, copyright by EURASIP