Hindawi Publishing Corporation
EURASIP Journal on Audio, Speech, and Music Processing
Volume 2009, Article ID 239892, 14 pages
doi:10.1155/2009/239892
Research Article
A Decision-Tree-Based Algorithm for Speech/Music
Classification and Segmentation
Yizhar Lavner¹ and Dima Ruinskiy¹,²
¹Department of Computer Science, Tel-Hai College, Tel-Hai 12210, Israel
²Israeli Development Center, Intel Corporation, Haifa 31015, Israel
Correspondence should be addressed to Yizhar Lavner, yizharl@kyiftah.org.il
Received 10 September 2008; Revised 5 January 2009; Accepted 27 February 2009
Recommended by Climent Nadeu
We present an efficient algorithm for segmentation of audio signals into speech or music. The central motivation for our study
is consumer audio applications, where various real-time enhancements are often applied. The algorithm consists of a learning
phase and a classification phase. In the learning phase, predefined training data is used for computing various time-domain and
frequency-domain features, for speech and music signals separately, and estimating the optimal speech/music thresholds, based
on the probability density functions of the features. An automatic procedure is employed to select the best features for separation.
In the classification phase, an initial classification is performed for each segment of the audio signal, using a three-stage sieve-like approach,
applying both Bayesian and rule-based methods. To avoid erroneous rapid alternations in the classification, a smoothing technique
is applied, averaging the decision on each segment with past segment decisions. Extensive evaluation of the algorithm on a database
of more than 12 hours of speech and more than 22 hours of music showed correct identification rates of 99.4% and 97.8%,
respectively, and quick adjustment to alternating speech/music sections. In addition to its accuracy and robustness, the algorithm
can be easily adapted to different audio types, and is suitable for real-time operation.
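The smoothing step described above can be sketched as a simple moving average over binary per-segment decisions; the function below is an illustrative reconstruction, not the paper's exact procedure, and the window size and threshold are assumed parameters.

```python
def smooth_decisions(raw, window=5, threshold=0.5):
    """Smooth per-segment speech/music decisions (1 = speech, 0 = music)
    by averaging each raw decision with up to window - 1 past decisions
    and re-thresholding the mean. Parameter values are illustrative."""
    smoothed = []
    for i in range(len(raw)):
        past = raw[max(0, i - window + 1): i + 1]  # current + past decisions
        mean = sum(past) / len(past)
        smoothed.append(1 if mean >= threshold else 0)
    return smoothed

# A single spurious flip inside a run of speech segments is suppressed:
print(smooth_decisions([1, 1, 0, 1, 1, 1]))  # [1, 1, 1, 1, 1, 1]
```

Averaging only with past decisions (rather than a centered window) keeps the method causal, which matters for the real-time operation the abstract claims.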
Copyright © 2009 Y. Lavner and D. Ruinskiy. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
1. Introduction
In the past decade a vast amount of multimedia data, such
as text, images, video, and audio, has become available.
Efficient organization and manipulation of this data are
required for many tasks, such as data classification for storage
or navigation, differential processing according to content,
searching for specific information, and many others.
A large portion of the data is audio, from resources such
as broadcasting channels, databases, internet streams, and
commercial CDs. To meet the fast-growing demand for
handling the data, a new field of research, known as audio
content analysis (ACA), or machine listening, has recently
emerged, with the purpose of analyzing the audio data and
extracting the content information directly from the acoustic
signal [1] to the point of creating a “Table of Contents” [2].
Audio data (e.g., from broadcasting) often contains
alternating sections of different types, such as speech and
music. Thus, one of the fundamental tasks in manipulating
such data is speech/music discrimination and segmentation,
which is often the first step in processing the data. Such
preprocessing is desirable for applications requiring accurate
demarcation of speech, for instance automatic transcription
of broadcast news, speech and speaker recognition, word
or phrase spotting, and so forth. Similarly, it is useful in
applications where attention is given to music, for example,
genre-based or mood-based classification.
Speech/music classification is also important for applica-
tions that apply differential processing to audio data, such as
content-based audio coding and compressing or automatic
equalization of speech and music. Finally, it can also serve
for indexing other data, for example, classification of video
content through the accompanying audio.
One of the challenges in speech/music discrimination
is characterization of the music signal. Speech is composed
of a selection of fairly typical sounds and, as such, can be
represented well by relatively simple models. On the other
hand, the assortment of sounds in music is much broader