Hindawi Publishing Corporation
EURASIP Journal on Audio, Speech, and Music Processing
Volume 2009, Article ID 239892, 14 pages
doi:10.1155/2009/239892
Research Article
A Decision-Tree-Based Algorithm for Speech/Music
Classification and Segmentation
Yizhar Lavner¹ and Dima Ruinskiy¹,²
¹Department of Computer Science, Tel-Hai College, Tel-Hai 12210, Israel
²Israeli Development Center, Intel Corporation, Haifa 31015, Israel
Correspondence should be addressed to Yizhar Lavner, yizharl@kyiftah.org.il
Received 10 September 2008; Revised 5 January 2009; Accepted 27 February 2009
Recommended by Climent Nadeu
We present an efficient algorithm for segmentation of audio signals into speech or music. The central motivation for our study
is consumer audio applications, where various real-time enhancements are often applied. The algorithm consists of a learning
phase and a classification phase. In the learning phase, predefined training data is used for computing various time-domain and
frequency-domain features, for speech and music signals separately, and estimating the optimal speech/music thresholds, based
on the probability density functions of the features. An automatic procedure is employed to select the best features for separation.
In the classification phase, an initial classification is performed for each segment of the audio signal, using a three-stage sieve-like approach,
applying both Bayesian and rule-based methods. To avoid erroneous rapid alternations in the classification, a smoothing technique
is applied, averaging the decision on each segment with past segment decisions. Extensive evaluation of the algorithm on a database
of more than 12 hours of speech and more than 22 hours of music showed correct identification rates of 99.4% and 97.8%,
respectively, and quick adjustment to alternating speech/music sections. In addition to its accuracy and robustness, the algorithm
can be easily adapted to different audio types, and is suitable for real-time operation.
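The smoothing step described above can be sketched as a simple moving average over binary per-segment decisions; the function below is an illustrative reconstruction, not the paper's exact procedure, and the window size and threshold are assumed parameters.

```python
def smooth_decisions(raw, window=5, threshold=0.5):
    """Smooth per-segment speech/music decisions (1 = speech, 0 = music)
    by averaging each raw decision with up to window - 1 past decisions
    and re-thresholding the mean. Parameter values are illustrative."""
    smoothed = []
    for i in range(len(raw)):
        past = raw[max(0, i - window + 1): i + 1]  # current + past decisions
        mean = sum(past) / len(past)
        smoothed.append(1 if mean >= threshold else 0)
    return smoothed

# A single spurious flip inside a run of speech segments is suppressed:
print(smooth_decisions([1, 1, 0, 1, 1, 1]))  # [1, 1, 1, 1, 1, 1]
```

Averaging only with past decisions (rather than a centered window) keeps the method causal, which matters for the real-time operation the abstract claims.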
Copyright © 2009 Y. Lavner and D. Ruinskiy. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
1. Introduction
In the past decade a vast amount of multimedia data, such
as text, images, video, and audio, has become available.
Efficient organization and manipulation of this data are
required for many tasks, such as data classification for storage
or navigation, differential processing according to content,
searching for specific information, and many others.
A large portion of the data is audio, from resources such
as broadcasting channels, databases, internet streams, and
commercial CDs. To meet the fast-growing demand for
handling the data, a new field of research, known as audio
content analysis (ACA), or machine listening, has recently
emerged, with the purpose of analyzing the audio data and
extracting the content information directly from the acoustic
signal [1] to the point of creating a “Table of Contents” [2].
Audio data (e.g., from broadcasting) often contains
alternating sections of different types, such as speech and
music. Thus, one of the fundamental tasks in manipulating
such data is speech/music discrimination and segmentation,
which is often the first step in processing the data. Such
preprocessing is desirable for applications requiring accurate
demarcation of speech, for instance automatic transcription
of broadcast news, speech and speaker recognition, word
or phrase spotting, and so forth. Similarly, it is useful in
applications where attention is given to music, for example,
genre-based or mood-based classification.
Speech/music classification is also important for applica-
tions that apply differential processing to audio data, such as
content-based audio coding and compressing or automatic
equalization of speech and music. Finally, it can also serve
for indexing other data, for example, classification of video
content through the accompanying audio.
One of the challenges in speech/music discrimination
is characterization of the music signal. Speech is composed
of a selection of fairly typical sounds and, as such, can be
represented well by relatively simple models. On the other
hand, the assortment of sounds in music is much broader