Hindawi Publishing Corporation
EURASIP Journal on Audio, Speech, and Music Processing
Volume 2009, Article ID 239892, 14 pages
doi:10.1155/2009/239892

Research Article

A Decision-Tree-Based Algorithm for Speech/Music Classification and Segmentation

Yizhar Lavner (1) and Dima Ruinskiy (1, 2)

(1) Department of Computer Science, Tel-Hai College, Tel-Hai 12210, Israel
(2) Israeli Development Center, Intel Corporation, Haifa 31015, Israel

Correspondence should be addressed to Yizhar Lavner, yizhar l@kyiftah.org.il

Received 10 September 2008; Revised 5 January 2009; Accepted 27 February 2009

Recommended by Climent Nadeu

We present an efficient algorithm for segmentation of audio signals into speech or music. The central motivation for our study is consumer audio applications, where various real-time enhancements are often applied. The algorithm consists of a learning phase and a classification phase. In the learning phase, predefined training data is used to compute various time-domain and frequency-domain features, separately for speech and music signals, and to estimate the optimal speech/music thresholds based on the probability density functions of the features. An automatic procedure is employed to select the best features for separation. In the test phase, initial classification is performed for each segment of the audio signal using a three-stage sieve-like approach, applying both Bayesian and rule-based methods. To avoid erroneous rapid alternations in the classification, a smoothing technique is applied, averaging the decision on each segment with past segment decisions. Extensive evaluation of the algorithm, on a database of more than 12 hours of speech and more than 22 hours of music, showed correct identification rates of 99.4% and 97.8%, respectively, and quick adjustment to alternating speech/music sections. In addition to its accuracy and robustness, the algorithm can be easily adapted to different audio types, and is suitable for real-time operation.
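The smoothing step described in the abstract — averaging each segment's decision with those of preceding segments to suppress spurious alternations — can be illustrated with a minimal sketch. The window length and the majority-vote threshold below are illustrative assumptions, not the paper's exact parameters:

```python
def smooth_decisions(raw_decisions, window=5):
    """Smooth per-segment binary decisions (1 = speech, 0 = music)
    by averaging each decision with those of the preceding segments,
    suppressing isolated misclassifications.

    Illustrative sketch: the window size and 0.5 majority threshold
    are assumptions, not the values used in the paper.
    """
    smoothed = []
    for i in range(len(raw_decisions)):
        # Average over the current decision and up to (window - 1) past ones.
        past = raw_decisions[max(0, i - window + 1):i + 1]
        avg = sum(past) / len(past)
        smoothed.append(1 if avg >= 0.5 else 0)
    return smoothed
```

For example, a single spurious music decision inside a run of speech segments is overridden: `smooth_decisions([1, 1, 0, 1, 1, 1], window=3)` yields `[1, 1, 1, 1, 1, 1]`. Larger windows give stronger smoothing at the cost of slower adjustment to genuine speech/music transitions.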
Copyright © 2009 Y. Lavner and D. Ruinskiy. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

In the past decade a vast amount of multimedia data, such as text, images, video, and audio, has become available. Efficient organization and manipulation of this data are required for many tasks, such as data classification for storage or navigation, differential processing according to content, searching for specific information, and many others. A large portion of the data is audio, from sources such as broadcasting channels, databases, internet streams, and commercial CDs. To meet the fast-growing demand for handling this data, a new field of research, known as audio content analysis (ACA), or machine listening, has recently emerged, with the purpose of analyzing audio data and extracting the content information directly from the acoustic signal [1], to the point of creating a "Table of Contents" [2]. Audio data (e.g., from broadcasting) often contains alternating sections of different types, such as speech and music. Thus, one of the fundamental tasks in manipulating such data is speech/music discrimination and segmentation, which is often the first step in processing the data. Such preprocessing is desirable for applications requiring accurate demarcation of speech, for instance automatic transcription of broadcast news, speech and speaker recognition, and word or phrase spotting. Similarly, it is useful in applications where attention is given to music, for example, genre-based or mood-based classification. Speech/music classification is also important for applications that apply differential processing to audio data, such as content-based audio coding and compression or automatic equalization of speech and music.
Finally, it can also serve for indexing other data, for example, classification of video content through the accompanying audio. One of the challenges in speech/music discrimination is characterization of the music signal. Speech is composed of a selection of fairly typical sounds and, as such, can be represented well by relatively simple models. On the other hand, the assortment of sounds in music is much broader