SPEECH AND MUSIC CLASSIFICATION IN AUDIO DOCUMENTS

Julien Pinquier, Christine Sénac and Régine André-Obrecht
Institut de Recherche en Informatique de Toulouse, UMR 5505 CNRS – INP – UPS
118, Route de Narbonne, 31062 Toulouse Cedex 04, FRANCE
http://www.irit.fr/ACTIVITES/EQ_ARTPS
{pinquier, senac, obrecht}@irit.fr

ABSTRACT

To index the soundtrack of multimedia documents efficiently, it is necessary to extract elementary and homogeneous acoustic segments. In this paper, we explore such a prior partitioning, which consists in detecting the two basic components, speech and music. The originality of this work is that music and speech are not treated as the two classes of a single classifier: instead, two classification systems are defined independently, a speech/non-speech one and a music/non-music one. This approach makes it possible to better characterize and discriminate each component: in particular, two different feature spaces are necessary, as are two pairs of Gaussian mixture models. Moreover, the acoustic signal is divided into four types of segments: speech, music, speech-music and other. The experiments are performed on the soundtracks of audiovisual documents (films, TV sport broadcasts). The performance proves the interest of this approach, called the Differentiated Modeling Approach.

1. INTRODUCTION

With the fast growth of audio and multimedia information, the number of documents such as broadcast radio and television increases greatly, and the development of technologies for spoken document indexing and retrieval is in full expansion. Commonly, to describe a sound document, key words, key sounds (jingles) or melodies are semi-automatically extracted and speakers are detected; more recently, the problem of topic retrieval has been studied [1]. Nevertheless, all these detection systems presuppose the extraction of elementary and homogeneous acoustic components.
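As outlined in the abstract, the two independent binary detectors yield four segment types when their decisions are combined. A minimal sketch of that fusion step (the function name and boolean-decision interface are illustrative assumptions, not part of the original system):

```python
def fuse_labels(is_speech: bool, is_music: bool) -> str:
    """Combine the independent speech/non-speech and music/non-music
    decisions into one of the four segment types.
    Hypothetical helper: the paper defines the four labels, not this API."""
    if is_speech and is_music:
        return "speech-music"
    if is_speech:
        return "speech"
    if is_music:
        return "music"
    return "other"


# Example: a segment accepted by both detectors is labeled "speech-music".
label = fuse_labels(is_speech=True, is_music=True)
```

Because the two detectors are independent, neither rejection forces the other's decision; the "other" label simply falls out when both reject.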
When a study addresses speech [2] (respectively music [3]) indexing, speech (respectively music) segments are selected and the other segments are rejected. Of course, the two detections are not studied with the same attention. In this paper, we explore a prior partitioning which consists in detecting the two basic components, speech and music, with equal performance. For that purpose, music and speech are not considered as two classes of a single classifier (two classification systems are defined independently, a speech/non-speech one and a music/non-music one) and, moreover, two observation spaces are used. We call this approach the Differentiated Modeling Approach to emphasize that each component must be characterized and discriminated through its own feature space and its own statistical modeling. The first part of this paper points out the benefit of Differentiated Modeling; we specify the two feature spaces and the use of Gaussian mixture models in each case. In the second part, we describe experiments performed on the soundtracks of audiovisual documents (TV movies, sport reports). The performance proves the interest of this approach.

2. THE SEPARATION OF SPEECH AND MUSIC

As said previously, in a classic approach to discriminating between speech and music components, the choice is binary. Among the classic methods, some authors from the music community have given greater importance to features which increase this binary discrimination: for example, the zero crossing rate and the spectral centroid are used to separate voiced speech from noisy sounds [4], and the variation of the spectrum magnitude (the spectral "flux") attempts to detect harmonic continuity [5]. Authors who study automatic speech processing have preferred cepstral features [2]. Two concurrent classification frameworks are usually investigated: the Gaussian Mixture Model framework and the k-nearest-neighbor one [6].
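Two of the classic features mentioned above admit very short frame-level definitions. The sketch below is a plain-Python illustration under common textbook definitions (frame windowing, normalization, and sampling details are omitted and would vary by implementation); it is not the feature extraction used in the cited systems:

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ.
    High for noisy/unvoiced sounds, low for voiced speech."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0.0) != (b >= 0.0))
    return crossings / (len(frame) - 1)


def spectral_flux(prev_mag, mag):
    """Euclidean distance between successive magnitude spectra.
    Small values indicate spectral (e.g. harmonic) continuity."""
    return sum((m - p) ** 2 for m, p in zip(mag, prev_mag)) ** 0.5


# A signal alternating in sign every sample has the maximal ZCR of 1.0.
zcr = zero_crossing_rate([1.0, -1.0, 1.0, -1.0])
```

In a speech/music discriminator these values would typically be computed per short-time frame and then summarized (mean, variance) over a longer segment before classification.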