SOUND RECOGNITION: A CONNECTIONIST APPROACH

Hadi Harb, Liming Chen
LIRIS CNRS FRE 2672, Ecole Centrale de Lyon
Dépt. Mathématiques Informatiques, 36 avenue Guy de Collongue, 69134 Ecully, France
{Hadi.Harb, Liming.Chen}@ec-lyon.fr

ABSTRACT

This paper presents a general audio classification approach inspired by our modest knowledge of human sound perception. Simple psychoacoustic experiments show that the temporal relation between short-term spectral features has a great impact on human audio classification performance. For instance, short-term spectral features extracted from speech can be perceived as non-speech if they are rearranged in time. We develop this observation into the idea of jointly modelling several consecutive spectral features over relatively long time windows. The resulting modelling scheme, Piecewise Gaussian Modelling (PGM), is combined with a neural network to build a general audio classifier. The classifier was evaluated on speech/music classification, male/female classification, and special-event detection in sports videos. The good classification accuracy obtained encourages further research to improve the model and to connect it more closely with well-known psychoacoustic experimental results.

1. INTRODUCTION

Sound recognition consists of classifying the audio signal into semantic classes. Examples include speech/music classification, speaker recognition, speaker gender recognition, and music genre recognition. Sound recognition is an important step in the emerging MPEG-7 standard. However, to our knowledge, no single technique presented in the literature is effective across several audio classification problems. For instance, a speech/music classification technique is not a good choice for speaker gender recognition. Surprisingly, researchers have typically built audio classification systems with little or no relation to human perception. Since semantic audio classes are created and perceived by humans, it is important to draw on our modest knowledge of human perception when building sound recognition systems. Humans perform all sound recognition tasks using the same features, essentially frequency spectrum-like features, and the same processing machinery, the human cortex [1]. This research aims at providing a general approach to audio classification inspired by the human perception of sound.

2. APPROACH

The sound spectrum is known to be an important feature for audio classification; almost all audio classification systems rely on it. It is also known that the inner ear performs some spectral-like analysis before sending the result to the cortex via the auditory nerves [1]. Technically, however, one spectral vector is typically extracted every 10 ms, which raises a question: are 10 ms of audio sufficient for humans to perform general sound classification? Simple experiments on the human capability for audio classification, for instance speech/music discrimination, show that humans need approximately 200 ms or more to achieve good classification performance. Furthermore, arranging several 10 ms speech segments in a special way in time can give the impression of a non-speech sound (www.ec-lyon.fr/perso/Hadi_Harb/Demos.htm). This leads us to seriously consider the effect of context, that is the relation between short-term audio excerpts, on the perceived class.
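As an illustration of this short-term analysis, the following sketch extracts one spectral vector every 10 ms. It is a minimal example, not the authors' implementation; the frame length, window function, and number of frequency channels are assumptions.

import numpy as np

def spectral_vectors(signal, sr, frame_ms=20, hop_ms=10, n_channels=32):
    # Extract one spectral vector every hop_ms milliseconds.
    # frame_ms, the Hann window, and n_channels are illustrative
    # assumptions, not values taken from the paper.
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    window = np.hanning(frame)
    vectors = []
    for start in range(0, len(signal) - frame + 1, hop):
        spectrum = np.abs(np.fft.rfft(signal[start:start + frame] * window))
        # Pool the FFT bins into a small number of frequency channels.
        bands = np.array_split(spectrum, n_channels)
        vectors.append([band.mean() for band in bands])
    return np.array(vectors)  # shape: (n_frames, n_channels)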
An attempt to model spectral-like features with Gaussian Mixture Models and no context information was made in [2] and [3]. The reported results show that such an approach is not effective for general audio classification. We therefore propose to model several neighbouring spectral vectors with one model, so as to incorporate the context. We have investigated modelling the spectral vectors in relatively large time windows T, for instance T > 250 ms, with a single Gaussian model. That is, in each window T, the mean and the variance of the spectral features in each frequency channel are extracted.
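A minimal sketch of this piecewise Gaussian modelling step is given below, assuming the 10 ms spectral vectors of the previous sketch and a window of 50 vectors (about 500 ms, satisfying T > 250 ms); the window length and function names are illustrative, not taken from the paper.

def pgm_features(vectors, window=50):
    # Piecewise Gaussian Modelling (PGM) sketch: model each
    # non-overlapping window of consecutive spectral vectors with one
    # Gaussian per frequency channel, i.e. a per-channel mean and
    # variance.  window=50 (~500 ms at a 10 ms hop) is an illustrative
    # assumption; the paper only requires T > 250 ms.
    features = []
    for start in range(0, len(vectors) - window + 1, window):
        block = vectors[start:start + window]       # (window, n_channels)
        mu = block.mean(axis=0)                     # per-channel mean
        var = block.var(axis=0)                     # per-channel variance
        features.append(np.concatenate([mu, var]))  # one PGM vector per window
    return np.array(features)

Each resulting mean/variance vector would then serve as one input pattern to the neural network classifier described above.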