Herrera et al. Percussion-related semantic descriptors of music audio files
AES 25th International Conference, London, United Kingdom, 2004 June 17–19

PERCUSSION-RELATED SEMANTIC DESCRIPTORS OF MUSIC AUDIO FILES

PERFECTO HERRERA¹, VEGARD SANDVOLD², AND FABIEN GOUYON¹

¹ Universitat Pompeu Fabra, Barcelona, Spain
pherrera@iua.upf.es, fgouyon@iua.upf.es
² University of Oslo, Oslo, Norway
vegardsa@ifi.uio.no

Automatic extraction of semantic music content descriptors has traditionally focused on melodic, rhythmic and harmonic aspects. In this paper, we present several music content descriptors that are related to percussion instrumentation. The “percussion index” estimates the amount of percussion in a music audio file and expresses it as a numerical or categorical value. A further refinement is the “percussion profile”, which roughly indicates the balance between drums and cymbals. Finally, we present the percussivity descriptor, which represents the overall impulsiveness or abruptness of the percussive events. Data from initial evaluations, both objective and subjective, are also presented and discussed.

INTRODUCTION

Automatic extraction of music content metadata has traditionally focused on melodic, rhythmic and harmonic aspects. In contrast, timbre or instrumentation descriptors have traditionally been missing from research agendas. Among the most content-informative instrumentation-related features, we find those that can be extracted by focusing on percussive events.

The mainstream approach to music content processing from audio files is the one we term the transcriptionist approach. According to this approach, describing music content equates to extracting a score-like representation of the original audio. Source separation is also a natural strategy under this approach.
However, using a score as the ground truth against which to match the output of a content processing system makes sense only when the intended user of the system is musically educated. This is not the case for most users of existing music downloading systems, who probably account for more than ninety percent of the users of music retrieval systems.

In contrast with this transcriptionist view, we advocate here a descriptionist approach, which has also been advocated elsewhere by Martin et al. [1] and by Carreras and Leman [2]. The descriptionist approach is an ecological and user-centered way of addressing the description of music contents from audio files. It is an ecological approach because the research context is that of a system in use, the structure and functionalities of which will pose specific problems and will shape the knowledge structures of the users. It is a user-centered approach because the attempted solutions spring from the users' needs and requirements, and not from a pre-existing music-theoretical construct.

According to this approach, we advance several percussion-related descriptors that we have named percussion index, percussion profile, kick-snare crossings, and percussivity. They do not correspond to solid music-theoretical entities, but we suggest that they do correspond to entities that are (or can be) represented in the minds of the users of music information retrieval systems. Because of that, they can be exploited, one by one or in combination, to define and refine query and retrieval operations on music files.

Although percussion has traditionally been the poor relative in music and signal processing research, in the last two years we have witnessed a growing number of papers on it, most of them focused on transcription. Goto and Muraoka [3] studied drum sound classification in the context of source separation and beat tracking [4].
They implemented an “energy profile”-based snare-kick discriminator, though no evaluation of its effectiveness was provided. More recently, Zils et al. [5] reported very good performance in identifying kicks and snares in songs by means of an analysis technique with incremental refinement of synthesis that was originally developed by Gouyon [6]. Jørgensen [7] attempted to use cross-correlation between sound templates extracted from isolated sound recordings and realistic drum-kit recordings. Using this technique, only kicks and snares seem to be detected with some reliability. A very different motivation has been that of Kragtwijk et al. [8], who have presented a 3D virtual drummer that re-creates with synthetic images the playing movements of a real drummer after