Comparing Timbre-based Features for Musical Genre Classification

Martin Hartmann, Pasi Saari, Petri Toiviainen and Olivier Lartillot
Finnish Centre of Excellence in Interdisciplinary Music Research, Department of Music, University of Jyväskylä
[firstname].[lastname]@jyu.fi

ABSTRACT

People can accurately classify music based on its style by listening to less than half a second of audio. This has motivated efforts to build accurate predictive models of musical genre based upon short-time musical descriptions. In this context, perceptually relevant features have been considered crucial, but little research has been conducted in this direction. This study compared two timbral features for supervised classification of musical genres: 1) the Mel-Frequency Cepstral Coefficients (MFCC), which come from the speech domain and are widely used for music modeling purposes; and 2) the more recent Sub-Band Flux (SBF) set of features, which has been designed specifically for modeling human perception of polyphonic musical timbre. Differences in performance between models were found, suggesting that the SBF feature set is more appropriate for musical genre classification than the MFCC set. In addition, spectral fluctuations at both ends of the frequency spectrum were found to be relevant for discrimination between musical genres. The results of this study support the use of perceptually motivated features for musical genre classification.

Introduction

Humans are very accurate at arranging music into genre classes, even when hearing pieces for the first time. Further, even if listeners do not know the correct genre, they can still affirm to which genres a piece of music would definitely not belong. In fact, less than half a second of music provides enough information for people to classify the type of music with great accuracy and to identify other information such as title and artist [1, 2].
This raises the question of how people perceive and recognize musical styles, and what descriptions in the music make it possible to decide categorically that a given song belongs to a specific genre. In other words, how can humans confidently build hypotheses about the style of musical pieces based on such limited evidence? It seems that the vertical structure of the music, or short-time descriptions of polyphonic musical timbre, could help us understand these fascinating perceptual processes. However, it is not easy to build accurate predictive models of higher-level musical knowledge based on timbre descriptions. One reason is the lack of an acoustic explanation of polyphonic timbre. Pitch and loudness can be described as high or low, but musical timbre cannot be directly measured this way, since it is possibly composed of multiple perceptual dimensions [3], such as dryness, brightness, or fullness. A second reason for this difficulty is the indirect path between musical descriptors and what humans actually understand about the musical content. In the particular case of content-based music information retrieval (MIR), this "semantic gap" refers to the insufficiency of low-level information extracted from the musical signal for arranging music based on cultural meanings and interpretations shared by communities [4]. Despite these problems, plenty of approaches to music genre classification have been suggested for more than a decade.

The aim of this study is to compare the performance of two timbre-based features for supervised music genre classification.

Copyright: © 2013 Martin Hartmann et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
The mel-frequency cepstral coefficients (MFCC) [5] come from the speech domain and have been widely used for multiple music modeling purposes, whereas the sub-band flux (SBF) set of features was recently proposed [6] and is designed specifically for modeling polyphonic musical timbre. A main premise of this study is that perceptually relevant timbre-based features can help us better understand the acoustic foundation of polyphonic musical timbre and alleviate the constraints of the semantic gap in music genre classification. The performance of these two descriptors was comprehensively inspected using different data sets, feature combinations, and learning algorithms for feature selection and classification.

1. BACKGROUND

Genre classification is widely studied in MIR, perhaps because musical genres have historically been important in music stores and libraries for categorization based on essential similarities. In the digital era, automatic genre classification offers applications outside scientific areas, for example in radio playlists, music database systems, or content tagging in social networking services.

The task of genre classification has been reviewed, for example, in [7]. A great variety of musical features has been evaluated for music genre classification based on the audio signal. Commonly extracted features are timbral, rhythmic, and melodic [7]. The best results for this task seem to be obtained using timbre-based feature extraction. For example, [8] obtained one of the highest performances for

Proceedings of the Sound and Music Computing Conference 2013, SMC 2013, Stockholm, Sweden 707
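To make the SBF idea concrete, the following is a minimal numpy sketch of sub-band flux: the magnitude spectrum of each short frame is split into frequency sub-bands, and the flux in each band is the Euclidean distance between successive band spectra. All parameter values here (frame length, hop size, number of bands, octave-spaced band edges starting at 50 Hz) are illustrative assumptions for this sketch, not the exact configuration of the SBF feature set in [6].

```python
import numpy as np

def sub_band_flux(signal, sr, frame_len=1024, hop=512, n_bands=10, fmin=50.0):
    """Illustrative sub-band flux: per-band spectral change between frames."""
    # Frame the signal and compute windowed magnitude spectra.
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    spectra = np.array([
        np.abs(np.fft.rfft(window * signal[i * hop:i * hop + frame_len]))
        for i in range(n_frames)
    ])
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    # Octave-spaced band edges: fmin, 2*fmin, 4*fmin, ... (an assumption here).
    edges = fmin * 2.0 ** np.arange(n_bands + 1)
    flux = np.zeros((n_frames - 1, n_bands))
    for b in range(n_bands):
        mask = (freqs >= edges[b]) & (freqs < edges[b + 1])
        band = spectra[:, mask]
        # Euclidean distance between the band spectra of successive frames.
        flux[:, b] = np.sqrt(np.sum(np.diff(band, axis=0) ** 2, axis=1))
    return flux
```

An amplitude-modulated tone, for instance, produces flux concentrated in the sub-band that contains the tone, while bands with no energy show near-zero flux; a genre classifier would then operate on statistics (e.g., the mean) of each band's flux over a longer excerpt.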