Music Classification using Extreme Learning Machines

Simone Scardapane*, Danilo Comminiello*, Michele Scarpiniti* and Aurelio Uncini*
*Department of Information Engineering, Electronics and Telecommunications (DIET), "Sapienza" University of Rome, Via Eudossiana 18, 00184, Rome.
Emails: {simone.scardapane, danilo.comminiello, michele.scarpiniti}@uniroma1.it; aurel@ieee.org

Abstract—Over the last years, automatic music classification has become a standard benchmark problem in the machine learning community. This is due partly to its inherent difficulty, and partly to the impact that a fully automated classification system can have in commercial applications. In this paper we test the efficiency of a relatively new learning tool, Extreme Learning Machines (ELM), on several classification tasks over publicly available song datasets. ELM is gaining increasing attention, due to its versatility and the speed with which it adapts its internal parameters. Since both of these attributes are fundamental in music classification, ELM provides a good alternative to standard learning models. Our results support this claim, showing a sustained gain of ELM over a feedforward neural network architecture. In particular, ELM greatly decreases the computational training time, while always achieving higher or comparable accuracy.

I. INTRODUCTION

The increased availability of musical content, and of user-generated annotations associated to that content, has made Automatic Music Retrieval (AMR) a tool of fundamental importance for music applications. As an example, Spotify (https://www.spotify.com/), one of the biggest web applications for music streaming, announced last year that it had reached an overall catalog of more than 20 million songs. Selecting songs from such a database so as to provide a good experience to the end users is extremely challenging. AMR is hence the problem of efficiently retrieving songs that may be of interest to the end users, depending on a given set of predefined criteria.
Automatic Music Classification (AMC) is one of the main problems in AMR. Clearly, as long as we are able to correctly classify a set of songs, we can use the resulting groups as a tool to satisfy a user-defined query. Each song may be classified along several dimensions of interest, including genre, perceived mood, artist, presence of a given instrument, and several others. Fu et al. [1] provide an interesting overview of the field, reviewing most of the relevant papers and techniques. Despite all the efforts, however, results are still far from optimal, due to the inherent difficulty of the problem. Consider for example the following aspects:

1) A standard audio file comprises several thousands of samples, subdivided into one, two or more channels. Although there exists a large set of possible features that can be extracted from a single track, it is a difficult task to select an optimal subset with respect to the task at hand. We delve into this point in more detail in Section II.
2) In some cases, the task may be challenging even for a human expert, due to the high degree of subjectivity and of required knowledge involved. This is evident, for example, in the case of genre classification.
3) Moreover, good accuracy may require large databases of thousands of songs. This results in several gigabytes of data to be processed, hence imposing a strong computational effort for the training of the learning models.

All these aspects are worsened when we include in our data the user-generated content relative to each song. Consider again Spotify: being a social website, each track is typically annotated with genre, artist, tags and other related information by many users of the application. Moreover, data from several websites may be easily retrieved and aggregated using the provided programming interfaces. Overall, this amounts to an extremely large mass of information on which efficient data mining is challenging.
These reasons are making AMC tasks an interesting benchmark for machine learning tools. For example, the MIREX challenge [2] has seen constant growth over the last years, and today comprises more than fifteen different tasks regarding AMC. In this paper we test Extreme Learning Machines (ELM) [3] on several audio-related benchmarks. ELM is a relatively new learning technique that we believe is of great interest for audio classification. In particular, ELM models are highly versatile (providing a unified solution for both multi-class classification and regression), and are much faster to train than standard models such as neural networks. The main idea of ELM is to project the original input into a high-dimensional feature space, where a linear model is subsequently applied. The peculiarity is that this new space is fully fixed before observing the data, hence the actual learning consists of a simple linear regression that can be computed efficiently in closed form.

At this time, we are aware of only two works that have used ELM for music classification. In [4], ELM is applied to the problem of genre classification, on an author-generated dataset. Out of nine tests, ELM achieves a greater average accuracy than a standard Support Vector Machine. Then, in [5] the authors tested ELM for the classification of Han Chinese folk songs, together with a novel musical encoding method they call MFDMap. However, no comparisons are made with other classifiers. Thus, no work has been done up to now

8th International Symposium on Image and Signal Processing and Analysis (ISPA 2013), September 4-6, 2013, Trieste, Italy. Signal Processing: Speech, Music, and Audio Processing.