Music Classification using Extreme Learning Machines

Simone Scardapane*, Danilo Comminiello*, Michele Scarpiniti* and Aurelio Uncini*
*Department of Information Engineering, Electronics and Telecommunications (DIET), "Sapienza" University of Rome, Via Eudossiana 18, 00184, Rome.
Emails: {simone.scardapane, danilo.comminiello, michele.scarpiniti}@uniroma1.it; aurel@ieee.org

Abstract—Over the last years, automatic music classification has become a standard benchmark problem in the machine learning community. This is due partly to its inherent difficulty, and partly to the impact that a fully automated classification system can have in commercial applications. In this paper we test the efficiency of a relatively new learning tool, Extreme Learning Machines (ELM), on several classification tasks over publicly available song datasets. ELM is gaining increasing attention, due to its versatility and the speed with which it adapts its internal parameters. Since both of these attributes are fundamental in music classification, ELM provides a good alternative to standard learning models. Our results support this claim, showing a sustained gain of ELM over a feedforward neural network architecture. In particular, ELM greatly decreases the computational training time, while always achieving higher or comparable accuracy.

I. INTRODUCTION

The increased availability of musical content, and of user-generated annotations associated to that content, has made Automatic Music Retrieval (AMR) a tool of fundamental importance for music applications. As an example, Spotify (https://www.spotify.com/), one of the biggest web applications for music streaming, announced last year that it had reached an overall catalog of more than 20 million songs. Selecting songs from such a database so as to provide a good experience to the end users is extremely challenging. AMR is hence the problem of efficiently retrieving songs that may be of interest to the end users, depending on a given set of predefined criteria.
Automatic Music Classification (AMC) is one of the main problems in AMR. Clearly, as long as we are able to correctly classify a set of songs, we can use the resulting groups as a tool to satisfy a user-defined query. Each song may be classified along several dimensions of interest, including genre, perceived mood, artist, presence of a given instrument, and several others. Fu et al. [1] provide an interesting overview of the field, reviewing most of the relevant papers and techniques. Despite all the efforts, however, results are still far from optimal, due to the inherent difficulty of the problem. Consider for example the following aspects:

1) A standard audio file comprises several thousands of samples, subdivided into one, two or more channels. Although there exists a large set of possible features that can be extracted from a single track, it is a difficult task to select an optimal subset with respect to the task at hand. We delve into this point in more detail in Section II.
2) In some cases, the task may be challenging even for a human expert, due to the high degree of subjectivity and of required knowledge involved. This is evident, for example, in the case of genre classification.
3) Moreover, good accuracy may require large databases of thousands of songs. This results in several gigabytes of data to be processed, hence imposing a strong computational effort for the training of the learning models.

All these aspects are worsened when we include in our data the user-generated content relative to each song. Consider again Spotify: being a social website, each track is typically annotated with genre, artist, tags and other related information by many users of the application. Moreover, data from several websites may be easily retrieved and aggregated using the provided programming interfaces. Overall, this amounts to an extremely large mass of information on which efficient data mining is challenging.
These reasons are making AMC tasks an interesting benchmark for machine learning tools. For example, the MIREX challenge [2] has seen constant growth over the last years, and today comprises more than fifteen different tasks regarding AMC. In this paper we test Extreme Learning Machines (ELM) [3] on several audio-related benchmarks. ELM is a relatively new learning technique that we believe is of great interest for audio classification. In particular, ELM models are highly versatile (providing a unified solution for both multi-class classification and regression), and are much faster to train than standard models such as neural networks. The main idea of ELM is to project the original input into a high-dimensional feature space, where a linear model is subsequently applied. The peculiarity is that this new space is fully fixed before observing the data, hence the actual learning consists of a simple linear regression that can be computed efficiently in closed form.

At this time, we are aware of only two works that have used ELM for music classification. In [4], ELM is applied to the problem of genre classification, on an author-generated dataset. Out of nine tests, ELM achieves a greater average accuracy than a standard Support Vector Machine. Then, in [5] the authors tested ELM for the classification of Han Chinese folk songs, together with a novel musical encoding method they call MFDMap. However, no comparisons are made with other classifiers. Thus, no work has been done up to now

8th International Symposium on Image and Signal Processing and Analysis (ISPA 2013), September 4-6, 2013, Trieste, Italy. Signal Processing: Speech, Music, and Audio Processing.