TEMPORAL POOLING AND MULTISCALE LEARNING FOR AUTOMATIC ANNOTATION AND RANKING OF MUSIC AUDIO

Philippe Hamel, Simon Lemieux, Yoshua Bengio and Douglas Eck
DIRO, Université de Montréal
CIRMMT
{hamelphi,lemiesim,bengioy,eckdoug}@iro.umontreal.ca

ABSTRACT

This paper analyzes some of the challenges in performing automatic annotation and ranking of music audio, and proposes a few improvements. First, we motivate the use of principal component analysis on the mel-scaled spectrum. Secondly, we present an analysis of the impact of the selection of pooling functions for summarization of the features over time. We show that combining several pooling functions improves the performance of the system. Finally, we introduce the idea of multiscale learning. By incorporating these ideas in our model, we obtained state-of-the-art performance on the Magnatagatune dataset.

1. INTRODUCTION

In this paper, we consider the tasks of automatic annotation and ranking of music audio. Automatic annotation consists of assigning relevant word descriptors, or tags, to a given music audio clip. Ranking, on the other hand, consists of finding an audio clip that best corresponds to a given tag, or set of tags. These descriptors are able to represent a wide range of semantic concepts such as genre, mood, instrumentation, etc. Thus, a set of tags provides a high-level description of an audio clip. This information is useful for tasks like music recommendation, playlist generation and measuring music similarity.

In order to solve automatic annotation and ranking, we need to build a system that can extract relevant features from music audio and infer abstract concepts from these features. Many content-based music recommendation systems follow the same recipe with minor variations (see [5] for a review). First, some features are extracted from the audio. Then, these features are summarized over time.
Finally, a classification model is trained over the summarized features to obtain tag affinities. We describe several previous approaches that follow these steps and have been applied to the Magnatagatune dataset [13] in Section 3.1. We then present an approach that deviates somewhat from the standard recipe by integrating learning steps before and after the temporal summarization.

This paper has three main contributions. First, we describe a simple adaptive preprocessing procedure for music audio that incorporates only minimal prior knowledge about the nature of music audio. We show that the features obtained through this adaptive preprocessing give competitive results when using a relatively simple classifier. Secondly, we study the impact of the selection and mixing of pooling functions for summarization of the features over time. We introduce the idea of using min-pooling in conjunction with other pooling functions. We show that combining several pooling functions improves the performance of the system. Finally, we incorporate the idea of multiscale learning. In order to do this, we integrate feature learning, time summarization and classification in one deep learning step. Using this method, we obtain state-of-the-art performance on the Magnatagatune dataset.

The paper is organized as follows. First, we motivate our experiments in Section 2. Then, we describe our experimental setup in Section 3. We present and discuss our results in Section 4. Finally, we conclude in Section 5.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. © 2011 International Society for Music Information Retrieval.

2. MOTIVATION

2.1 Choosing the right features

Choosing the right features is crucial for music classification.
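The temporal summarization step described above — pooling frame-level features over time and concatenating the outputs of several pooling functions, including min-pooling — can be sketched as follows. This is a minimal NumPy illustration under assumed array shapes and function names, not the authors' implementation:

```python
import numpy as np

def pool_features(frames, funcs=(np.mean, np.max, np.min)):
    """Summarize a (time, feature) array over the time axis by
    concatenating the outputs of several pooling functions."""
    return np.concatenate([f(frames, axis=0) for f in funcs])

# Hypothetical input: 100 frames of 40-dimensional features.
frames = np.random.randn(100, 40)
summary = pool_features(frames)
print(summary.shape)  # (120,): 3 pooling functions x 40 features
```

Whatever the clip length, concatenating mean-, max- and min-pooling yields a fixed-length summary whose dimensionality is the number of pooling functions times the feature dimension, which is what allows a standard classifier to be trained on top of it.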
Many automatic annotation systems use features such as MFCCs [8, 12] because they have proven their worth in the speech recognition domain. However, music audio differs from speech audio in many ways. Thus, MFCCs, which were engineered for speech analysis, might not be the optimal features for music audio analysis. Alternatives have been proposed to replace MFCCs. Recent work has shown that better classification performance can be achieved by using mel-scaled energy bands of the