TOOLS AND ARCHITECTURE FOR THE EVALUATION OF SIMILARITY MEASURES: CASE STUDY OF TIMBRE SIMILARITY

Jean-Julien Aucouturier
SONY CSL Paris
6, rue Amyot, 75005 Paris, France.

Francois Pachet
SONY CSL Paris
6, rue Amyot, 75005 Paris, France.

ABSTRACT

The systematic testing of the very many parameters and algorithmic variants involved in the design of high-level music descriptors in general, and similarity measures in particular, is a daunting task. It requires building a general architecture that is nearly as complex as a full-fledged music browsing system. In this paper, we report on experiments done in an attempt to improve the performance of the music similarity measure described in [2], using the Cuidado Music Browser ([8]). We report not primarily on the actual results of the evaluation, but rather on the methodology and the various tools that were built to support such a task. We show that many non-technical browsing features are useful at various stages of the evaluation process and, in turn, that some of the tools developed for the expert user can be reintegrated into the Music Browser to benefit the non-technical user.

1. INTRODUCTION

The domain of Electronic Music Distribution has gained worldwide attention recently with progress in middleware, networking and compression. However, its success depends largely on the existence of robust, perceptually relevant music similarity relations. It is only with efficient content management techniques that the millions of music titles produced by our society can be made available to its millions of users.

1.1. Case study: Timbre Similarity

In [2], we proposed to automatically compute similarities between music titles based on their global timbre quality.
Typical examples of timbre similarity as we define it are: a Schumann sonata (“Classical”) and a Bill Evans piece (“Jazz”) are similar because both are romantic piano pieces; a Nick Drake tune (“Folk”), an acoustic tune by the Smashing Pumpkins (“Rock”) and a bossa nova piece by Joao Gilberto (“World”) are similar because they all consist of a simple acoustic guitar and a gentle male voice; etc.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. © 2004 Universitat Pompeu Fabra.

Timbre similarity has seen growing interest in the Music Information Retrieval community lately (see [3, 4, 7, 11], and [1] for a complete review). Each contribution is often yet another instantiation of the same basic pattern recognition architecture, only with different algorithm variants and parameters. The signal is cut into short overlapping frames (usually 20 to 50 ms long, with a 50% overlap), and for each frame a feature vector is computed, which usually consists of Mel-Frequency Cepstral Coefficients (MFCCs). The number of MFCCs is an important parameter, and each author comes up with a different number. Then a statistical model of the MFCCs’ distribution is computed, e.g. K-means or Gaussian Mixture Models (GMMs). Once again, the number of K-means or GMM centres is a much-debated parameter which has received a vast number of answers in the literature. Finally, models are compared with different techniques, e.g. Monte Carlo sampling, Earth Mover’s distance or Asymptotic Likelihood Approximation. All these contributions report encouraging results obtained with little effort, and imply that near-perfect results could be reached simply by fine-tuning the algorithms’ parameters.
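As an illustration of the first and last stages of this architecture, the following minimal Python sketch (NumPy only) cuts a signal into overlapping frames and compares two Gaussian mixture models by Monte Carlo sampling. It is not the implementation of [2]: the function names, the representation of a model as a (weights, means, variances) tuple with diagonal covariances, and the symmetrised log-likelihood form of the distance are all assumptions made here for illustration. Feature extraction (MFCCs) and model training are omitted.

```python
import numpy as np

def frame_signal(signal, frame_len, overlap=0.5):
    """Cut a 1-D signal into overlapping frames, one frame per row."""
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def gmm_logpdf(x, weights, means, variances):
    """Log-density of each row of x under a diagonal-covariance GMM.

    Shapes (hypothetical convention): x (T, N), weights (M,),
    means (M, N), variances (M, N).
    """
    diff = x[:, None, :] - means[None, :, :]                    # (T, M, N)
    log_comp = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    a = log_comp + np.log(weights)                              # (T, M)
    m = a.max(axis=1, keepdims=True)                            # log-sum-exp
    return (m + np.log(np.exp(a - m).sum(axis=1, keepdims=True)))[:, 0]

def gmm_sample(n, weights, means, variances, rng):
    """Draw n points from a diagonal-covariance GMM."""
    comp = rng.choice(len(weights), size=n, p=weights)
    noise = rng.standard_normal((n, means.shape[1]))
    return means[comp] + np.sqrt(variances[comp]) * noise

def monte_carlo_distance(gmm_a, gmm_b, n_samples=2000, seed=0):
    """Symmetrised Monte Carlo distance between two GMMs (assumed form).

    Samples are drawn from each model and scored under both models;
    the distance is the self-likelihood minus the cross-likelihood.
    """
    rng = np.random.default_rng(seed)
    s_a = gmm_sample(n_samples, *gmm_a, rng)
    s_b = gmm_sample(n_samples, *gmm_b, rng)
    return (gmm_logpdf(s_a, *gmm_a).sum() + gmm_logpdf(s_b, *gmm_b).sum()
            - gmm_logpdf(s_a, *gmm_b).sum() - gmm_logpdf(s_b, *gmm_a).sum())
```

In this sketch, `n_samples` plays the role of the number of points drawn for the Monte Carlo comparison, and the shapes of `means` fix the number of components (M) and the feature dimension (N). Comparing a model with itself gives zero up to rounding, because the same samples enter the positive and negative terms.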
However, such extensive testing over large, interdependent parameter spaces is both difficult and costly.

1.2. Evaluation

The algorithm used for timbre similarity comes with very many variants, and has very many parameters to select. The parameter space for the original algorithm is at least 6-dimensional: sample rate, number of MFCCs (N), number of components (M), distance sample rate (for Monte Carlo), alternative distance (EMD, etc.), and window size. Moreover, some of these parameters are not independent, e.g. there is an optimal balance to be found between high dimensionality (N) and high precision of the modeling (M). Additionally, the original algorithm may be modified by a number of classical pre- and post-processing steps, such as appending delta coefficients or the 0th coefficient to the MFCC set. Finally, one would also like to test a number of vari-