HUMAN SIMILARITY JUDGMENTS: IMPLICATIONS FOR THE DESIGN OF FORMAL EVALUATIONS

M. Cameron Jones, J. Stephen Downie, Andreas F. Ehmann
International Music Information Retrieval Systems Evaluation Laboratory
Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign

ABSTRACT

This paper presents findings from a series of analyses of human similarity judgments from the Symbolic Melodic Similarity and Audio Music Similarity tasks of the Music Information Retrieval Evaluation eXchange (MIREX) 2006. The categorical judgment data generated by the evaluators are analyzed with regard to judgment stability, inter-grader reliability, and patterns of disagreement, both within and between the two tasks. An exploration of this space yields implications for the design of MIREX-like evaluations.

1. INTRODUCTION

The International Music Information Retrieval Systems Evaluation Laboratory (IMIRSEL) at the University of Illinois at Urbana-Champaign has been hosting and running the annual Music Information Retrieval Evaluation eXchange (MIREX) since 2005. Inspired by TREC, MIREX aims to formally evaluate state-of-the-art algorithms for Music Information Retrieval (MIR) systems [2]. MIREX 2006 comprised nine separate evaluation tasks which were defined by community input [5]. Two of these tasks, "Symbolic Melodic Similarity" (SMS) and "Audio Music Similarity and Retrieval" (AMS), called for human judgments of similarity in order to establish ground truth for the evaluation of the submitted algorithms. In order to capture these similarity judgments we created a new web-based tool called the "Evalutron 6000" (E6K).

In this paper, we present findings from our analysis of categorical human similarity judgment data collected using the E6K. We explore the consistency of the graders' scoring, measuring the amount of disagreement among graders. We discuss the implications of our findings for the design of future tasks which utilize human judgments of similarity in the MIR domain.

2. DATA CAPTURE: EVALUTRON 6000

The SMS and AMS tasks shared a common structure. Each task participant's algorithm was run against a collection of either symbolic or audio music files. For each query, each algorithm returned a list of its top-ranked "candidate" songs; the length n of the candidate lists was ten for SMS and five for AMS. All resulting candidates for each query were merged and then evaluated by graders using the E6K.

In the E6K, graders score the anonymized set of candidates for each query. Individual graders are tracked, but their scores are kept independent of their identities. This tracking allowed us to log each grader's interactions with the E6K. Events logged include score inputs, score modifications, auditions, etc. Table 1 provides descriptive statistics for each of the two evaluation tasks.

                                        SMS      AMS
No. of events logged                    23,491   46,254
No. of submitted algorithms             8        6
Total no. of queries                    17       60
Total no. of query-candidate pairs      905      1,629
No. of graders                          21       24
No. of queries per grader               15       7-8
Avg. size of candidate lists            15       27
Avg. no. of evaluations per grader      225      205

Table 1. Summary of Evalutron 6000 statistics.

After listening to each query-candidate pair, graders were asked to rate the degree of similarity of the candidate to the query in two ways: 1) by selecting one of the three BROAD categories of similarity: Not Similar (NS), Somewhat Similar (SS), and Very Similar (VS); and 2) by assigning a FINE score between 0.0 (least similar) and 10.0 (most similar).
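To make the grading data concrete, the sketch below shows one way the merged, anonymized candidate pool and the two-part BROAD/FINE judgments could be modeled in Python. This is a hypothetical illustration, not the actual E6K implementation; the names Judgment, Broad, and merge_candidates are ours.

    from dataclasses import dataclass
    from enum import Enum

    class Broad(Enum):
        """The three BROAD similarity categories."""
        NS = "Not Similar"
        SS = "Somewhat Similar"
        VS = "Very Similar"

    @dataclass
    class Judgment:
        """One grader's two-part rating of a single query-candidate pair."""
        grader_id: str       # graders are tracked, but scores stay unlinked from identities
        query_id: str
        candidate_id: str
        broad: Broad         # categorical rating
        fine: float          # 0.0 (least similar) .. 10.0 (most similar)

        def __post_init__(self):
            if not 0.0 <= self.fine <= 10.0:
                raise ValueError("FINE score must lie in [0.0, 10.0]")

    def merge_candidates(returns_per_algorithm: dict[str, list[str]]) -> list[str]:
        """Merge the top-n candidate lists from all algorithms for one query,
        dropping duplicates so each candidate is graded only once."""
        pool: list[str] = []
        for candidates in returns_per_algorithm.values():
            for c in candidates:
                if c not in pool:
                    pool.append(c)
        return pool

    # Example: two algorithms returning overlapping top-ranked candidates
    pool = merge_candidates({"algoA": ["s1", "s2", "s3"], "algoB": ["s2", "s4"]})
    print(pool)  # ['s1', 's2', 's3', 's4']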
Each query-candidate pair was evaluated by three different graders. Data were collected between 5 Sept. and 20 Sept. 2006 from volunteer graders drawn from the MIR/MDL research community, representing 11 different countries.

3. MEASURING DISAGREEMENT

Understanding the consistency of the graders' judgments is essential to interpreting human judgments of similarity in contexts such as MIREX. Previous studies [1,6] have analyzed the consistency of judgments between BROAD scores and FINE scores. Figure 1 shows the consistency of assignment of FINE scores within BROAD categories for both the SMS and AMS tasks. The variation of FINE scores within the BROAD SS category for AMS is particularly interesting, indicating that graders were not very consistent in assigning FINE scores to items they had marked as Somewhat Similar (SS) for this task. The differences between the tasks in the consistency of FINE scores are discussed in more detail in [1].
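As a rough illustration of the kind of per-category consistency check summarized in Figure 1 (not the analysis code used in [1]), the following Python sketch groups FINE scores by their BROAD category and reports the spread within each group; a high standard deviation in a category such as SS would signal inconsistent FINE scoring.

    from collections import defaultdict
    from statistics import mean, stdev

    def fine_spread_by_broad(judgments):
        """Group (broad, fine) pairs by BROAD category and report the
        count, mean, and standard deviation of the FINE scores per group."""
        groups = defaultdict(list)
        for broad, fine in judgments:
            groups[broad].append(fine)
        return {
            broad: {
                "n": len(fines),
                "mean": mean(fines),
                "stdev": stdev(fines) if len(fines) > 1 else 0.0,
            }
            for broad, fines in groups.items()
        }

    # Toy data: the wide FINE spread inside SS mimics the AMS pattern noted above
    toy = [("NS", 1.0), ("NS", 2.0), ("SS", 3.0), ("SS", 8.5),
           ("SS", 6.0), ("VS", 9.0), ("VS", 9.5)]
    for broad, stats in fine_spread_by_broad(toy).items():
        print(broad, stats)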