HUMAN SIMILARITY JUDGMENTS: IMPLICATIONS
FOR THE DESIGN OF FORMAL EVALUATIONS
M. Cameron Jones J. Stephen Downie Andreas F. Ehmann
International Music Information Retrieval Systems Evaluation Laboratory
Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign
ABSTRACT
This paper presents findings from a series of analyses of
human similarity judgments gathered for the Symbolic
Melodic Similarity and Audio Music Similarity tasks of the
Music Information Retrieval Evaluation eXchange
(MIREX) 2006. The categorical judgment data
generated by the evaluators are analyzed with regard to
judgment stability, inter-grader reliability, and patterns
of disagreement, both within and between the two tasks.
An exploration of this space yields implications for the
design of MIREX-like evaluations.
1. INTRODUCTION
The International Music Information Retrieval Systems
Evaluation Laboratory (IMIRSEL) at the University of
Illinois at Urbana-Champaign has been hosting and
running the Annual Music Information Retrieval
Evaluation eXchange (MIREX) since 2005. Inspired by
TREC, the goal of MIREX is to formally evaluate state-
of-the-art algorithms for Music Information Retrieval
(MIR) systems [2].
MIREX 2006 comprised nine separate evaluation
tasks which were defined by community input [5]. Two
of these tasks, “Symbolic Melodic Similarity” (SMS)
and “Audio Music Similarity and Retrieval” (AMS),
called for human judgments of similarity in order to
establish ground truth for the evaluation of the
submitted algorithms. In order to capture these
similarity judgments we created a new web-based tool
called the “Evalutron 6000” (E6K).
In this paper, we present findings from our analysis
of categorical human similarity judgment data collected
using the E6K. We explore the consistency of the
graders’ scoring, measuring the amount of disagreement
among graders. We discuss the implications of our
findings for the design of future tasks which utilize
human judgments of similarity in the MIR domain.
2. DATA CAPTURE: EVALUTRON 6000
The SMS and AMS tasks shared a common structure.
Each task participant’s algorithm was run against a
collection of either symbolic or audio music files. For
each query, each algorithm returned a list of the top-
ranked “candidate” songs; the length n of these candidate
lists was ten for SMS and five for AMS. All resulting
candidates for each query were merged and then
evaluated by graders using the E6K.
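To make the candidate-pooling step concrete, the following Python sketch shows one way the per-query top-n lists could be merged and de-duplicated before grading; the function and variable names (pool_candidates, algorithm_results, n) are illustrative and are not taken from the actual MIREX or E6K code.

# Illustrative sketch only: merge each algorithm's top-n candidates for one
# query into a single de-duplicated pool for grading (names are hypothetical).
def pool_candidates(algorithm_results, n):
    pool, seen = [], set()
    for ranked_list in algorithm_results:       # one ranked list per algorithm
        for candidate in ranked_list[:n]:       # keep only the top-n entries
            if candidate not in seen:           # drop duplicates across lists
                seen.add(candidate)
                pool.append(candidate)
    return pool

# Example with three toy result lists and n = 5 (the AMS list length):
pool = pool_candidates([["a", "b", "c", "d", "e"],
                        ["b", "f", "a", "g", "h"],
                        ["c", "a", "i", "j", "k"]], n=5)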
In the E6K, graders anonymously score the anonymized
set of candidates for each query. Individual graders are
tracked, but their scores are kept independent of their
identities. This tracking allowed us to log each grader’s
interactions with the E6K, including score inputs, score
modifications, and auditions. Table 1 provides descriptive
statistics for each of the two evaluation tasks.
                                        SMS      AMS
No. of events logged                 23,491   46,254
No. of submitted algorithms               8        6
Total no. of queries                     17       60
Total no. of query-candidate pairs      905    1,629
No. of graders                           21       24
No. of queries per grader                15      7-8
Avg. size of candidate lists             15       27
Avg. no. of evaluations per grader      225      205

Table 1. Summary of Evalutron 6000 statistics.
After listening to each query-candidate pair, graders
were asked to rate the degree of similarity of the
candidate to the query in two ways: 1) by selecting one
of the three BROAD categories of similarity: Not
Similar (NS), Somewhat Similar (SS), and Very Similar
(VS); and, 2) by assigning a FINE score between 0.0
(Least similar) and 10.0 (Most similar). Each query-
candidate pair was evaluated by three different graders.
Data were collected between 5 and 20 Sept. 2006 from
volunteer graders in the MIR/MDL research community,
representing 11 different countries.
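As an illustration of how one such judgment can be represented, the Python sketch below defines a minimal record holding a grader's BROAD category and FINE score for a single query-candidate pair; the class and field names are hypothetical and do not reflect the actual E6K data model.

# Hypothetical record for one grader's judgment of one query-candidate pair.
from dataclasses import dataclass

BROAD_CATEGORIES = ("NS", "SS", "VS")   # Not / Somewhat / Very Similar

@dataclass
class Judgment:
    grader_id: str
    query_id: str
    candidate_id: str
    broad: str       # one of BROAD_CATEGORIES
    fine: float      # 0.0 (least similar) to 10.0 (most similar)

    def __post_init__(self):
        if self.broad not in BROAD_CATEGORIES:
            raise ValueError("unknown BROAD category: %s" % self.broad)
        if not 0.0 <= self.fine <= 10.0:
            raise ValueError("FINE score out of range: %s" % self.fine)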
3. MEASURING DISAGREEMENT
Understanding the consistency of the graders’
judgments is essential to interpreting human judgments
of similarity in contexts such as MIREX. Previous
studies [1,6] have analyzed the consistency of
judgments between BROAD scores and FINE scores.
Figure 1 shows the consistency of assignment of FINE
scores within BROAD categories for both SMS and
AMS tasks. The variation of FINE scores within the
BROAD SS category for AMS is particularly
interesting, indicating that graders were not very
consistent in assigning FINE scores to items they had
marked as Somewhat Similar (SS) for this task. The
differences between the two tasks in the consistency of
FINE scores are discussed in more detail in [1].
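The following is a minimal sketch, assuming judgment records shaped like the Judgment example above and not the analysis code behind Figure 1, of how the spread of FINE scores within each BROAD category could be summarized:

# Illustrative only: summarize FINE-score spread per BROAD category as a
# rough indicator of how consistently graders used the FINE scale.
from collections import defaultdict
from statistics import mean, stdev

def fine_spread_by_broad(judgments):
    scores = defaultdict(list)
    for j in judgments:
        scores[j.broad].append(j.fine)
    return {cat: (mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
            for cat, vals in scores.items()}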