IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 14, NO. 3, JUNE 2012 1
Towards Cross-Version Harmonic Analysis of Music
Sebastian Ewert, Student Member, IEEE, Meinard Müller, Member, IEEE, Verena Konz, Daniel Müllensiefen, and
Geraint A. Wiggins
Abstract—For a given piece of music, there often exist multiple versions belonging to the symbolic (e.g., MIDI representations), acoustic (audio recordings), or visual (sheet music) domain. Each type of information allows for applying specialized, domain-specific approaches to music analysis tasks. In this paper, we formulate the idea of a cross-version analysis for comparing and/or combining analysis results from different representations. As an example, we realize this idea in the context of harmonic analysis to automatically evaluate MIDI-based chord labeling procedures using annotations given for corresponding audio recordings. To this end, one needs reliable synchronization procedures that automatically establish the musical relationship between the multiple versions of a given piece. This becomes a hard problem when there are significant local deviations in these versions. We introduce a novel late-fusion approach that combines different alignment procedures in order to identify reliable parts in synchronization results. Then, the cross-version comparison of the various chord labeling results is performed only on the basis of the reliable parts. Finally, we show how inconsistencies in these results across the different versions allow for a quantitative and qualitative evaluation, which not only indicates limitations of the employed chord labeling strategies but also deepens the understanding of the underlying music material.
Index Terms—Alignment, chord recognition, music information retrieval, music synchronization.
I. INTRODUCTION
A MUSICAL work can be described in various ways using different representations. Symbolic formats (e.g., MusicXML, MIDI, Lilypond) conventionally describe a piece of music by specifying important musical parameters like pitch, rhythm, and dynamics. Interpreting these parameters as part of a musical performance leads to an acoustical representation that can be described by audio formats encoding the physical
properties of sound (e.g., WAV, MP3).

Manuscript received February 11, 2011; revised August 08, 2011 and November 21, 2011; accepted February 20, 2012. The work of S. Ewert was supported by the German Research Foundation (DFG CL 64/6-1). The work of M. Müller and V. Konz was supported by the Cluster of Excellence on Multimodal Computing and Interaction (MMCI). The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Svetha Venkatesh.
S. Ewert is with the Multimedia Signal Processing Group, Department of Computer Science III, University of Bonn, Bonn, Germany (e-mail: ewerts@iai.uni-bonn.de).
M. Müller and V. Konz are with Saarland University and the Max-Planck Institut Informatik, Saarbrücken, Germany (e-mail: meinard@mpi-inf.mpg.de; vkonz@mpi-inf.mpg.de).
D. Müllensiefen is with the Department of Psychology, Goldsmiths, University of London, London, U.K. (e-mail: d.mullensiefen@gold.ac.uk).
G. A. Wiggins is with the Centre for Digital Music, School of Electronic Engineering and Computer Science, Queen Mary, University of London, London, U.K. (e-mail: geraint.wiggins@eecs.qmul.ac.uk).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TMM.2012.2190047

Fig. 1. Cross-version music analysis based on synchronization techniques.

Depending on the type of
representation, some musical properties are directly accessible while others may be implicit or even absent. For example, extracting pitch information from a MIDI file is straightforward, while extracting the same information from an audio file is a nontrivial task. On the other hand, while timbre and other complex musical properties are richly represented in an audio recording, the corresponding options in a MIDI file are very limited. Thus, an audio recording is close to being expressively complete in the sense that it represents music close to what is heard by a listener [1]. A MIDI representation, in contrast, contains structural information in an explicit form, but usually does not encode expressive information. Such differences between music representations allow for conceptually very different approaches to higher-level music analysis tasks such as melody extraction or structure analysis. Typically, each approach has intrinsic domain-specific strengths and weaknesses.
As our main conceptual contribution, we formulate the idea of a cross-version analysis for comparing and/or combining analysis results from different domains. Our main idea is to incorporate music synchronization techniques to temporally align music representations across the different domains (see Fig. 1). Here, music synchronization refers to a procedure which, for a given position in one representation of a piece of music, determines the corresponding position within another representation.
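To make this notion concrete, the following minimal sketch (not from the paper; all names and the example path are hypothetical) shows how a synchronization result, represented as a warping path of corresponding time pairs such as those produced by dynamic time warping, can be used to map a position in one version onto the other:

```python
from bisect import bisect_right

def map_position(path, t_source):
    """Map a time position in the source version to the target version.

    `path` is a warping path: a list of (t_source, t_target) pairs that
    are monotonically non-decreasing in both components, as produced by
    an alignment procedure such as dynamic time warping (DTW).
    Positions between anchor pairs are mapped by linear interpolation.
    """
    times_src = [p[0] for p in path]
    # Locate the path segment that contains t_source.
    i = bisect_right(times_src, t_source) - 1
    i = max(0, min(i, len(path) - 2))
    (s0, t0), (s1, t1) = path[i], path[i + 1]
    if s1 == s0:  # vertical step: several target frames for one source frame
        return t0
    frac = (t_source - s0) / (s1 - s0)
    return t0 + frac * (t1 - t0)

# Hypothetical MIDI-to-audio warping path (in seconds); the recorded
# performance is slower than the score version, so positions stretch.
path = [(0.0, 0.0), (1.0, 1.5), (2.0, 3.2), (3.0, 4.5)]
print(map_position(path, 1.5))  # midpoint between anchors, approximately 2.35
```

Real alignment procedures return such paths at the feature-frame level; the interpolation step simply turns the discrete path into a continuous position mapping.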
In general, a cross-version approach presents many varied opportunities to compare methods across different domains or to create methods that unite the domain-specific strengths while attenuating the weaknesses. In this paper, we present an instance of such a cross-version analysis procedure, considering the task of automated chord labeling. Here, the objective is to induce the harmonic structure of a piece of music. The output of a chord labeling process is a sequence of chord labels with time stamps, either in musical time (i.e., in bars and beats) or in physical time measured in seconds. Because chord progressions
1520-9210/$31.00 © 2012 IEEE