IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 14, NO. 3, JUNE 2012

Towards Cross-Version Harmonic Analysis of Music

Sebastian Ewert, Student Member, IEEE, Meinard Müller, Member, IEEE, Verena Konz, Daniel Müllensiefen, and Geraint A. Wiggins

Abstract—For a given piece of music, there often exist multiple versions belonging to the symbolic (e.g., MIDI representations), acoustic (audio recordings), or visual (sheet music) domain. Each type of information allows for applying specialized, domain-specific approaches to music analysis tasks. In this paper, we formulate the idea of a cross-version analysis for comparing and/or combining analysis results from different representations. As an example, we realize this idea in the context of harmonic analysis to automatically evaluate MIDI-based chord labeling procedures using annotations given for corresponding audio recordings. To this end, one needs reliable synchronization procedures that automatically establish the musical relationship between the multiple versions of a given piece. This becomes a hard problem when there are significant local deviations in these versions. We introduce a novel late-fusion approach that combines different alignment procedures in order to identify reliable parts in synchronization results. Then, the cross-version comparison of the various chord labeling results is performed only on the basis of the reliable parts. Finally, we show how inconsistencies in these results across the different versions allow for a quantitative and qualitative evaluation, which not only indicates limitations of the employed chord labeling strategies but also deepens the understanding of the underlying music material.

Index Terms—Alignment, chord recognition, music information retrieval, music synchronization.

I. INTRODUCTION

A musical work can be described in various ways using different representations.
Symbolic formats (e.g., MusicXML, MIDI, Lilypond) conventionally describe a piece of music by specifying important musical parameters like pitch, rhythm, and dynamics. Interpreting these parameters as part of a musical performance leads to an acoustical representation that can be described by audio formats encoding the physical properties of sound (e.g., WAV, MP3). Depending on the type of representation, some musical properties are directly accessible while others may be implicit or even absent.

Fig. 1. Cross-version music analysis based on synchronization techniques.

Manuscript received February 11, 2011; revised August 08, 2011 and November 21, 2011; accepted February 20, 2012. The work of S. Ewert was supported by the German Research Foundation (DFG CL 64/6-1). The work of M. Müller and V. Konz was supported by the Cluster of Excellence on Multimodal Computing and Interaction (MMCI). The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Svetha Venkatesh. S. Ewert is with the Multimedia Signal Processing Group, Department of Computer Science III, University of Bonn, Bonn, Germany (e-mail: ewerts@iai.uni-bonn.de). M. Müller and V. Konz are with Saarland University and the Max-Planck-Institut Informatik, Saarbrücken, Germany (e-mail: meinard@mpi-inf.mpg.de; vkonz@mpi-inf.mpg.de). D. Müllensiefen is with the Department of Psychology, Goldsmiths, University of London, London, U.K. (e-mail: d.mullensiefen@gold.ac.uk). G. A. Wiggins is with the Centre for Digital Music, School of Electronic Engineering and Computer Science, Queen Mary, University of London, London, U.K. (e-mail: geraint.wiggins@eecs.qmul.ac.uk). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TMM.2012.2190047
For example, extracting pitch information from a MIDI file is straightforward, while extracting the same information from an audio file is a nontrivial task. On the other hand, while timbre and other complex musical properties are richly represented in an audio recording, the corresponding options in a MIDI file are very limited. Thus, an audio recording is close to being expressively complete in the sense that it represents music close to what is heard by a listener [1]. In contrast, a MIDI representation contains structural information in an explicit form, but usually does not encode expressive information. Such differences between music representations allow for conceptually very different approaches to higher-level music analysis tasks such as melody extraction or structure analysis. Typically, each approach has intrinsic domain-specific strengths and weaknesses.

As our main conceptual contribution, we formulate the idea of a cross-version analysis for comparing and/or combining analysis results from different domains. Our main idea is to incorporate music synchronization techniques to temporally align music representations across the different domains (see Fig. 1). Here, music synchronization refers to a procedure which, for a given position in one representation of a piece of music, determines the corresponding position within another representation. In general, a cross-version approach presents many varied opportunities to compare methods across different domains or to create methods that unite the domain-specific strengths while attenuating the weaknesses. In this paper, we present an instance of such a cross-version analysis procedure, considering the task of automated chord labeling. Here, the objective is to induce the harmonic structure of a piece of music. The output of a chord labeling process is a sequence of chord labels with time stamps, either in musical time (i.e., in bars and beats) or in physical time measured in seconds.
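To make the notion of synchronization concrete, the following is a minimal sketch (not the paper's actual procedure) of how an alignment between two versions can be computed with dynamic time warping (DTW), a standard technique for this task. The scalar sequences `a` and `b` are toy stand-ins for frame-wise audio or MIDI features such as chroma vectors, and all function names are illustrative.

```python
def dtw_path(a, b, dist=lambda x, y: abs(x - y)):
    """Compute an optimal DTW alignment path between sequences a and b."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = accumulated cost of aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = dist(a[i - 1], b[j - 1]) + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, (i, j) = min((cost[i - 1][j - 1], (i - 1, j - 1)),
                        (cost[i - 1][j],     (i - 1, j)),
                        (cost[i][j - 1],     (i, j - 1)),
                        key=lambda t: t[0])
    return list(reversed(path))

def corresponding_position(path, pos_a):
    """For a frame index in version A, return an aligned frame index in B."""
    return min(j for i, j in path if i == pos_a)

# Toy example: version B is a "slower performance" of version A
# (every feature frame lasts twice as long).
a = [0, 2, 4, 6, 8]
b = [0, 0, 2, 2, 4, 4, 6, 6, 8, 8]
path = dtw_path(a, b)
print(corresponding_position(path, 2))  # → 4, i.e., frame 2 of A aligns to frame 4 of B
```

The returned path realizes exactly the mapping described above: given a position in one representation, it yields the musically corresponding position in the other, which is the prerequisite for transferring chord labels or annotations between versions.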
Because chord progressions