A Robust Mid-level Representation for Harmonic Content in Music Signals

Juan P. Bello and Jeremy Pickens
Centre for Digital Music
Queen Mary, University of London
London E1 4NS, UK
juan.bello-correa@elec.qmul.ac.uk

ABSTRACT

When considering the problem of audio-to-audio matching, determining musical similarity using low-level features such as Fourier transforms and MFCCs is an extremely difficult task, as there is little semantic information available. Full semantic transcription of audio is an unreliable and imperfect task in the best case, an unsolved problem in the worst. To this end we propose a robust mid-level representation that incorporates both harmonic and rhythmic information, without attempting full transcription. We describe a process for creating this representation automatically, directly from multi-timbral and polyphonic music signals, with an emphasis on popular music. We also offer various evaluations of our techniques. More so than most approaches working from raw audio, we incorporate musical knowledge into our assumptions, our models, and our processes. Our hope is that by utilizing this notion of a musically-motivated mid-level representation we may help bridge the gap between symbolic and audio research.

Keywords: Harmonic description, segmentation, music similarity

1 Introduction

Mid-level representations of music are measures that can be computed directly from audio signals using a combination of signal processing, machine learning and musical knowledge. They seek to emphasize the musical attributes of audio signals (e.g. chords, rhythm, instrumentation), attaining higher levels of semantic complexity than low-level features (e.g. spectral coefficients, MFCCs, etc.), but without being bounded by the constraints imposed by the rules of music notation.
Their appeal resides in their ability to provide a musically-meaningful description of audio signals that can be used for music similarity applications, such as retrieval, segmentation, classification and browsing in musical collections.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. © 2005 Queen Mary, University of London

Previous attempts to model music from complex audio signals concentrate mostly on the attributes of timbre and rhythm (Aucouturier and Pachet, 2002; Yang, 2002). These methods are usually limited by the simplicity of their selected feature sets, which can often be regarded as low-level. Dixon et al. (2004) demonstrated that it is possible to successfully characterize music according to rhythm by adding higher-level descriptors to a low-level feature set. These descriptors are more readily available for rhythm than for harmony, as the state of the art in beat tracking, meter tracking and tempo estimation has had more success than similar efforts on chord and melody estimation.

Pickens et al. (2002) showed success at identifying harmonic similarities between a polyphonic audio query and symbolic polyphonic scores. The approach relied on automatic transcription, a process which is only partially effective within a highly constrained subset of musical recordings (e.g. mono-timbral, no drums or vocals, small polyphonies). To retrieve effectively despite transcription errors, all symbolic data was converted to harmonic distributions, and similarity was measured by computing the distance between two distributions over the same event space. This is an inefficient process that takes the unnecessary step of transcription before the construction of an abstract representation of the harmony of the piece.
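To make the distribution-based matching idea concrete, the following is a minimal sketch, not the authors' implementation: two pieces are each summarized as a probability distribution over a shared 24-triad event space (12 major and 12 minor chords), and similarity is a distance between those distributions. The event names, the example counts, and the choice of cosine distance are all illustrative assumptions.

```python
import numpy as np

# Hypothetical shared event space: 24 triads (12 major + 12 minor).
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F",
                 "F#", "G", "G#", "A", "A#", "B"]
TRIADS = [p + q for q in ("", "m") for p in PITCH_CLASSES]  # 24 events

def normalize(counts):
    """Turn raw triad counts into a probability distribution."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def cosine_distance(p, q):
    """Distance between two harmonic distributions over the same space."""
    return 1.0 - np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

# Pieces whose harmony concentrates on the same triad score as closer.
query = normalize(np.eye(24)[TRIADS.index("C")] * 10 + 1)    # mostly C major
match = normalize(np.eye(24)[TRIADS.index("C")] * 8 + 1)     # also mostly C major
other = normalize(np.eye(24)[TRIADS.index("F#m")] * 10 + 1)  # mostly F# minor

assert cosine_distance(query, match) < cosine_distance(query, other)
```

Any divergence measure over the same event space (e.g. KL divergence) could stand in for the cosine distance here; the key point is that matching happens in the abstract harmonic space rather than on transcribed notes.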
In this paper we propose a method for semantically describing harmonic content directly from music signals. Our goal is not to perform a formal harmonic analysis but to produce a robust and consistent harmonic description useful for similarity-based applications. We do this without attempting to estimate the pitch of notes in the mixture. By avoiding the transcription step, we also avoid its constraints, allowing us to operate on a wide variety of music. The approach combines a chroma-based representation and a hidden Markov model (HMM) initialized with musical knowledge and partially trained on the signal data. The output, which is a function of beats (tactus) instead of time, represents the sequence of major and minor triads that describe the harmonic character of the input signal.

The remainder of this paper is organized as follows: Section 2 reviews previous work in this area; Section 3 gives details about the construction of the feature vector; Section 4 explains the model used and justifies our ini-
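The core idea above, mapping beat-synchronous chroma vectors to major/minor triad labels, can be sketched as follows. This is a deliberate simplification: plain template matching stands in for the knowledge-initialized HMM described in the paper, and all function names and template shapes are illustrative assumptions.

```python
import numpy as np

NOTES = ["C", "C#", "D", "D#", "E", "F",
         "F#", "G", "G#", "A", "A#", "B"]

def triad_templates():
    """Binary 12-bin chroma templates for 12 major and 12 minor triads."""
    templates, labels = [], []
    for root in range(12):
        for name, third in (("", 4), ("m", 3)):  # major vs. minor third
            t = np.zeros(12)
            # Root, third, and perfect fifth of the triad.
            t[[root, (root + third) % 12, (root + 7) % 12]] = 1.0
            templates.append(t)
            labels.append(NOTES[root] + name)
    return np.array(templates), labels

def label_beats(beat_chroma):
    """Pick the best-matching triad for each beat's 12-bin chroma vector."""
    templates, labels = triad_templates()
    scores = beat_chroma @ templates.T  # one score per (beat, triad) pair
    return [labels[i] for i in scores.argmax(axis=1)]

# A clean C-major chroma (energy on C, E, G) is labelled "C".
chroma = np.zeros((1, 12))
chroma[0, [0, 4, 7]] = 1.0
print(label_beats(chroma))  # -> ['C']
```

An HMM improves on this per-beat decision by adding transition probabilities between triads, so that the decoded chord sequence is smoothed over time rather than chosen independently at every beat.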