A SPECTRAL DIFFERENCE APPROACH TO DOWNBEAT EXTRACTION IN MUSICAL AUDIO

Matthew E. P. Davies and Mark D. Plumbley
Centre for Digital Music, Queen Mary, University of London
Mile End Road, London, E1 4NS, United Kingdom
matthew.davies@elec.qmul.ac.uk

ABSTRACT

We introduce a method for detecting downbeats in musical audio given a sequence of beat times. Using the musical knowledge that lower frequency bands are perceptually more important, we use the spectral difference between band-limited, beat-synchronous analysis frames as a robust downbeat indicator. Initial results are encouraging for this type of system.

1. INTRODUCTION

Numerous approaches exist for the problem of beat tracking (e.g. [1, 2, 3, 4]), that of replicating the human ability of tapping in time to music. However, much less attention has been given to higher level metrical analysis. One such problem is the extraction of downbeats from musical audio, i.e. finding the first beat of each bar.

A robust downbeat extractor could be of considerable use within the context of music information retrieval: to enable fully automated rhythmic pattern analysis for genre classification [5]; to indicate likely temporal boundaries for structural audio segmentation [6]; and to improve the robustness of beat tracking systems by applying higher level knowledge [7].

The principal difficulty appears to lie not in finding the number of beats per bar (the time-signature), but in resolving the phase of the bar-level periodicity [7]. While this might appear a simple task, very few techniques have been found effective for solving this particular problem.
Goto [2] presents two approaches to downbeat estimation: for percussive music, automatically detected kick and snare drum events are compared to pre-defined rhythmic template patterns; for non-percussive music, short-term spectral frames (band-limited to 1kHz) are peak-picked and then histogrammed into beat length segments, where chord changes are used to infer higher level metrical structure. The two methods are combined within a single rhythm tracking system [2] which is shown to be highly accurate and operates in real-time. Goto's system, however, has only been fully tested on a popular music database and is restricted to music in 4/4 time with a constant tempo between 61 and 120 beats per minute (bpm).

Klapuri, Eronen and Astola [7] propose a meter tracking system which uses comb filter analysis within a probabilistic framework to simultaneously track three metrical levels: the tatum, tactus and measure. The phase of measure-level events, i.e. downbeats, is identified by matching rhythmic pattern templates to a mid-level representation calculated in four parallel sub-bands, where most emphasis is given to the lowest of these bands. Klapuri et al. present results over a more varied test database than that used for Goto's algorithm [2] and include cases which exhibit tempo variation. We therefore consider this approach the current state of the art for downbeat estimation.

In this paper we introduce a spectral difference approach to downbeat estimation. Although related to Goto's approach [2], we propose that percussive events and harmonic change can be used implicitly within a single spectral representation to infer downbeats. We require a sequence of beat times and the time-signature of the input signal to be known a priori, both of which are detected within our previously developed beat tracking system [1].
We partition an input signal into band-limited beat length frames and use the musical knowledge that lower pitched events are perceptually more important [4] by preserving spectral information within the range 0–1.4kHz. We calculate the Kullback-Leibler divergence between successive beat frames to form a spectral difference function. Downbeats are selected as those beats which globally lead to most spectral change.

We evaluate our downbeat model against that of Klapuri et al. [7], with initial results indicating better performance for our model. However, current analysis is restricted to cases where the time-signature does not change and the tempo is approximately constant.

Figure 1: Overview of downbeat extraction model

The remainder of this paper is structured as follows. In section 2 we describe our approach to downbeat extraction. Section 3 contains results from an objective and subjective evaluation of our system, with discussion and conclusions in sections 4 and 5.
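As a rough illustration of the processing chain summarised in this introduction (beat length frames band-limited to 0–1.4kHz, a Kullback-Leibler spectral difference between successive frames, and selection of the bar phase with most spectral change), a minimal Python sketch is given below. It is not the implementation evaluated in this paper: the function names, the Hann window, the fixed zero-padded FFT length, the normalisation of each spectrum to unit sum, the direction of the divergence and the use of a mean over each candidate phase are all assumptions made for the example.

```python
import numpy as np


def beat_spectra(x, sr, beat_times, fmax=1400.0):
    """Normalised magnitude spectrum (0..fmax Hz) of each beat length frame."""
    edges = np.round(np.asarray(beat_times) * sr).astype(int)
    # Assumption: one FFT length for all frames, so spectra are comparable.
    n_fft = int(2 ** np.ceil(np.log2(np.max(np.diff(edges)))))
    keep = np.fft.rfftfreq(n_fft, d=1.0 / sr) <= fmax
    spectra = []
    for start, stop in zip(edges[:-1], edges[1:]):
        frame = x[start:stop] * np.hanning(stop - start)
        mag = np.abs(np.fft.rfft(frame, n=n_fft))[keep] + 1e-10
        spectra.append(mag / mag.sum())  # treat as a distribution
    return np.array(spectra)


def spectral_difference(spectra):
    """KL divergence of each beat frame from the frame before it."""
    p, q = spectra[1:], spectra[:-1]
    return np.sum(p * np.log(p / q), axis=1)


def downbeat_phase(x, sr, beat_times, beats_per_bar):
    """Index (0..beats_per_bar-1) of the first downbeat in beat_times."""
    d = spectral_difference(beat_spectra(x, sr, beat_times))
    # d[i] measures the change arriving at beat i + 1, so beats with a
    # given phase map to d[(phase - 1) % beats_per_bar :: beats_per_bar].
    scores = [d[(phase - 1) % beats_per_bar::beats_per_bar].mean()
              for phase in range(beats_per_bar)]
    return int(np.argmax(scores))


# Synthetic check: 16 beats of 0.5 s in 4/4, with the tone changing at
# beats 1, 5, 9 and 13 (a hypothetical signal, not data from the paper).
sr = 8000
beats = np.arange(17) * 0.5
n = np.arange(int(beats[-1] * sr))
tones = np.array([110.0, 220.0, 330.0, 440.0, 550.0])
bar_of_sample = ((n // (sr // 2) - 1) // 4) % len(tones)
x = np.sin(2 * np.pi * tones[bar_of_sample] * n / sr)
phase = downbeat_phase(x, sr, beats, beats_per_bar=4)
```

In this synthetic case the spectral change is concentrated on every fourth beat starting from beat 1, so the sketch picks that phase. The sketch assumes at least one full bar of beats and, like the analysis described here, a fixed time-signature and approximately constant tempo.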