EXTRACTING THE PERCEPTUAL TEMPO FROM MUSIC

Martin F. McKinney
Philips Research Laboratories
Eindhoven, The Netherlands

Dirk Moelants
IPEM-Department of Musicology
Ghent University, Belgium

ABSTRACT

The study presented here outlines a procedure for measuring and quantitatively representing the perceptual tempo of a musical excerpt. We also present a method for applying such measures of perceptual tempo to the design of automatic tempo-trackers in order to more accurately represent the perceived beat in music.

Keywords: Tempo, Perception, Beat-tracking

1. INTRODUCTION

Tempo is a basic element and useful descriptive parameter of music and has been the focus of many systems for automatic music information retrieval, i.e., automatic tempo trackers [9]. When describing musical tempo, it is often useful to make a distinction between notated tempo and perceptual tempo. Notated and perceptual tempo can differ in that, for a given excerpt of music, there is only a single notated tempo, while listeners unfamiliar with the score can perceive the tempo to exist at different metrical levels [6]. For some pieces of music, the perceptual tempo is quite ambiguous, while for others it is not. It is often desirable to have a representation of perceptual tempo rather than notated tempo, especially in situations where the notated tempo of an audio track is unknown or unavailable.

A common problem with systems for automatic tempo extraction is that they do not distinguish between notated and perceptual tempo and, as a result, cannot reliably represent either form of tempo. A further consequence of this shortcoming is that meaningful performance evaluation of such systems is difficult because the form of the output is poorly defined.

This study provides a method for measuring and characterizing the perceptual tempo of musical excerpts and then applying the characterization to the development and testing of automatic tempo extractors.
Previous studies on the perception of pulse have shown that listeners tend to prefer tempi near a "resonance" of ∼120 beats per minute [3, 11, 7]. When subjects were asked to tap to the beat in studies using artificial tone sequences and musical excerpts as stimuli, they would preferentially tap at metrical levels whose tempi were in this resonant range. We have shown in a similar study that, for individual musical excerpts, a resonance model can predict the distribution of subjects' tapped tempi for some but not all excerpts [6]. Several factors, including various types of rhythmic accents (e.g., dynamic and durational), are likely to cause the distribution of perceptual tempi for some excerpts to deviate from a simple resonance representation [8]. A proper system for tempo extraction should accurately represent the perceptual tempo of music in all cases, including those that are not easily represented by a simple resonant model and those in which the perceptual tempo is ambiguous.

The typical structure of a system for tempo extraction can be divided into two stages: 1) a stage that generates a representation of temporal dynamics, either by extracting it from audio (e.g., taking the derivative of the signal energy in a number of frequency bands [9]) or by deriving it from a symbolic representation such as MIDI; and 2) a secondary stage that tabulates periodic regularities in the driving signal produced in the first stage, for example through the use of resonator filter banks [9, 5], multi-agent methods [4], or probabilistic models [1].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. © 2004 Universitat Pompeu Fabra.
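The two-stage structure described above can be sketched in a few lines of code. The following is a minimal, hypothetical Python illustration, not the authors' implementation: stage 1 derives a driving signal from half-wave-rectified band-energy differences, and stage 2 tabulates periodicities with a plain autocorrelation over candidate beat lags, standing in for the resonator filter banks, multi-agent methods, or probabilistic models cited in the text. All function names and parameter values are illustrative.

```python
import numpy as np

def driving_signal(audio, sr, n_bands=4, frame=1024, hop=512):
    """Stage 1: half-wave-rectified energy derivative per frequency band."""
    n_frames = 1 + (len(audio) - frame) // hop
    window = np.hanning(frame)
    spec = np.array([np.abs(np.fft.rfft(window * audio[i * hop:i * hop + frame]))
                     for i in range(n_frames)])
    edges = np.linspace(0, spec.shape[1], n_bands + 1, dtype=int)
    band_energy = np.array([spec[:, edges[b]:edges[b + 1]].sum(axis=1)
                            for b in range(n_bands)])
    diff = np.diff(band_energy, axis=1)          # frame-to-frame energy change
    return np.maximum(diff, 0.0).sum(axis=0)     # rectify, sum over bands

def tempo_tabulation(novelty, frame_rate, bpm_range=(40, 240)):
    """Stage 2: score candidate tempi by autocorrelation at the matching lag."""
    ac = np.correlate(novelty, novelty, mode='full')[len(novelty) - 1:]
    bpms, scores = [], []
    for lag in range(1, len(ac)):
        bpm = 60.0 * frame_rate / lag
        if bpm_range[0] <= bpm <= bpm_range[1]:
            bpms.append(bpm)
            scores.append(ac[lag])
    return np.array(bpms), np.array(scores)

# Usage: a synthetic 120-BPM click track, 10 s at an 8 kHz sampling rate
sr, dur, bpm = 8000, 10.0, 120
audio = np.zeros(int(sr * dur))
audio[::int(sr * 60 / bpm)] = 1.0
audio = np.convolve(audio, np.hanning(64), mode='same')  # soften the clicks
novelty = driving_signal(audio, sr)
bpms, scores = tempo_tabulation(novelty, frame_rate=sr / 512)
print(bpms[np.argmax(scores)])  # strongest candidate, near the 120-BPM level
```

The `(bpms, scores)` pair is exactly the kind of intermediate periodicity tabulation from which a single tempo value, a candidate list, or a histogram-like representation can then be read off; the autocorrelation is only the simplest possible choice for the second stage.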
It is common that a single tempo value or a list of candidate values is generated from these tabulations of periodicities to represent the tempo of a given piece of music. However, an intermediate representation, such as a beat histogram (see [10]), can be a more valuable description of the tempo when trying to relate it to the actual perceived tempo. Here we show how such representations can be used in conjunction with perceptual data to tune systems for tempo extraction so that they more accurately represent perceptual tempo.

2. METHOD

We performed an experiment in which listeners were asked to tap to the beat of 24 10-second musical excerpts covering a wide range of musical styles (see Appendix). We derived a measure of perceived tempo from the tapping times using linear regression and generated histograms of all subjects' perceived tempi for each excerpt. These histograms of perceived tempo served as the "group response" for each excerpt and were taken to represent the overall perceived tempo for a particular excerpt.

Analogs of the perceived-tempo histograms were automatically generated from the audio waveforms of the ex-