SEQUENTIAL INFERENCE OF RHYTHMIC STRUCTURE IN MUSICAL AUDIO

Nick Whiteley, A. Taylan Cemgil, Simon Godsill

University of Cambridge
Department of Engineering
Signal Processing and Communications Laboratory

ABSTRACT

This paper presents a framework for the modelling of temporal characteristics of musical signals and an approximate, sequential Monte Carlo inference scheme which yields estimates of tempo and rhythmic pattern from onset-time data. These two features are quantified through the construction of a probabilistic dynamical model of a hidden ‘bar-pointer’ and a Poisson observation model. The capabilities of the system are demonstrated by tracking the tempo of a 2 against 3 polyrhythm and detecting a switch in rhythm in a MIDI performance.

Index Terms: Music, Statistics, Poisson distributions, Monte Carlo methods

1. INTRODUCTION

An important feature of intelligent music systems is the ability to infer attributes related to temporal structure. These attributes may include musicological constructs such as tempo and rhythmic pattern. The recognition of these characteristics forms a sub-task of automatic music transcription - the unsupervised generation of a score, or description of an audio signal in terms of musical concepts. For music categorization systems, tempo and rhythmic pattern are defining features of genre and therefore useful features for indexing of data sets.

Much work has been done on detecting the ‘pulse’ or foot-tapping rate of musical audio signals [1], [2]. However, these approaches do not distinguish between tempo and rhythm. Goto and Muraoka detail a system which recognizes beats in terms of the ‘reliability’ of hypotheses for different rhythmic patterns [3]. Cemgil and Kappen model MIDI onset events in terms of a tempo process and switches between quantized score locations [4]. Raphael independently proposed a similar system [5].
Hainsworth and Macleod infer beats in a similar framework from raw audio signals [6], but rhythmic pattern is still not explicitly modelled. Takeda et al. perform tempo and rhythm recognition from MIDI data by analogy with speech recognition, but do not accommodate polyrhythms [7]. Klapuri et al. define metrical structure in terms of pulse sensations on different time scales, but do not explicitly discriminate between different rhythmic patterns [8].

In [9], a novel model of temporal structure in musical signals was introduced where exact inference was feasible. However, for certain extensions of the model, the exact inference scheme suffered from high computational requirements since it involved storage and manipulation of very large vectors.

In this paper we focus on the development of a practically scalable, sequential Monte Carlo inference scheme for a model of tempo and rhythmic pattern analogous to that in [9]. Development of such an inference scheme is challenging in this case due to the multi-modality of posterior probability distributions. In practical terms, this issue arises for the same reasons that human listeners can often ‘explain’ the same piece of music in terms of several different combinations of tempo and rhythmic pattern. Whilst the examples in this paper take as input MIDI onset data, the same framework could be used with onset times obtained from existing onset detection systems, e.g. [10].

In the Bayesian paradigm the task of joint estimation of tempo and rhythmic pattern is treated as an inference problem, where given a sequence of observations $y_{1:n} \equiv (y_1, y_2, \ldots, y_n)$ the aim is to compute posterior densities over the hidden state variables $x_{0:n} \equiv (x_0, x_1, \ldots, x_n)$. In a sequential setting we first postulate a Markovian prior density over the hidden state variables, $p(x_{k+1} \mid x_k)$, which describes how the state variables evolve from one time index to the next.
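As a minimal sketch of what a Markovian prior implies, the following Python fragment draws a state trajectory $x_{0:n}$ by sampling $x_0$ and then repeatedly sampling from the transition density. The function names, the uniform initial density on [0, 1) and the Gaussian random-walk transition are assumptions chosen purely for illustration; they are not the densities used in this paper.

```python
import random


def sample_prior_path(n, sample_x0, sample_transition, rng):
    """Draw x_0, x_1, ..., x_n from p(x_0) * prod_k p(x_k | x_{k-1})."""
    path = [sample_x0(rng)]
    for _ in range(n):
        # Markov property: the next state depends only on the current state.
        path.append(sample_transition(path[-1], rng))
    return path


def sample_x0(rng):
    # Assumed initial density for illustration: uniform on [0, 1).
    return rng.random()


def sample_transition(x_prev, rng):
    # Assumed transition for illustration: Gaussian random walk, std 0.1.
    return x_prev + rng.gauss(0.0, 0.1)


rng = random.Random(0)
path = sample_prior_path(5, sample_x0, sample_transition, rng)
print(len(path))  # n + 1 states, x_0 through x_n
```

Passing the two samplers as arguments keeps the recursion independent of any particular model, so a different transition density can be substituted without changing the loop.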
The observations are then related to the hidden state via $p(y_k \mid x_k)$. Up to a constant of proportionality, the joint posterior density is given by:

$$p(x_{0:n} \mid y_{1:n}) \propto p(x_0) \prod_{k=1}^{n} p(y_k \mid x_k)\, p(x_k \mid x_{k-1}) \tag{1}$$

2. BAR-POINTER MODEL

The system is built around a dynamical model of a ‘bar-pointer’, a hypothetical, hidden object which maps an observed time-series to one period of a latent rhythmical pattern, i.e. one bar. At time $t_k = k\Delta$, $k \in \{1, 2, \ldots, n\}$ and $\Delta$ a constant, denote by $\phi_k \in [0, 1)$ the position of the bar-pointer and denote by $\dot{\phi}_k \in [\dot{\phi}_{\min}, \dot{\phi}_{\max}]$ its velocity, where $\dot{\phi}_{\min} > 0$. The probabilistic kinematics of the bar-pointer are modelled as being a piece-wise constant velocity process:

IV 1321 1424407281/07/$20.00 ©2007 IEEE ICASSP 2007