MUSIC STRUCTURE ANALYSIS WITH A PROBABILISTIC FITNESS FUNCTION IN MIREX2009

Jouni Paulus and Anssi Klapuri
Department of Signal Processing, Tampere University of Technology
jouni.paulus@tut.fi, anssi.klapuri@tut.fi

ABSTRACT

This paper describes the method we submitted for the "Structural Segmentation" task at MIREX2009. The method defines a fitness function for structural descriptions based on the idea that all occurrences of a musical part should be acoustically similar and differ from the occurrences of other parts. The method creates a large set of potential segments, estimates the probability of each pair of segments being occurrences of the same part, and uses these probabilities in a fitness function. The fitness function is optimised with a greedy search algorithm.

1. INTRODUCTION

Music piece structure analysis refers to the task of providing a temporal segmentation of a piece into occurrences of musical parts, such as "chorus" and "verse", and grouping the occurrences of each part. This kind of analysis is meaningful for pieces having a sectional form. An occurrence of a musical part is often 20–30 s in length and may be repeated later in the piece.

Various methods for music structure analysis have been proposed in the literature; for an overview of the basic principles, refer to [1]. The main method categorisation provided in [2] divides the methods into "state" and "sequence" approaches. The former considers the piece to be produced by a state machine, while the latter assumes that the piece contains repeated sequences of musical events. The method proposed in this paper belongs to the "state" category, or it can be considered to belong to a third category: fitness-function-based approaches.
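To illustrate the general idea of a fitness-function-based approach, the following sketch scores a grouping of segments by the pairwise same-part probabilities and merges groups greedily. The fitness form (summed log-probabilities) and all names here are illustrative assumptions, not the exact function or search procedure from [3]:

```python
import math
from itertools import combinations

def fitness(groups, p):
    """Hypothetical fitness of a grouping: segments placed in the same group
    contribute log p(same part), segments in different groups contribute
    log(1 - p).  `p[(i, j)]` (with i < j) is the estimated probability that
    segments i and j are occurrences of the same part."""
    label = {}
    for g, segs in enumerate(groups):
        for s in segs:
            label[s] = g
    score = 0.0
    for i, j in combinations(sorted(label), 2):
        pij = p[(i, j)]  # assumed to lie strictly in (0, 1)
        score += math.log(pij) if label[i] == label[j] else math.log(1.0 - pij)
    return score

def greedy_group(segments, p):
    """Greedy search: start from singleton groups and repeatedly merge the
    first pair of groups that improves the fitness, until no merge helps."""
    groups = [[s] for s in segments]
    best = fitness(groups, p)
    improved = True
    while improved and len(groups) > 1:
        improved = False
        for a, b in combinations(range(len(groups)), 2):
            trial = [g for k, g in enumerate(groups) if k not in (a, b)]
            trial.append(groups[a] + groups[b])
            f = fitness(trial, p)
            if f > best:
                best, groups, improved = f, trial, True
                break
    return groups, best
```

With four segments where pairs (0, 2) and (1, 3) have high same-part probability, the search recovers the grouping {0, 2}, {1, 3}.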
The submitted method uses three acoustic features describing different aspects of the piece, creates several potential segmentations, matches each segment pair with two distance measures, and produces probabilities for the two segments being occurrences of the same part. The probabilities are used in a fitness function for descriptions of the piece structure, and a greedy search algorithm is employed for the function optimisation. For more details, see [3].

This work was supported by the Academy of Finland (application number 129657, Finnish Programme for Centres of Excellence in Research 2006–2011).

2. METHOD DESCRIPTION

The method starts by estimating a musical beat grid with the method from [4]. The reliability of the estimation is improved by a two-pass scheme: first, only a 20 s excerpt is analysed. The produced period estimate is then used to sharpen the prior distribution of the beat length by setting the mean of the Gaussian distribution to the estimated period value and halving the original variance parameter value. Then the entire signal is analysed. Still, the resulting beat grid may have π-phase errors. Their effect is reduced by halving the period, producing a half-beat grid.

Raw acoustic feature extraction is done from 4096-sample frames with 50% overlap. Thirteen mel-frequency cepstral coefficients (MFCCs) are calculated from the output of a 42-band triangular mel-scaled filter bank, and the lowest coefficient is discarded. The second acoustic feature is chroma, which is calculated with the method described in [5]. It estimates the saliences of different fundamental frequencies in the range 80–640 Hz, resamples the frequency scale to a semitone scale by retaining only the maximum salience in each semitone range, and finally produces the chroma by octave folding. The features are then temporally resampled to the beat-synchronised grid.

The acoustic features are then focused on two time scales by Hanning-window-weighted median filtering.
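One plausible reading of the Hanning-window-weighted median filtering is a sliding weighted median in which each sample's weight is given by a Hanning window centred on the current frame. The paper does not spell out the exact formulation, so the following numpy sketch is an assumption in its details (padding mode, odd window length):

```python
import numpy as np

def weighted_median(values, weights):
    """Weighted median: the smallest value at which the cumulative
    weight reaches half of the total weight."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    csum = np.cumsum(w)
    idx = np.searchsorted(csum, 0.5 * csum[-1])
    return v[idx]

def hanning_median_filter(feature, win):
    """Smooth each feature dimension of a (time x dims) array over time
    with a median filter weighted by a Hanning window of odd length `win`."""
    half = win // 2
    weights = np.hanning(win)
    T, D = feature.shape
    padded = np.pad(feature, ((half, half), (0, 0)), mode='edge')
    out = np.empty_like(feature)
    for t in range(T):
        seg = padded[t:t + win]          # window centred on frame t
        for d in range(D):
            out[t, d] = weighted_median(seg[:, d], weights)
    return out
```

In the submitted method the coarser time scale would correspond to window lengths of 33 (MFCCs) and 65 (chroma) half-beat frames.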
The finer time-scale features are obtained by skipping the filtering, while the coarser time-scale features are obtained with filtering windows of 33 and 65 frames for MFCCs and chroma, respectively. In addition to MFCCs and chroma, a third acoustic feature, the rhythmogram [6], is calculated. The calculation uses the onset-detection accent function produced by the beat estimation instead of the perceptual spectral flux proposed in the original publication. The feature itself is simply the autocorrelation of the accent function calculated in sliding windows of 33 half-beat frames in length. All the features are finally normalised to zero mean and unit variance over time.

From the five acoustic features (MFCCs and chroma on two temporal scales, and the rhythmogram), separate self-distance matrices (SDMs) are calculated using the cosine distance measure. A set of candidate segmentation points is generated with novelty vector calculation [7]. A Gaussian-tapered 40 × 40 checkerboard kernel matrix is correlated along the main diagonals of the SDMs and the resulting novelty vectors are summed. Maximum of 30 highest local maxima