AUDIO-TO-SCORE ALIGNMENT AT NOTE LEVEL FOR ORCHESTRAL RECORDINGS

Marius Miron, Julio José Carabias-Orti, Jordi Janer
Music Technology Group, Universitat Pompeu Fabra
marius.miron,julio.carabias,jordi.janer@upf.edu

ABSTRACT

In this paper we propose an offline method for refining audio-to-score alignment at the note level in the context of orchestral recordings. State-of-the-art score alignment systems estimate note onsets with a low time resolution and do not detect note offsets. Applications such as score-informed source separation, however, need a precise alignment at the note level. We therefore propose a novel method that refines the alignment by determining the note onsets and offsets in complex orchestral mixtures, combining audio and image processing techniques. First, we introduce a note-wise pitch salience function that weighs the harmonic contribution according to the notes present in the score. Second, we perform image binarization and blob detection based on connectivity rules. Then, we pick the best combination of blobs using dynamic programming. Finally, we obtain onset and offset times from the boundaries of the most salient blob. We evaluate our method on a dataset of Bach chorales, showing that the proposed approach can accurately estimate note onsets and offsets.

1. INTRODUCTION

Audio-to-score alignment concerns synchronizing the notes in a musical score with the corresponding audio rendition. An additional step, alignment at the note level, aims at adjusting the note onsets in order to further minimize the error between the score and the audio. In the context of orchestral music, this task is challenging: first, because of the complex polyphonies, and, second, because of the timing expressivity of classical music.

As a possible application of note alignment, deriving the exact locations of the note onsets and offsets could improve tasks such as score-informed source separation [6], [2], [7].
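The binarization and blob-detection steps summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the fixed threshold, the 4-connectivity rule, the choice of the blob with the largest summed salience, and the names `binarize`, `find_blobs`, `note_boundaries` and `hop_seconds` are all assumptions made for illustration, and the dynamic programming step that selects the best combination of blobs across notes is omitted.

```python
from collections import deque


def binarize(salience, threshold):
    """Mark cells whose salience exceeds a fixed threshold (assumed rule)."""
    return [[1 if value > threshold else 0 for value in row] for row in salience]


def find_blobs(binary):
    """Group 4-connected cells equal to 1 into blobs of (row, col) pairs."""
    n_rows = len(binary)
    n_cols = len(binary[0]) if n_rows else 0
    seen = [[False] * n_cols for _ in range(n_rows)]
    blobs = []
    for r in range(n_rows):
        for c in range(n_cols):
            if binary[r][c] == 0 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited active cell.
            queue = deque([(r, c)])
            seen[r][c] = True
            blob = []
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < n_rows and 0 <= nx < n_cols
                    if inside and binary[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            blobs.append(blob)
    return blobs


def note_boundaries(salience, threshold, hop_seconds):
    """Return (onset, offset) in seconds from the most salient blob, or None.

    Rows are pitch bins around the note, columns are time frames
    (hypothetical layout). The offset is the end of the last active frame.
    """
    blobs = find_blobs(binarize(salience, threshold))
    if not blobs:
        return None
    best = max(blobs, key=lambda blob: sum(salience[y][x] for y, x in blob))
    columns = [x for _, x in best]
    return min(columns) * hop_seconds, (max(columns) + 1) * hop_seconds


# Toy salience matrix: one spurious cell and one larger, stronger region.
example_salience = [
    [0.1, 0.6, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.9, 0.8, 0.1],
    [0.1, 0.1, 0.1, 0.7, 0.9, 0.1],
]
print(note_boundaries(example_salience, threshold=0.5, hop_seconds=0.01))
```

In this toy matrix the isolated cell in the first row forms its own blob but is discarded because the larger region carries more total salience, so the note boundaries come from frames 3 to 4.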
State-of-the-art score alignment methods use non-negative matrix factorization (NMF) [14], [11], template adaptation through expectation maximization [9], dynamic time warping (DTW) [3], and hidden Markov models (HMM) [4, 6]. The method described in [11, p. 103] is the only one that explicitly addresses fine note alignment as a post-processing step. A factorization is performed to obtain the onsets of the anchor notes. The basis vectors are trained with piano pitch models, and the onsets are obtained from the activation matrix. Furthermore, an additional step is performed in order to look for onsets between anchors.

© Marius Miron, Julio José Carabias-Orti, Jordi Janer. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Marius Miron, Julio José Carabias-Orti, Jordi Janer. “Audio-to-score alignment at note level for orchestral recordings”, 15th International Society for Music Information Retrieval Conference, 2014.

However, the methods listed above have certain limitations. First, accurately detecting the offset of a note is a challenging problem, and none of these methods claims to solve it. Second, the scope of the NMF-based systems is limited to piano recordings. Third, except for [11], the algorithms consider a large window when evaluating detected onsets. Note that the MIREX Real-time Audio-to-Score Alignment task considers a 2000 ms window size.

With respect to image processing techniques deployed in music information research, a system to link audio and scores for makam music is presented in [13]. In this case, the Hough transform is used for picking the line corresponding to the most likely path from a binarized distance matrix. Additionally, the same transform is used in [1] to find repeating patterns for audio thumbnailing.

In this paper we propose a novel method for audio-to-score alignment at the note level, which combines audio and image processing techniques.
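The effect of the evaluation window noted above can be made concrete with a small sketch. It assumes that the window acts as a symmetric tolerance around each reference onset, which is our reading and is not stated in the text; the function name `onset_hit_rate` and the onset values are invented for illustration and are not experimental results.

```python
def onset_hit_rate(reference, estimated, tolerance_s):
    """Fraction of notes whose estimated onset is within tolerance_s seconds
    of the reference onset. Notes are assumed to be paired by index."""
    if len(reference) != len(estimated):
        raise ValueError("reference and estimated must have the same length")
    if not reference:
        return 0.0
    hits = sum(
        1 for ref, est in zip(reference, estimated) if abs(ref - est) <= tolerance_s
    )
    return hits / len(reference)


# Invented onset times in seconds, for illustration only.
reference = [0.00, 1.00, 2.00, 3.00]
estimated = [0.02, 1.30, 1.95, 3.90]

# A 2000 ms tolerance accepts every note; a 100 ms tolerance does not.
print(onset_hit_rate(reference, estimated, tolerance_s=2.0))
print(onset_hit_rate(reference, estimated, tolerance_s=0.1))
```

With these values, errors of 300 ms and 900 ms are counted as correct under the large tolerance but are rejected under the small one, which is why a coarse window can hide alignment errors that matter at the note level.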
In comparison to classical audio-to-score alignment methods, we aim to detect the offset of the note along with its onset. Additionally, we do not assume a constant delay between score and audio, and thus we do not use any information regarding beats, tempo or note duration to adjust the onsets. Therefore, our method can align notes when dealing with variable delays, such as those resulting from automatic score alignment or those introduced by manually aligning the score at the beat level.

The proposed method is based on two stages. First, the audio processing stage involves filtering the spectral peaks in time and frequency for every note. The filtering thus takes place in the time interval restricted for each note and in the frequency bands of the harmonic partials corresponding to its fundamental frequency. Furthermore, we decrease the magnitudes of the peaks that overlap in time and frequency with the peaks from other notes. Using the filtered spectral peaks, we compute the pitch salience for each note using the harmonic summation algorithm described in [10]. Second, we detect the boundaries of the note using an image processing algorithm. The pitch salience matrix associated with each note is binarized. Then, blobs, namely boundaries and shapes, are detected using