Proceedings of the Second Vienna Talk, Sept. 19–21, 2010, University of Music and Performing Arts Vienna, Austria

ROBUST REAL-TIME MUSIC TRACKING

Andreas Arzt
Department of Computational Perception
Johannes Kepler University Linz

Gerhard Widmer
Department of Computational Perception
Johannes Kepler University Linz
The Austrian Research Institute for Artificial Intelligence (OFAI)

ABSTRACT

This paper describes our ‘any-time’ real-time music tracking system, which is based on an on-line version of the well-known Dynamic Time Warping (DTW) algorithm and includes some extensions to improve both the precision and the robustness of the alignment (e.g. a tempo model and the ability to reconsider past decisions). A unique feature of our system is the ability to cope with arbitrary structural deviations (e.g. jumps, re-starts) on-line during a live performance.

1. INTRODUCTION

In this paper we describe the current state of our real-time music tracking system. The task of a music tracking system is to follow a musical live performance on-line and to output at any time the current position in the score. While most real-time music tracking systems are based on statistical approaches (e.g. the well-known systems by Raphael [8] and Cont [4]), our alignment algorithm is based on an on-line version of the Dynamic Time Warping algorithm first presented by Dixon in [5]. We subsequently proposed various improvements and additional features to this algorithm (e.g. a tempo model and the ability to cope with ‘jumps’ and ‘re-starts’ by the performer), which are the topic of this paper (see also Figure 1 for an overview of our system).

2. DATA REPRESENTATION

Rather than trying to transcribe the incoming audio stream into discrete notes and align the transcription to the score, we first convert a MIDI version of the given score into a sound file by using a software synthesizer. Due to the information stored in the MIDI file, we know the time of every event (e.g.
note onsets) in this ‘machine-like’, low-quality rendition of the piece and can treat the problem as a real-time audio-to-audio alignment task.

The score audio stream and the live input stream to be aligned are represented as sequences of analysis frames, computed via a windowed FFT of the signal with a Hamming window of size 46 ms and a hop size of 20 ms. The data is mapped into 84 frequency bins, spread linearly up to 370 Hz and logarithmically above, with semitone spacing. In order to emphasize note onsets, which are the most important indicators of musical timing, only the increase in energy in each bin relative to the previous frame is stored.

3. ON-LINE DYNAMIC TIME WARPING

This algorithm is the core of our real-time music tracking system. ODTW takes two time series describing the audio signals – one known completely beforehand (the score) and one arriving in real time (the live performance) – computes an on-line alignment, and at any time returns the current position in the score. In the following we give only a short, intuitive description of the algorithm; for further details we refer the reader to [5].

Dynamic Time Warping (DTW) is an off-line alignment method for two time series based on a local cost measure and an alignment cost matrix computed using dynamic programming, where each cell contains the cost of the optimal alignment up to that cell. After the matrix computation is completed, the optimal alignment path is obtained by tracing the dynamic programming recursion backwards (backward path).

Originally proposed by Dixon in [5], the ODTW algorithm is based on the standard DTW algorithm, but has two important properties making it usable in real-time systems: the alignment is computed incrementally, by always expanding the matrix into the direction (row or column) containing the minimal costs (forward path), and it has linear time and space complexity, as only a fixed number of cells around the forward path is computed.
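To make the incremental expansion concrete, the following Python sketch implements a simplified ODTW forward pass over pre-computed feature frames. The function names, the band-width parameter `c`, and the Euclidean local cost are our assumptions for illustration; details of Dixon's original algorithm, such as the MaxRunCount constraint, are omitted for brevity.

```python
import numpy as np

def cost(x, y):
    # local cost: Euclidean distance between two feature frames (assumed)
    return np.linalg.norm(x - y)

def odtw_forward(score, live, c=20):
    """Simplified sketch of the ODTW forward-path computation.
    `score` and `live` are sequences of feature frames; `live` is
    treated as arriving one frame at a time. Only a band of up to
    c cells around the forward path is evaluated, giving the linear
    time and space complexity described above."""
    D = {(0, 0): cost(live[0], score[0])}   # partial alignment costs
    t, j = 0, 0                             # current live / score position
    path = [(0, 0)]

    def d(a, b):
        return D.get((a, b), np.inf)        # cells outside the band cost inf

    while t < len(live) - 1 or j < len(score) - 1:
        # cheapest cell in the current row vs. current column decides
        # whether to expand towards the live or the score axis
        row_min = min(d(t, k) for k in range(max(0, j - c), j + 1))
        col_min = min(d(k, j) for k in range(max(0, t - c), t + 1))
        if j == len(score) - 1 or (t < len(live) - 1 and col_min <= row_min):
            t += 1                          # new row: consume next live frame
            for k in range(max(0, j - c), j + 1):
                D[(t, k)] = cost(live[t], score[k]) + min(
                    d(t - 1, k), d(t - 1, k - 1), d(t, k - 1))
        else:
            j += 1                          # new column: advance in the score
            for k in range(max(0, t - c), t + 1):
                D[(k, j)] = cost(live[k], score[j]) + min(
                    d(k - 1, j), d(k - 1, j - 1), d(k, j - 1))
        path.append((t, j))
    return path
```

Because each iteration fills at most c + 1 cells and advances exactly one index, the work per incoming live frame is bounded by a constant, which is what makes the approach viable on-line.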
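For illustration, the frame-wise features described in Section 2 (windowed FFT, 84 semitone-spaced bins, half-wave rectified energy increase) might be computed roughly as follows. The exact bin-edge layout, the number of linear bins below 370 Hz, and all function and parameter names are our assumptions, not the authors' implementation.

```python
import numpy as np

def onset_features(audio, sr=22050, win_ms=46, hop_ms=20, n_bins=84):
    """Sketch of the analysis frames of Section 2: windowed FFT,
    mapping to 84 bins (linear up to 370 Hz, semitone-spaced above),
    keeping only the per-bin increase in energy between frames."""
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    window = np.hamming(win)
    freqs = np.fft.rfftfreq(win, d=1.0 / sr)   # FFT bin centre frequencies

    # Bin edges: n_lin linear bins up to 370 Hz (assumed count),
    # then semitone-spaced (logarithmic) edges above 370 Hz.
    n_lin = 20
    lin_edges = np.linspace(0.0, 370.0, n_lin + 1)
    log_edges = 370.0 * 2.0 ** (np.arange(1, n_bins - n_lin + 1) / 12.0)
    edges = np.concatenate([lin_edges, log_edges])   # n_bins + 1 edges

    frames = []
    prev = np.zeros(n_bins)
    for start in range(0, len(audio) - win + 1, hop):
        spec = np.abs(np.fft.rfft(audio[start:start + win] * window))
        energy = np.array([
            spec[(freqs >= edges[b]) & (freqs < edges[b + 1])].sum()
            for b in range(n_bins)])
        # half-wave rectification: keep only the increase in energy
        # relative to the previous frame, to emphasize note onsets
        frames.append(np.maximum(energy - prev, 0.0))
        prev = energy
    return np.array(frames)
```

Both the synthesized score audio and the live input would be passed through the same function, so that the two sequences of frames are directly comparable in the alignment.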
At any time during the alignment it is also possible to compute a backward path starting at the current position, producing an off-line alignment of the two time series which generally is much more accurate. This constantly updated, very accurate alignment of the last couple of seconds is used heavily in our system to improve the alignment accuracy (see Section 4). See also Figure 2 for an illustration of the above-mentioned concepts.

4. THE FORWARD-BACKWARD STRATEGY

We presented some improvements to this algorithm, focusing both on increasing the precision and the robust-