ANNOTATIONS TIME SHIFT: A KEY PARAMETER IN EVALUATING MUSICAL NOTE ONSET DETECTION ALGORITHMS

Mina Mounir¹*, Peter Karsmakers², Toon van Waterschoot¹

¹ KU Leuven, Dpt. of Electrical Engineering, ESAT-STADIUS/ETC, 3001 Leuven, Belgium
mina.mounir@esat.kuleuven.be, toon.vanwaterschoot@esat.kuleuven.be
² KU Leuven, Dpt. of Computer Science, TC CS-ADVISE, B-2440 Geel, Belgium
peter.karsmakers@kuleuven.be

* This research work was carried out at the ESAT Laboratory of KU Leuven. The research leading to these results has received funding from the KU Leuven Internal Funds C2-16-00449, IMP/14/037, and VES/19/004, and the European Research Council under the European Union's Horizon 2020 research and innovation program / ERC Consolidator Grant: SONORA (no. 773268). This paper reflects only the authors' views and the Union is not liable for any use that may be made of the contained information.

ABSTRACT

Musical note onset detection is a building block for several MIR-related tasks. The ambiguity in the definition of a note onset and the lack of a standard way to annotate onsets introduce differences in dataset labeling, which in turn make evaluations of note onset detection algorithms difficult to compare. This paper gives an overview of the parameters influencing the commonly used onset detection evaluation measure, i.e. the F1-score, and points out a consistently missing parameter: the overall time shift in the annotations. The paper shows how crucial this parameter is in making reported F1-scores comparable among different algorithms and datasets, leading to a more reliable evaluation. As several MIR applications are concerned with the location of onsets relative to each other and not with their absolute location, this paper suggests including the overall time shift as a parameter when evaluating algorithm performance. Experiments show a strong variability in the reported F1-score and up to a 50% increase in the best-case F1-score when varying the overall time shift. Optimizing the time shift turns out to be crucial when training or testing algorithms with datasets that are annotated differently (e.g. manually, automatically, or by different annotators), especially when using deep learning algorithms.

Index Terms: Musical note onsets, evaluation, acoustic event detection, machine learning

1. INTRODUCTION

Detecting the onsets of musical notes is like a hide-and-seek game in which we try to chase the start of each musical note in a piece of music. Usually in this kind of game we have a good idea of what is hidden, which moves the whole difficulty of the problem to the seeking part. For note onset detection this is not the case, as the literature provides a variety of definitions for a note start. For instance, it could be either when a note is triggered or when it is perceived [1]. Considering a musical note as a sequence of a transient followed by a steady-state component [2], an onset is the point chosen to mark the transient [3], or more precisely, it should be as close as possible to the transient's start [4]. But again, the length of a transient depends on the instrument and the playing style, and there is no objective way to measure how close the onset is to the transient's start.

Even though note onset detection is an established research problem, it continues to attract researchers' interest. On the one hand, there is still considerable room for performance improvement. On the other hand, it plays a core role in a variety of music signal processing applications (adaptive audio effects [3], music synthesis [5]) and MIR applications (automatic music transcription [6], recommender systems [7], and music fingerprinting/search systems [8], [9], [10]).

[Figure 1 blocks: Preprocessing, Onset Detection Function, Post-processing, Evaluation?]
Figure 1: General scheme for note onset detection.
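To make the "Evaluation" block of Fig. 1 concrete, the following minimal sketch computes an F1-score after adding one overall time shift to all annotations, and then searches a grid of candidate shifts for the best score. It is an illustration only and not the authors' implementation: the greedy one-to-one matching, the 25 ms tolerance, the shift grid, and the names `f1_score` and `best_shift` are all assumptions made for this example.

```python
from typing import Iterable, Tuple


def f1_score(detections: Iterable[float], annotations: Iterable[float],
             tolerance: float = 0.025, shift: float = 0.0) -> float:
    """F1-score (times in seconds) after adding `shift` to every annotation.

    Assumption: a detection is a true positive if it lies within
    +/- `tolerance` of a not-yet-matched annotation (greedy matching).
    """
    dets = sorted(detections)
    refs = sorted(a + shift for a in annotations)
    i = j = true_pos = 0
    while i < len(dets) and j < len(refs):
        diff = dets[i] - refs[j]
        if abs(diff) <= tolerance:
            true_pos += 1      # matched pair, consume both
            i += 1
            j += 1
        elif diff < 0:
            i += 1             # detection too early: false positive
        else:
            j += 1             # annotation has no detection: false negative
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(dets)
    recall = true_pos / len(refs)
    return 2 * precision * recall / (precision + recall)


def best_shift(detections: Iterable[float], annotations: Iterable[float],
               shifts: Iterable[float],
               tolerance: float = 0.025) -> Tuple[float, float]:
    """Return the (shift, F1) pair with the highest F1 over the candidates."""
    dets, refs = list(detections), list(annotations)
    scored = ((s, f1_score(dets, refs, tolerance, s)) for s in shifts)
    return max(scored, key=lambda pair: pair[1])


# Hypothetical data: annotations placed 40 ms earlier than the detections.
annotations = [1.0, 2.0, 3.0]
detections = [1.04, 2.04, 3.04]
unshifted_f1 = f1_score(detections, annotations)
shift, shifted_f1 = best_shift(detections, annotations, [-0.04, 0.0, 0.04])
print(unshifted_f1, shift, shifted_f1)
```

In this toy case every detection falls outside the tolerance window until the annotations are shifted by 40 ms, which is the kind of systematic annotation offset the paper is concerned with. A real evaluation would scan a finer grid of shifts and use the matching rule of the chosen benchmark.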
The literature is rich with solutions proposed for detecting note onsets. Each solution starts by deciding on an onset definition, which may depend on several factors: application, target instruments, available datasets, labeling method, availability of annotators, etc. The selected definition has a fundamental impact on how the ground-truth annotations are generated, which may in turn drastically affect how the detections are evaluated. In many MIR applications, such as tempo estimation, song search engines, and music synthesis, algorithms make more use of the onsets' relative positions than of their absolute positions. On the other hand, for applications seeking exact onset times, the latter are generally defined with a certain tolerance and adapted to the target application.

Before analyzing the performance evaluation, we first summarize the seeking part of the game. It usually follows a scheme of three steps [3], illustrated in Fig. 1. Most of the research is focused on the middle step, trying to come up with a better Onset Detection Function (ODF), which is defined as a highly sub-sampled version of the input music signal presenting distinguishable amplitude peaks corresponding to onset locations. Existing ODFs are grouped in two main classes: probabilistic and non-probabilistic. According to the MIREX results of recent years [11], the best-performing state-of-the-art non-probabilistic algorithm has alternated between ComplexFlux [12] and SuperFlux [13], both of which are based on LogSpecFlux (LSF) [14], [15], a method detecting onsets by spec-