MULTI-RESOLUTION LINEAR PREDICTION BASED FEATURES FOR AUDIO ONSET DETECTION WITH BIDIRECTIONAL LSTM NEURAL NETWORKS

Erik Marchi 1, Giacomo Ferroni 2, Florian Eyben 1, Leonardo Gabrielli 2, Stefano Squartini 2, Björn Schuller 3,1

1 Machine Intelligence & Signal Processing Group, Technische Universität München, GERMANY
2 A3LAB, Department of Information Engineering, Università Politecnica delle Marche, ITALY
3 Department of Computing, Imperial College London, UK

ABSTRACT

A plethora of onset detection methods have been proposed in recent years. However, few attempts have been made at widely applicable approaches that achieve superior performance across different types of music with considerable temporal precision. In this paper, we present a multi-resolution approach based on the discrete wavelet transform (DWT) and linear prediction filtering (LPF) that improves the time resolution and performance of onset detection in different musical scenarios. In our approach, wavelet coefficients and forward prediction errors are combined with auditory spectral features and then processed by a bidirectional Long Short-Term Memory recurrent neural network, which acts as a reduction function. The network is trained with a large database of onset data covering various genres and onset types. We compare results with state-of-the-art methods on a dataset that includes the Bello, Glover and ISMIR 2004 Ballroom sets, and we conclude that our approach significantly outperforms existing methods in terms of F-measure. For pitched non-percussive music, an absolute improvement of 7.5% is reported.

Index Terms— Audio Onset Detection, Linear Prediction, Discrete Wavelet Transform, Neural Networks, Bidirectional Long Short-Term Memory

1. INTRODUCTION

Audio Onset Detection (AOD) aims to identify the single temporal instant that characterises the beginning of an acoustic event.
Automatic detection of events in audio signals is exploited in many audio applications including content delivery, compression, indexing, retrieval [1], automatic music transcription [2, 3], and beat detection [4]. A note can be modelled as a sequence of three events [1]: the attack, the time interval during which the amplitude envelope increases; the transient, during which the signal evolves quickly in a non-trivial way and is characterised by non-stationary and abrupt changes in amplitude, phase or spectral content; and the onset, the single instant that marks the beginning of the transient. Onsets can be classified into two main categories: hard and soft. A hard onset is characterised by a steep attack and abrupt changes (e.g. percussion instruments), which make it simple to detect by analysing the energy; a soft onset, conversely, has a smoothed attack (e.g. strings or bowed and wind instruments), for which energy-based onset detection performs poorly.

The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement No. 289021 (ASC-Inclusion). Correspondence should be addressed to erik.marchi@tum.de.

Fig. 1. Basic onset detection block diagram. Block labels: x[n], Features Extraction, BLSTM-RNN, ODF, Thresholding, Peak-Picking, Onset.

1.1. Related work

Several onset detection methods have been proposed in recent years, and they traditionally rely only on spectral and/or phase information. Energy-based approaches [1, 4, 5] show that energy variations are quite reliable in discriminating onset position, especially for hard onsets. Other, more comprehensive studies attempt to improve soft-onset detection using phase information [1, 5, 6], and combine both energy and phase information to detect any type of onset [7].
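To make the energy-based idea concrete, the following is a minimal sketch of an energy-difference detection function followed by simple thresholding and peak picking. It is not the method proposed in this paper nor the exact formulation of any cited work; the function names (`energy_odf`, `pick_peaks`), the frame length, the hop size, the log compression, and the relative threshold are all illustrative assumptions.

```python
import numpy as np

def energy_odf(x, frame_len, hop):
    """Half-wave-rectified difference of short-time log-energy.

    Illustrative baseline only; frame_len and hop are assumed values.
    """
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])
    energy = np.log1p(np.sum(frames ** 2, axis=1))   # compressed frame energy
    diff = np.diff(energy, prepend=energy[0])        # frame-to-frame change
    return np.maximum(diff, 0.0)                     # keep energy increases only

def pick_peaks(odf, threshold):
    """Return indices of local maxima of the ODF that exceed the threshold."""
    return [i for i in range(1, len(odf) - 1)
            if odf[i] > threshold and odf[i] >= odf[i - 1] and odf[i] > odf[i + 1]]

# Toy signal (assumed): one second of silence at 8 kHz with a short
# 440 Hz burst starting at sample 4000, i.e. a single hard onset.
sr, frame_len, hop = 8000, 256, 128
x = np.zeros(sr)
x[4000:4400] = np.sin(2 * np.pi * 440 * np.arange(400) / sr)

odf = energy_odf(x, frame_len, hop)
onsets = pick_peaks(odf, threshold=0.5 * odf.max())
onset_times = [i * hop / sr for i in onsets]         # frame index -> seconds
```

On this toy signal the detection function peaks at the first frame that overlaps the burst. A soft onset, with a slowly rising envelope, would spread the energy increase over many frames and produce a much weaker peak, which is the limitation of energy-based detection noted above.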
Further studies exploit multi-resolution analysis [8], taking advantage of the sub-band representation, and apply a psychoacoustic approach [9, 10] to mimic the human perception of loudness. Finally, other methods use the linear prediction error to obtain a new onset detection function [11, 12, 13]. In particular, we compare our proposed method with common approaches such as spectral difference (SD) [1], high frequency content (HFC), spectral flux (SF) [14], and SuperFlux [15], which basically rely on the temporal evolution of the magnitude spectrogram by computing the difference between two consecutive short-time spectra. Furthermore, we evaluate other approaches based on auditory spectral features (ASF) [4] and on the complex domain (CD) [16], which incorporates magnitude and phase information.

1.2. Contribution

A traditional onset detection workflow is given in Figure 1: the input audio signal x[n] is preprocessed and suitable features are extracted. The feature vectors are then processed by the onset detection function (ODF) before the actual onsets are detected via a peak detection function. In this paper we propose a novel approach that relies on Wavelet Coefficients (WC) and the Forward Prediction Error (FPE) envelope to detect onsets by exploiting the non-stationary property of the onset [11]. The novel coefficients, combined with auditory spectral features [4], are used as input for a Bidirectional Long Short-Term Memory (BLSTM) recurrent neural network [17], which acts as a reduction operator leading to the onset position. We show that our novel approach significantly outperforms the other methods from the literature.

After detailing the multi-resolution and linear prediction based coefficients in Section 2, we describe the LSTM neural networks in Section 3. Section 4 describes the experiments conducted, be-