A Learning Scheme for Generating Expressive Music Performances of Jazz Standards

Rafael Ramirez and Amaury Hazan
Music Technology Group, Pompeu Fabra University
Ocata 1, 08003 Barcelona, Spain
{rafael,ahazan}@iua.upf.es

Abstract

We describe our approach for generating expressive music performances of monophonic Jazz melodies. It consists of three components: (a) a melodic transcription component which extracts a set of acoustic features from monophonic recordings, (b) a machine learning component which induces an expressive transformation model from the set of extracted acoustic features, and (c) a melody synthesis component which generates expressive monophonic output (MIDI or audio) from inexpressive melody descriptions using the induced expressive transformation model. In this paper we concentrate on the machine learning component, in particular on the learning scheme we use for generating expressive audio from a score.

1 Introduction

Expressive performance is an important issue in music which has been studied from different perspectives [Gabrielsson, 1999]. The main approaches to empirically studying expressive performance have been based on statistical analysis (e.g. [Repp, 1992]), mathematical modelling (e.g. [Todd, 1992]), and analysis-by-synthesis (e.g. [Friberg, 1995]). In all these approaches, it is a person who is responsible for devising a theory or mathematical model which captures different aspects of expressive music performance. Recently, there has been work on applying machine learning techniques to the study of expressive performance. Widmer [Widmer, 2002] has focused on the tasks of discovering general rules of expressive classical piano performance and of recognizing famous pianists from their playing style. Lopez de Mantaras et al. [Lopez de Mantaras, 2002] reported on SaxEx, a case-based reasoning system capable of inferring a set of expressive transformations and applying them to a Jazz solo performance. In this paper we describe an approach to investigating expressive music performance based on inductive machine learning. In particular, we are interested in monophonic Jazz melodies performed by a saxophonist. Our work differs from that of Widmer in that, being focused on saxophone Jazz performances, we are interested in intra-note variations (e.g. vibrato) which are absent in piano playing, as well as in melody alterations (e.g. onset deviations, ornamentations) which are normally considered performance errors in classical music. The work of Lopez de Mantaras et al. is similar to ours, but their system is unable to explain its predictions. The deviations and changes we consider concern note duration, note onset, note energy, and intra-note features (e.g. attack, vibrato). The study of these variations is the basis of an inductive content-based transformation tool for generating expressive performances of musical pieces. The tool can be divided into three components: a melodic transcription component, a machine learning component, and a melody synthesis component. In the following, we briefly describe each of these components.

2 Melodic description

Sound analysis and synthesis techniques based on spectral models are used for extracting high-level symbolic features from the recordings. These spectral-model analysis techniques are based on decomposing the original signal into sinusoids plus a spectral residual.
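As a rough illustration of this decomposition (a minimal sketch, not the analysis actually used in this work, which performs proper partial tracking), the Python fragment below approximates a sinusoidal-plus-residual split by retaining only spectral peaks in a short-time Fourier transform; the function name and parameter values are our own placeholder choices.

import numpy as np
from scipy.signal import stft, istft

def sinusoidal_residual(x, fs, n_fft=2048, hop=256, n_peaks=30):
    """Split a monophonic signal x into a sinusoidal part and a residual.

    Crude STFT-masking approximation of spectral-model analysis: in each
    frame, the n_peaks strongest local maxima of the magnitude spectrum
    are kept as sinusoidal content; everything else is the residual.
    """
    _, _, X = stft(x, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    mag = np.abs(X)
    mask = np.zeros(X.shape, dtype=bool)
    for j in range(X.shape[1]):
        frame = mag[:, j]
        # Bins that are local maxima of the magnitude spectrum.
        peaks = np.flatnonzero((frame[1:-1] > frame[:-2]) &
                               (frame[1:-1] >= frame[2:])) + 1
        # Keep the strongest peaks, plus one bin on each side.
        for p in peaks[np.argsort(frame[peaks])][-n_peaks:]:
            mask[p - 1:p + 2, j] = True
    _, sines = istft(np.where(mask, X, 0.0), fs=fs,
                     nperseg=n_fft, noverlap=n_fft - hop)
    n = min(len(x), len(sines))
    return sines[:n], x[:n] - sines[:n]  # sinusoidal part, residual

Resynthesizing the sinusoidal part alone, or modifying it before adding the residual back, mirrors the analysis/transformation/synthesis cycle described in this section.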
From the sinusoids of a monophonic signal it is possible to extract information on note pitch, onset, duration, attack and energy, among other high-level information. This information can be modified and the result added back to the spectral representation without loss of quality. We use the SMSTools software, which is well suited both for preprocessing the signal and providing a high-level description of the audio recordings, and for generating expressive audio according to the transformations obtained by the machine learning methods.

The low-level descriptors used to characterize the melodic features of our recordings are instantaneous energy and fundamental frequency. The procedure for computing the descriptors is first to divide the audio signal into analysis frames and to compute a set of low-level descriptors for each frame. Then, a note segmentation is performed using the low-level descriptor values. Once the note boundaries are known, the note descriptors are computed from the low-level descriptor and fundamental frequency values (see [Gomez et al., 2003] for details about the algorithm).

3 Expressive performance knowledge induction

Data set. The training data used in our experimental investigations are monophonic recordings of three Jazz standards
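To indicate how such recordings, once transcribed into note descriptors as in Section 2, might be turned into training material for the learning component, the sketch below pairs each performed note with its score note and computes the deviation targets named in the Introduction (duration, onset, energy). The Note fields, the one-to-one pairing, and the feature/target split are illustrative assumptions, not the exact representation used in this work.

from dataclasses import dataclass

@dataclass
class Note:
    pitch: int       # MIDI note number
    onset: float     # seconds; nominal for score notes, measured for performed ones
    duration: float  # seconds
    energy: float    # mean note energy (only meaningful for performed notes)

def training_examples(score, performance):
    """Pair score notes with performed notes and compute deviation targets.

    Assumes both sequences are monophonic and aligned one-to-one; real
    performances with ornamentations need a score-to-performance
    alignment step, which is omitted here.
    """
    examples = []
    for s, p in zip(score, performance):
        features = {"pitch": s.pitch, "score_duration": s.duration}
        targets = {
            "duration_ratio": p.duration / s.duration,  # lengthening/shortening
            "onset_deviation": p.onset - s.onset,       # anticipation/delay
            "energy": p.energy,                         # dynamics
        }
        examples.append((features, targets))
    return examples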