PERIODIC SIGNAL EXTRACTION WITH GLOBAL AMPLITUDE AND PHASE MODULATION FOR MUSIC SIGNAL DECOMPOSITION

Mahdi Triki, Dirk T.M. Slock

Eurecom Institute
2229 route des Crêtes, B.P. 193, 06904 Sophia Antipolis Cedex, FRANCE
Email: {triki,slock}@eurecom.fr

ABSTRACT

A key building block in music transcription and indexing operations is the decomposition of the music signal into notes. We model a note signal as a periodic signal with (slow) global variation of amplitude (reflecting attack, sustain, decay) and frequency (limited time warping). The bandlimited variation of global amplitude and frequency gets expressed through a subsampled representation and parameterization of the corresponding signals. Assuming additive white Gaussian noise, a Maximum Likelihood approach is proposed for the estimation of the model parameters, and the optimization is performed in an iterative (cyclic) fashion that leads to a sequence of simple least-squares problems. Particular attention is paid to the estimation of the basic periodic signal, which can have a non-integer period, and the estimation of the amplitude signal with guaranteed positivity.

1. INTRODUCTION

Sinusoidal model based music analysis/synthesis has received considerable interest in the computer music community [4, 5, 6]. The sinusoidal transform, originally developed by Quatieri and McAulay [3], represents a signal as a sum of discrete time-varying sinusoids or partials:

(1)

The estimation of the model parameters is typically carried out using a short-time Fourier transform (STFT) with a fixed analysis frame size and a fixed stride between frames. The sinusoids are extracted by peak-picking in the STFT magnitude spectrum. Intermediate values are obtained by interpolation. A fundamental problem faced by the traditional sinusoidal-model based techniques, and which arises due to the STFT, is smearing of the frequency response [8, 7].
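The peak-picking step and the smearing it suffers from can be illustrated with a minimal sketch. It is not the procedure of the cited references: the helper names (`hann`, `dft_magnitude`, `pick_peaks`, `analyse_frame`), the relative threshold, the 1 kHz sampling rate and the two test partials are all assumptions chosen for illustration, and the DFT is evaluated directly rather than with an FFT so that the example needs only the Python standard library.

```python
import cmath
import math


def hann(n_samples):
    """Periodic Hann window of length n_samples."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_samples)
            for n in range(n_samples)]


def dft_magnitude(frame):
    """DFT magnitude for bins 0..N/2, evaluated directly in O(N^2)."""
    n_samples = len(frame)
    mags = []
    for k in range(n_samples // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / n_samples)
                  for n, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def pick_peaks(mags, rel_threshold=0.1):
    """Bins that are local maxima and exceed rel_threshold * max(mags)."""
    floor = rel_threshold * max(mags)
    return [k for k in range(1, len(mags) - 1)
            if mags[k] > floor
            and mags[k] > mags[k - 1]
            and mags[k] >= mags[k + 1]]


def analyse_frame(freqs_hz, fs_hz, n_samples):
    """Peak frequencies (Hz) found in one Hann-windowed frame of a sum of cosines."""
    frame = [sum(math.cos(2.0 * math.pi * f * n / fs_hz) for f in freqs_hz)
             for n in range(n_samples)]
    windowed = [w * x for w, x in zip(hann(n_samples), frame)]
    mags = dft_magnitude(windowed)
    return [k * fs_hz / n_samples for k in pick_peaks(mags)]


if __name__ == "__main__":
    partials = [100.0, 120.0]  # two partials 20 Hz apart (illustrative values)
    # 200 samples at 1 kHz: 5 Hz bin spacing, the two partials are separated.
    print(analyse_frame(partials, 1000.0, 200))  # [100.0, 120.0]
    # 25 samples at 1 kHz: 40 Hz bin spacing, the partials merge into one peak.
    print(analyse_frame(partials, 1000.0, 25))
```

In this setting the long frame resolves both partials but spans 200 ms, while the short frame spans 25 ms and reports a single smeared peak, which is the time/frequency tradeoff at issue here.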
In fact, over the duration of a single analysis frame, the algorithm estimates the amplitude, frequency and phase of every sinusoid it believes to be present. Because of the nearly logarithmic scale of pitch perception, very long windows are needed to estimate the pitch of low frequency partials accurately.

Eurécom's research is partially supported by its industrial partners: Hasler Foundation, Swisscom, Thales Communications, ST Microelectronics, CEGETEL, France Télécom, Bouygues Telecom, Hitachi Europe Ltd. and Texas Instruments. The work reported herein was also partially supported by the SIEPIA project of the French RIAM network (Recherche et Innovation en Audiovisuel et Multimédia).

On the other hand, the time resolution of these parameters is only as fine as the window length itself. Since the music signal is strongly non-stationary, it is not always possible to find a good tradeoff between time and frequency resolution. Also, determining the sinusoid parameters from the STFT peak amplitude and phase only works well for high frequency resolution, high SNR and in the absence of modulation.

Another drawback of these techniques is that they ignore the harmonic structure of the music signal. In fact, they consider the signal as a mixture of a finite number of arbitrary sinusoids, and not as a periodic signal. For periodic signals, the state of the art is limited to the estimation of pure periodic signals with period equal to an integer number of samples [1, 2]. In these references, the authors propose a Maximum Likelihood approach to analyze pure periodic signals. They show that the resulting procedure can be interpreted as a signal projection onto suitable subspaces. This paper extends the results of those references and aims to merge modulated sinusoidal modeling with periodic signal analysis techniques, by considering periodic signals with non-integer period, global amplitude variation and time warping.
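To make this signal class concrete, the following sketch synthesizes a note as a periodic waveform with a non-integer period, multiplied by a slow nonnegative amplitude envelope and read at slowly warped time instants. It is an illustration under stated assumptions, not the model equations of the paper: the function names, the piecewise-linear attack/sustain/decay envelope, the sinusoidal vibrato warp and all numerical values are hypothetical.

```python
import math


def periodic_waveform(t, period, harmonic_amps):
    """Periodic prototype built from a few harmonics.

    `period` is in samples and may be non-integer; `t` may be fractional.
    """
    return sum(c * math.cos(2.0 * math.pi * (h + 1) * t / period)
               for h, c in enumerate(harmonic_amps))


def envelope(n, n_samples, attack=0.1, decay=0.3):
    """Nonnegative piecewise-linear attack/sustain/decay amplitude (assumed shape)."""
    x = n / (n_samples - 1)
    if x < attack:
        return x / attack
    if x > 1.0 - decay:
        return (1.0 - x) / decay
    return 1.0


def warp(n, depth, rate_hz, fs_hz):
    """Slow sinusoidal time shift in samples, standing in for vibrato (assumed form)."""
    return depth * math.sin(2.0 * math.pi * rate_hz * n / fs_hz)


def synthesize_note(n_samples, fs_hz, f0_hz, harmonic_amps,
                    vib_depth=0.0, vib_rate_hz=5.0):
    """Globally modulated note: envelope(n) * waveform(n - warp(n))."""
    period = fs_hz / f0_hz  # generally not an integer number of samples
    return [envelope(n, n_samples)
            * periodic_waveform(n - warp(n, vib_depth, vib_rate_hz, fs_hz),
                                period, harmonic_amps)
            for n in range(n_samples)]


if __name__ == "__main__":
    # 440 Hz at 8 kHz gives a period of about 18.18 samples.
    note = synthesize_note(800, 8000.0, 440.0, [1.0, 0.5, 0.25], vib_depth=0.2)
    print(len(note), max(abs(v) for v in note))
```

Because the same envelope and the same warp act on every harmonic, the whole note is described by one waveform period plus two slowly varying signals, rather than by independent amplitude and phase tracks per partial.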
The use of this model gives a compromise between reality and a parsimonious parameterization. Indeed, global amplitude variation mostly reflects the attack, sustain, and decay of the whole note signal, whereas global time warping allows the capture of vibrato and sliding notes. With an eye on future extensions to polyphonic sounds, the method should be able to work at fairly low SNR. Hence it is important to have parsimonious parameterizations in order to limit the estimation noise. The motivation for the proposed model is to provide a good compromise between approximation noise and estimation noise.

In music, the nominal frequency of a note is known. So we assume an analysis that explores the hypothesis of the presence of a note at every possible nominal note frequency. However, we do not treat the harmonics of a note signal separately as a simple filter bank approach would do (this is basically the state of the art in music signal analysis). Rather, the energy in all harmonics is exploited jointly through the treatment of the complete periodic signal, in order to make the detection of the note signal and the estimation of its modulation characteristics more robust. The Global Modulation (GM) assumption helps the separation of note signals that have harmonics in common.

This paper is organized as follows. In section 2, the global modulation model is presented. The extraction procedure is then derived in section 3. Performance of the algorithm is evalu-

III - 233    0-7803-8874-7/05/$20.00 ©2005 IEEE    ICASSP 2005