Improving a Multiple Pitch Estimation Method With AR Models

Tiago Fernandes Tavares¹, Jayme Garcia Arnal Barbedo¹, and Amauri Lopes¹

¹ School of Electrical and Computer Engineering, University of Campinas, Campinas, SP, Brazil

Correspondence should be addressed to Tiago Fernandes Tavares (tavares@dca.fee.unicamp.br)

ABSTRACT

Multiple pitch estimation (MPE) methods aim to detect the pitches of the sounds that are part of a certain mixture. One possible approach to this problem is to apply a FIR filter bank in the frequency domain and choose the filter whose output has the most energy. This process is equivalent to computing a set of weighted sums of the frequency-domain components of a signal. When spectral lobes corresponding to existing partials merge, this process may fail. In this paper, AR models were used to provide a spectral representation in which lobes tend to merge less. The proper choice of model significantly improved the MPE method.

1. INTRODUCTION

Multiple pitch estimation (MPE) is the task of detecting the pitches of the sounds that are part of a certain mixture. It is often referred to as multiple fundamental frequency estimation because of the close relationship between the fundamental frequency (F0) of a sound and its pitch. Devices performing MPE are often used for automated and semi-automated music analysis, for example in pitch visualization, voice cancellation and automatic music transcription tasks. Therefore, improvements in the MPE subsystem are likely to improve more complex systems.

A recent method for MPE was proposed by Klapuri [1] and used in later applications such as automatic music transcription [2, 3] and a vocal remover for karaoke [4]. Because of the number of applications using Klapuri's method and, in addition, its simplicity, the algorithm was used as the basis for this work. In this paper, modifications to the method are proposed, aiming to improve its accuracy.
The basic idea of Klapuri's approach to MPE is to calculate a weighted sum of the spectrum $|X[k]|$ of a signal for each F0 candidate $f$. The weights $g_f[k]$ are modeled considering the F0 candidates. The result is called the saliency ($s_f$) of a certain F0 candidate [1]. The whole process is described by the following model:

$$ s_f = \sum_{k=0}^{K-1} g_f[k]\,|X[k]|. \qquad (1) $$

Usually, the spectrum $|X[k]|$ is defined as the modulus of the Discrete Fourier Transform (DFT) of a short time frame of the signal. This yields a spectral representation in which each sinusoidal component corresponds to a lobe with a certain width [5], which may lead to interference between frequency components and harm the performance of MPE algorithms. In this paper, it is shown that different spectral estimation techniques may be used to reduce such interference, significantly improving the algorithm's accuracy and therefore potentially improving the performance of several audio-related applications.

As broadly discussed by Kay [6], the frequency response of a linear Auto-Regressive (AR) model obtained from samples of a framed signal may be used as a spectral estimate in which sinusoidal components generally correspond to narrower lobes than in the DFT. However, obtaining a proper AR model from data is not a trivial task. It consists of determining the optimal model order as well as the best technique for obtaining the model parameters from a limited number of samples. Both choices must be made empirically, aiming to optimize the performance of the application rather than a mean square error criterion.

In this paper, the authors' own implementation of Klapuri's MPE algorithm [1] was modified so that the spectral estimation was performed using AR techniques. Three different methods for obtaining the AR model parameters were used: the correlation and the covariance methods [7], the modified covariance [8] and Burg's method [9], which are available in computational packages [10].
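As an illustration of how an AR spectral estimate can replace the DFT modulus in Eq. (1), the following Python sketch may be helpful. It is not the authors' implementation. It assumes that the "correlation method" corresponds to the autocorrelation (Yule-Walker) method, and it uses hypothetical weights (1/m at the bin nearest the m-th harmonic of each candidate) rather than the weights defined by Klapuri [1]. The function names, the model order, the frame length and the test signal are likewise assumptions chosen only for the example.

```python
import numpy as np


def ar_spectrum_autocorr(x, order, n_bins):
    """AR magnitude spectrum on n_bins frequencies in [0, fs/2),
    using the autocorrelation (Yule-Walker) method."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Biased autocorrelation estimates r[0..order].
    r = np.array([np.dot(x[: n - lag], x[lag:]) / n for lag in range(order + 1)])
    # Yule-Walker equations: R a = -r[1:], with R Toeplitz.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1:])
    # Power of the prediction error.
    sigma2 = r[0] + np.dot(a, r[1:])
    # A(e^{jw}) evaluated through a zero-padded FFT of [1, a_1, ..., a_p].
    A = np.fft.rfft(np.concatenate(([1.0], a)), 2 * n_bins)[:n_bins]
    return np.sqrt(sigma2) / np.abs(A)


def harmonic_weights(f0_candidates, n_bins, fs, n_harmonics=5):
    """Hypothetical weights g_f[k]: 1/m at the bin nearest the m-th harmonic."""
    g = np.zeros((len(f0_candidates), n_bins))
    for i, f0 in enumerate(f0_candidates):
        for m in range(1, n_harmonics + 1):
            k = int(round(m * f0 * 2 * n_bins / fs))
            if k < n_bins:  # ignore harmonics at or above fs/2
                g[i, k] += 1.0 / m
    return g


def salience(g, spectrum):
    """Eq. (1): one weighted sum of the spectrum per F0 candidate."""
    return g @ spectrum


# Assumed test frame: a 1 kHz sinusoid with a small amount of noise.
fs = 8000.0
n_bins = 2048
t = np.arange(512) / fs
rng = np.random.default_rng(0)
frame = np.sin(2 * np.pi * 1000.0 * t) + 0.01 * rng.standard_normal(t.size)

spectrum = ar_spectrum_autocorr(frame, order=4, n_bins=n_bins)
peak_hz = np.argmax(spectrum) * fs / (2 * n_bins)

candidates = [250.0, 500.0, 1000.0]
s = salience(harmonic_weights(candidates, n_bins, fs), spectrum)
best_f0 = candidates[int(np.argmax(s))]
```

Writing the weights as a matrix with one row per candidate turns Eq. (1) into a single matrix-vector product, so the spectral estimator can be swapped (DFT modulus, or any of the AR methods) without changing the saliency computation.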
AES 42nd International Conference, Ilmenau, Germany, 2011 July 22–24

Experiments were conducted to discover