Copyright © 2006 IEEE. Reprinted from Procs. IEEE 2007 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2007), scheduled for April 16-20, 2007 in Honolulu, Hawaii. This material is posted here with permission of the IEEE. Such permission of the IEEE does not in any way imply IEEE endorsement of any of ECESS's products or services. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to pubs-permissions@ieee.org. By choosing to view this document, you agree to all provisions of the copyright laws protecting it.

EVALUATION OF PITCH DETECTION ALGORITHMS UNDER REAL CONDITIONS

Iker Luengo, Ibon Saratxaga, Eva Navas, Inmaculada Hernáez, Jon Sanchez, Iñaki Sainz
{iker, ibon, eva, inma, ion, inaki}@bips.bi.ehu.es
Aholab-Signal Processing Laboratory – Faculty of Engineering
University of the Basque Country
Urkijo zum. z/g 48013. Bilbao-Spain.

ABSTRACT

A novel algorithm based on classical cepstrum calculation followed by dynamic programming is presented in this paper. The algorithm has been evaluated with a 60-minute database containing 60 speakers and different recording conditions and environments. A second reference database has also been used. In addition, the performance of four popular pitch detection algorithms (PDAs) has been evaluated with the same databases. The results show the good performance of the described algorithm in noisy conditions. Furthermore, the paper is a first initiative to evaluate widely used PDAs over an extensive and realistic database.

Index Terms— Speech analysis, Pitch detection

1. INTRODUCTION

Pitch detection and marking are a recurrent topic in the literature of the speech research community.
The interest arises naturally from the enormous range of applications and technologies that need and use a pitch detection algorithm (PDA). Precise calculation of the fundamental frequency of the speech signal has proven to be a basic task in almost all areas of speech research, from traditional areas such as speech coding to more recent ones such as novel speech synthesis techniques or the characterization of a speaker's emotional state.

Improving on the first proposed methods, which were based on the periodicity of the speech spectrum in voiced segments [1], a great variety of algorithms have been proposed (see [2] for a review of classic methods). Some of them are very popular, either because they are publicly available or because they come packaged with some software analysis tool [3][4][5][6]. Considering that many users of these packages are not necessarily part of the speech research community (linguists, educators, speech therapists…), setting references and standards for the evaluation of their quality becomes a necessary task.

The perfect pitch detector should perform well under any reasonable noise or bandwidth condition. In that respect, several robust pitch detection algorithms have been proposed that claim to perform well under different noise conditions [7][8]. However, up to now no work has been published describing the evaluation of any algorithm with a significant amount of data. Some papers describe the performance of the algorithm only in a qualitative way, using a reduced set of signals and speakers. Others use ad-hoc small to medium size databases (some minutes) with very few speakers (2 to 5). In recent years, two speech databases have been used as references for evaluation, mainly due to their public availability: the CSTR database and the Keele Pitch Reference database [3][5][9][10][11][12]. The first is about 5 minutes long and contains data from two speakers [9].
The second is about 10 minutes long with speech from five males, five females and five children [13].

This paper presents a novel pitch detection algorithm based on a classic representation (the cepstrum coefficients) followed by dynamic programming. We also present its evaluation, comparing its performance with that of four other popular algorithms, using the CSTR database and a 60-minute database recorded by 60 speakers over four different recording channels [14].

Section 2 of this paper describes the cepstrum-based detection algorithm and the conditions used to select the best path by dynamic programming. Section 3 presents the experiments performed and their results. Conclusions are drawn in Section 4.

2. CEPSTRUM BASED PITCH DETECTION ALGORITHM

The proposed algorithm, called CDP, is based on cepstrum calculation followed by a dynamic programming module. After windowing the input signal, a set of pitch candidates is generated. This set is used by the dynamic programming algorithm to select the best pitch curve. As a final step, this curve is smoothed.

2.1. Selection of F0 candidates

The input signal is windowed by means of a Hamming window 58 ms long. The window length has been chosen to span at least two periods in the minimum pitch case. Pitch values in the range from 35 Hz to 500 Hz have been considered. This range also applies to the selection of the cepstrum coefficients, in such a way that only coefficients with indexes in the range $[i_{\max}, i_{\min}]$ are considered:

$$ i_{\max} = \left\lfloor \frac{F_s}{f_{\max}} \right\rfloor, \qquad i_{\min} = \left\lfloor \frac{F_s}{f_{\min}} \right\rfloor \tag{1} $$

with $F_s$ being the sampling frequency. The indexes are calculated as the closest integer to the bracketed expression.

Before searching for the maximum coefficient, whose index is supposed to give the pitch value, the coefficients $c_i$ are normalized to their mean value $\bar{c}$ inside the considered frame, giving the normalized coefficients $c'_i$:

$$ c'_i = \frac{c_i}{\bar{c}} \tag{2} $$
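
The candidate-selection steps of Section 2.1 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the use of the real cepstrum (inverse FFT of the log-magnitude spectrum), the small constant added before the logarithm, and the choice to take the mean in Equation (2) over the coefficients inside the search range are all assumptions made for illustration. The index bounds use the floor operation as typeset in Equation (1); `round` could be swapped in if nearest-integer rounding is intended.

```python
import numpy as np

def quefrency_index_range(fs, f_min=35.0, f_max=500.0):
    """Equation (1): cepstral index bounds for the pitch search range.

    i_max corresponds to the highest pitch (shortest period) and
    i_min to the lowest pitch (longest period), so i_max < i_min.
    """
    i_max = int(np.floor(fs / f_max))
    i_min = int(np.floor(fs / f_min))
    return i_max, i_min

def frame_cepstrum(frame):
    """Real cepstrum of one Hamming-windowed frame (assumed definition)."""
    windowed = frame * np.hamming(len(frame))
    # 1e-12 is a hypothetical guard against log(0), not taken from the paper.
    log_mag = np.log(np.abs(np.fft.rfft(windowed)) + 1e-12)
    return np.fft.irfft(log_mag, n=len(frame))

def normalized_candidates(frame, fs, f_min=35.0, f_max=500.0):
    """Equation (2): coefficients in [i_max, i_min] divided by their mean.

    Assumption: the mean is computed over the selected coefficients only.
    The result is undefined if that mean is zero.
    """
    i_max, i_min = quefrency_index_range(fs, f_min, f_max)
    c = frame_cepstrum(frame)[i_max:i_min + 1]
    return c / np.mean(c)

# Usage with a hypothetical 16 kHz signal and the 58 ms window of the paper.
fs = 16000
frame_len = int(round(0.058 * fs))   # 928 samples
frame = np.random.default_rng(0).standard_normal(frame_len)
c_norm = normalized_candidates(frame, fs)
```

At a 16 kHz sampling rate the 58 ms frame holds 928 samples, and the search range covers cepstral indexes 32 to 457, which stays within the first half of the cepstrum.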