NOISE-ROBUST F0 ESTIMATION USING SNR-WEIGHTED SUMMARY CORRELOGRAMS FROM MULTI-BAND COMB FILTERS

Lee Ngee Tan and Abeer Alwan

Department of Electrical Engineering, University of California, Los Angeles
{tleengee, alwan}@ee.ucla.edu

ABSTRACT

A noise-robust, signal-to-noise ratio (SNR)-weighted, correlogram-based pitch estimation algorithm (PEA) is proposed in which a bank of comb filters operates in each of the low, mid, and high frequency bands. Correlograms are obtained by applying autocorrelation directly to the low-frequency filterbank (FBK) output and to the output envelopes of all three FBKs. An SNR-weighting scheme is used for channel selection to yield a summary correlogram for each FBK. These summary correlograms are averaged to obtain an overall summary correlogram, which is time-smoothed before peak extraction is performed. The final pitch contour is obtained via dynamic programming. The proposed PEA is evaluated on the Keele corpus with additive white or babble noise. Compared with widely used PEAs, the proposed PEA has the lowest overall gross pitch error (GPE), especially at low SNRs.

Index Terms: Pitch estimation, correlogram, multi-band, comb filtering, noise-robustness

1. INTRODUCTION

Fundamental frequency (F0), or pitch, information of voiced speech is required for many speech applications. Although F0 estimation is a well-researched topic, accurate F0 estimation in noise still poses a challenge. Pitch estimation algorithms (PEAs) can be broadly classified into three categories: 1) time-domain, 2) frequency-domain, and 3) time-frequency-domain. Time-domain PEAs directly exploit a signal's temporal periodicity; they include zero-crossing rate, average magnitude difference function (AMDF), and autocorrelation-based methods [1–3]. Frequency-domain PEAs estimate F0 using the signal's short-time spectral harmonicity [4, 5].
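As a minimal sketch of the autocorrelation-based time-domain approach mentioned above (an illustration only, not any of the cited algorithms), the following Python function picks the lag that maximizes the short-time autocorrelation of one frame. The function name, the 60 to 400 Hz search range, and the synthetic test frame are assumptions made for this example.

```python
import numpy as np

def f0_autocorr(frame, fs, f0_min=60.0, f0_max=400.0):
    """Estimate F0 of one frame from the peak of its short-time autocorrelation.

    Hypothetical helper for illustration; the search range is an assumption.
    """
    x = np.asarray(frame, dtype=float)
    x = x - np.mean(x)                      # remove DC before correlating
    lag_min = int(fs / f0_max)              # shortest period considered
    lag_max = int(fs / f0_min)              # longest period considered
    lags = np.arange(lag_min, lag_max + 1)
    # r(lag) = sum_i x[i] * x[i + lag]; fewer terms are summed at longer
    # lags, so the function tapers with lag and the shortest matching
    # period wins over its multiples.
    r = np.array([np.dot(x[:-lag], x[lag:]) for lag in lags])
    return fs / lags[np.argmax(r)]

# Synthetic voiced-like frame: 200 Hz fundamental plus two weaker harmonics.
fs = 8000
t = np.arange(512) / fs
frame = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in (1, 2, 3))
f0_hat = f0_autocorr(frame, fs)             # expected to be close to 200 Hz
```

On clean, strongly periodic frames this simple peak picking is usually adequate; in noise the autocorrelation peak becomes less distinct, which is the difficulty the rest of the paper addresses.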
Time-frequency-domain PEAs typically separate a signal into several frequency bands and then apply time-domain processing in each band. The auditory-model correlogram-based PEA is a popular time-frequency-domain method inspired by Licklider's duplex theory of pitch perception [6]. The signal is first decomposed into multiple frequency channels by an auditory filterbank that models the frequency analysis performed by the cochlea; gammatone auditory filters [7] are widely used for this purpose [8–11]. Autocorrelation is then applied either directly to each channel's output [10] or to its envelope. The latter is generally done for mid- and high-frequency channels (center frequencies > 1 kHz) [8, 9], whose wide bandwidths capture multiple harmonics, resulting in signal envelopes that oscillate at F0 (beats). Together, these multi-channel autocorrelations form the correlogram, from which a single F0 candidate, or possibly multiple candidates, are derived. Correlogram-based perceptual PEAs can yield estimates close to the pitch perceived by humans for signals with a missing fundamental, inharmonic complexes, and noise tones [12]. Being a multi-band approach, correlogram-based PEAs have the potential to be noise-robust, especially in the presence of colored noise.

Work supported in part by NSF and DARPA.

Signal processing schemes employing comb filters have also been proposed for F0 estimation, especially in the presence of noise and harmonic disturbances. A spectral comb analysis technique [5], which cross-correlates the spectrum with a spectral comb function whose teeth have decreasing amplitude and variable spacing, gives more accurate F0 estimates than a cepstrum-based PEA [13]. An adaptive comb filter was formulated in [14] for pitch estimation and harmonic enhancement in additive white noise.
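The spectral comb idea described above can be sketched as follows. This is a simplified illustration, not the formulation of [5] or the adaptive filter of [14]: the 1/k tooth weighting, the number of teeth, the Hann window, the FFT size, and the candidate grid are all assumptions chosen for the example.

```python
import numpy as np

def comb_f0(frame, fs, f0_grid, n_teeth=8, n_fft=8192):
    """Score candidate F0 values with a spectral comb and return the best one.

    Hypothetical helper: each candidate's score is the sum of spectral
    magnitudes at its first n_teeth harmonics, weighted by 1/k (assumed).
    """
    x = np.asarray(frame, dtype=float)
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x)), n_fft))
    k = np.arange(1, n_teeth + 1)
    scores = np.zeros(len(f0_grid))
    for i, f0 in enumerate(f0_grid):
        freqs = k * f0                              # tooth positions in Hz
        keep = freqs < fs / 2                       # stay below Nyquist
        bins = np.round(freqs[keep] * n_fft / fs).astype(int)
        scores[i] = np.sum(spec[bins] / k[keep])    # teeth of decreasing amplitude
    return f0_grid[np.argmax(scores)], scores

# Synthetic frame with harmonics at 200, 400, and 600 Hz.
fs = 8000
t = np.arange(512) / fs
frame = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in (1, 2, 3))
f0_grid = np.arange(60.0, 401.0, 1.0)               # 1 Hz candidate spacing
f0_hat, scores = comb_f0(frame, fs, f0_grid)        # expected near 200 Hz
```

The decreasing tooth weights are what keep sub-multiples of the true F0 (for example 100 Hz here) from scoring as high as the true value, since their combs also touch every harmonic but with smaller weights.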
In the presence of overlapping periodic signals, an F0-tuned comb filter has been successfully applied to notch or enhance one of the sources before F0 estimation is performed on the individual signals [15].

Motivated by the information richness of the correlogram representation and the harmonic enhancement/suppression capability of comb filters, a multi-band comb FBK correlogram-based PEA is proposed in this paper. Details of the proposed algorithm are given in Section 2. Section 3 describes the performance evaluation criteria and setup, and Section 4 presents the results of the proposed method in comparison with other PEAs. The findings are summarized in Section 5.

2. PROPOSED METHOD

The block diagram in Fig. 1 summarizes the proposed PEA.

[Figure 1 block labels: 8 kHz speech; 8192-pt FFT; low-freq comb FBK (0 - 1 kHz); mid-freq comb FBK (1 - 2 kHz); high-freq comb FBK (2 - 3 kHz); compute SNR (based on inter-harmonic noise) and perform IFFT on selected channels; envelope extraction; compute SNR-weighted summary correlograms ($sR_{low}$, $sR^{ev}_{low}$, $sR^{ev}_{mid}$, $sR^{ev}_{high}$); compute smoothed overall summary correlogram ($sR_{smooth}$); peak extraction and dynamic programming; F0 contour.]

Fig. 1. Block diagram of proposed pitch estimation algorithm. Multi-channel outputs are indicated by bold arrows.

978-1-4577-0539-7/11/$26.00 ©2011 IEEE, ICASSP 2011
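To make the pipeline in Fig. 1 more concrete, the sketch below computes a heavily simplified summary correlogram. It is not the proposed algorithm: it uses a single brick-wall band per frequency range instead of a bank of comb filters, and it omits the SNR weighting, channel selection, time smoothing, and dynamic programming. The function names, the FFT-mask band splitting, and the 60 to 400 Hz search range are assumptions for illustration.

```python
import numpy as np

def band_analytic(frame, fs, lo, hi):
    """Analytic signal of the frame restricted to (lo, hi] Hz via FFT masking."""
    spec = np.fft.fft(frame)
    freqs = np.fft.fftfreq(len(frame), d=1.0 / fs)
    mask = (freqs > lo) & (freqs <= hi)          # positive frequencies only
    return np.fft.ifft(2.0 * spec * mask)

def norm_autocorr(x):
    """Autocorrelation via the FFT (Wiener-Khinchin), normalized by lag 0."""
    x = x - np.mean(x)
    n = len(x)
    spec = np.fft.rfft(x, 2 * n)                 # zero-pad to avoid wrap-around
    r = np.fft.irfft(np.abs(spec) ** 2)[:n]
    return r / max(r[0], 1e-12)                  # guard against an empty band

def summary_correlogram_f0(frame, fs, f0_min=60.0, f0_max=400.0,
                           bands=((0, 1000), (1000, 2000), (2000, 3000))):
    """Average per-band autocorrelations and pick the strongest lag."""
    channels = []
    for i, (lo, hi) in enumerate(bands):
        z = band_analytic(frame, fs, lo, hi)
        if i == 0:
            channels.append(norm_autocorr(z.real))   # low band: waveform itself
        channels.append(norm_autocorr(np.abs(z)))    # every band: envelope
    summary = np.mean(channels, axis=0)              # unweighted average (assumed)
    lag_min, lag_max = int(fs / f0_max), int(fs / f0_min)
    lag = lag_min + np.argmax(summary[lag_min:lag_max + 1])
    return fs / lag, summary

# Synthetic frame: 14 equal-amplitude harmonics of 200 Hz (200 to 2800 Hz).
fs = 8000
t = np.arange(800) / fs
frame = sum(np.cos(2 * np.pi * 200 * k * t) for k in range(1, 15))
f0_hat, summary = summary_correlogram_f0(frame, fs)  # expected near 200 Hz
```

In the mid and high bands the individual harmonics are not used directly; the envelope of several adjacent harmonics beats at F0, which is why the envelope autocorrelation still peaks at the pitch period.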