Copyright © 2006 IEEE. Reprinted from Procs. IEEE 2007 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2007),
scheduled for April 16-20, 2007 in Honolulu, Hawaii. This material is posted here with permission of the IEEE. Such permission of the IEEE does not in any
way imply IEEE endorsement of any of ECESS's products or services. Internal or personal use of this material is permitted. However, permission to
reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from
the IEEE by writing to pubs-permissions@ieee.org . By choosing to view this document, you agree to all provisions of the copyright laws protecting it.
EVALUATION OF PITCH DETECTION ALGORITHMS UNDER REAL CONDITIONS
Iker Luengo, Ibon Saratxaga, Eva Navas, Inmaculada Hernáez, Jon Sanchez, Iñaki Sainz
{iker, ibon, eva, inma, ion, inaki}@bips.bi.ehu.es
Aholab-Signal Processing Laboratory – Faculty of Engineering
University of the Basque Country
Urkijo zum. z/g 48013. Bilbao-Spain.
ABSTRACT
A novel pitch detection algorithm based on classical cepstrum
calculation followed by dynamic programming is presented in this
paper. The algorithm has been evaluated with a 60-minute database
containing 60 speakers and different recording conditions and
environments. A second reference database has also been used. In
addition, the performance of four popular pitch detection
algorithms (PDAs) has been evaluated with the same databases. The
results show the good performance of the described algorithm in
noisy conditions. Furthermore, the paper is a first initiative to
evaluate widely used PDAs over an extensive and realistic
database.
Index Terms— Speech analysis, Pitch detection
1. INTRODUCTION
Pitch detection and marking is a recurrent topic in the literature
of the speech research community. The interest arises
naturally from the enormous range of applications and
technologies that rely on a pitch detection algorithm (PDA).
Precise calculation of the fundamental frequency of the speech
signal has proven to be a basic task in almost all areas of
speech research, from traditional areas such as speech coding to
more recent ones such as novel speech synthesis
techniques or the characterization of a speaker's emotional state.
Improving on the first proposed methods based on the
periodicity of the speech spectrum at voiced segments [1], a great
variety of algorithms has been proposed (see [2] for a review of
classic methods). Some of them are very popular, either because
they are publicly available or because they come packaged with
some software analysis tool [3][4][5][6]. Considering that many
users of these packages are not necessarily part of the speech
research community (linguists, educators, speech therapists…),
setting references and standards for the evaluation of their quality
becomes a necessary task.
An ideal pitch detector should perform well under any
reasonable noise or bandwidth condition. In that respect, several
robust pitch detection algorithms have been proposed that are
claimed to perform well under different noise conditions [7][8].
However, to date no work has been published describing the
evaluation of any algorithm with a significant amount of data.
Some papers describe the performance of the algorithm only in a
qualitative way, using a reduced set of signals and speakers. Others
use ad hoc small to medium-sized databases (a few minutes long) with
very few speakers (2 to 5). In recent years, two speech databases
have been used as reference for evaluation, mainly due to their
public availability: the CSTR database and the Keele Pitch
Reference database [3][5][9][10][11][12]. The first is about 5
minutes long and contains data from two speakers [9]. The second
is about 10 minutes long with speech from five males, five females
and five children [13].
This paper presents a novel pitch detection algorithm based on
a classic representation (the cepstrum coefficients) followed by
dynamic programming. We also present its evaluation, comparing
its performance with that of four other popular algorithms, using the
CSTR database and a 60-minute database recorded by 60
speakers over four different recording channels [14].
Section 2 of this paper describes the cepstrum-based detection
algorithm and the conditions used to select the best path by
dynamic programming. Section 3 presents the experiments and
their results. Conclusions are drawn in Section 4.
2. CEPSTRUM-BASED PITCH DETECTION ALGORITHM
The proposed algorithm, called CDP, is based on cepstrum
calculation followed by a dynamic programming module. After
windowing the input signal, a set of pitch candidates is generated.
This set is used by the dynamic programming algorithm to select
the best pitch curve. As a final step, this curve is smoothed.
2.1. Selection of F0 candidates
The input signal is windowed with a 58 ms Hamming window.
The window length has been chosen so as to cover at least two
pitch periods in the minimum pitch case. Pitch
values in the range 35 Hz to 500 Hz have been considered. This
range also applies to the selection of the cepstrum coefficients, in
such a way that only coefficients with indexes in the
$[i_{\max}, i_{\min}]$ range will be considered:

$$i_{\max} = \left\lfloor \frac{F_s}{f_{\max}} \right\rfloor, \qquad i_{\min} = \left\lfloor \frac{F_s}{f_{\min}} \right\rfloor \tag{1}$$

with $F_s$ being the sampling frequency. The indexes are calculated
as the closest integer to the bracketed expression.
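As a quick check of the figures above, the following minimal sketch computes the window length in samples and the index range of equation (1). The sampling frequency of 16 kHz is an assumption made only for illustration (the text does not state it here), and the function names are hypothetical.

```python
import math


def window_length_samples(fs_hz: float, win_ms: float = 58.0) -> int:
    """Number of samples in the analysis window (58 ms in the text)."""
    return int(math.floor(fs_hz * win_ms / 1000.0 + 0.5))


def cepstral_index_range(fs_hz: float,
                         f_min_hz: float = 35.0,
                         f_max_hz: float = 500.0) -> tuple:
    """Index range [i_max, i_min] of equation (1).

    Each index is the closest integer to Fs / f, as stated in the text.
    floor(x + 0.5) is used instead of round() to avoid Python's
    round-half-to-even behaviour.
    """
    i_max = int(math.floor(fs_hz / f_max_hz + 0.5))  # highest pitch, smallest index
    i_min = int(math.floor(fs_hz / f_min_hz + 0.5))  # lowest pitch, largest index
    return i_max, i_min


# Assumed sampling frequency, for illustration only.
FS_HZ = 16000.0
i_max, i_min = cepstral_index_range(FS_HZ)
n_win = window_length_samples(FS_HZ)

# Two periods of the lowest pitch (35 Hz), in milliseconds.
two_periods_ms = 2.0 * 1000.0 / 35.0
```

Under this assumed sampling frequency the search covers cepstral indexes 32 to 457, and two periods of a 35 Hz pitch last about 57.1 ms, which fits inside the 58 ms window.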
Before searching for the maximum coefficient, whose index is
expected to give the pitch value, the coefficients $c_i$ are
normalized by the mean value $\bar{c}$ inside the considered frame,
giving the normalized coefficients $c'_i$:

$$c'_i = \frac{c_i}{\bar{c}} \tag{2}$$
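The candidate generation step can be sketched as follows, assuming NumPy. This is an illustration, not the authors' implementation: the function names are hypothetical, the mean in equation (2) is assumed to be taken over the coefficients in the search range, and the per-frame maximum shown here is only the naive choice that the dynamic programming stage of CDP is meant to improve on.

```python
import numpy as np


def real_cepstrum(frame, eps: float = 1e-12) -> np.ndarray:
    """Real cepstrum of one frame after Hamming windowing."""
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hamming(len(frame))
    # eps avoids log(0) on empty spectral bins.
    log_mag = np.log(np.abs(np.fft.rfft(windowed)) + eps)
    return np.fft.irfft(log_mag, n=len(frame))


def normalized_coefficients(cep, i_max: int, i_min: int) -> np.ndarray:
    """Equation (2): c'_i = c_i / mean(c) for i in [i_max, i_min].

    Assumption: the mean is computed over the search range only.
    """
    segment = np.asarray(cep, dtype=float)[i_max:i_min + 1]
    mean = segment.mean()
    if mean == 0.0:
        raise ValueError("mean coefficient is zero; cannot normalize")
    return segment / mean


def strongest_candidate(cep, fs_hz: float, i_max: int, i_min: int):
    """Naive per-frame pick: index of the largest raw coefficient.

    Returns (f0 estimate in Hz, normalized value of that coefficient).
    The peak is located on the raw coefficients so that a negative
    mean cannot flip the ordering.
    """
    segment = np.asarray(cep, dtype=float)[i_max:i_min + 1]
    offset = int(np.argmax(segment))
    norm = normalized_coefficients(cep, i_max, i_min)
    return fs_hz / (i_max + offset), float(norm[offset])
```

For a frame `x` sampled at `fs`, a call such as `strongest_candidate(real_cepstrum(x), fs, i_max, i_min)` returns one candidate for that frame. Keeping several of the largest normalized coefficients per frame instead would give the candidate set from which a best path can later be selected.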