JOINT ESTIMATION OF FORMANT TRAJECTORIES VIA SPECTRO-TEMPORAL
SMOOTHING AND BAYESIAN TECHNIQUES
C. Gläser∗†, M. Heckmann∗, F. Joublin∗, C. Goerick∗, H. M. Groß†

∗Honda Research Institute Europe GmbH
Carl-Legien-Strasse 30, D-63073 Offenbach/Main, Germany
{claudius.glaeser, martin.heckmann, frank.joublin, christian.goerick}@honda-ri.de

†Technical University of Ilmenau
Neuroinformatics and Cognitive Robotics
PO Box 10 05 65, D-98693 Ilmenau, Germany
horst-michael.gross@tu-ilmenau.de
ABSTRACT
We propose a method for the joint estimation of formant tra-
jectories from spectrograms. Formants are enhanced in the
spectrograms obtained from the application of a Gammatone
filterbank via smoothing along the frequency axis. In contrast to previously published approaches, the tracking algorithm relies on the joint distribution of the formants rather than on independent tracker instances. More precisely,
Bayesian mixture filtering in conjunction with adaptive fre-
quency range segmentation as well as Bayesian smoothing
are used. The algorithm was evaluated on a publicly avail-
able database containing hand-labeled formant tracks. Exper-
imental results show a significant performance improvement
compared to a state-of-the-art approach.
Index Terms— Speech processing, Bayes procedures,
Tracking, Adaptive estimation, Dynamic programming
1. INTRODUCTION
Communication via speech is a key aspect in human-machine
interaction. Current speech recognition systems work well in idealized environments, but their performance drops significantly in acoustically variable conditions. Recognition becomes particularly difficult for speech degraded by large speaker-microphone distances and noise, as in the interaction with a humanoid robot such as Honda’s ASIMO.
In contrast, humans perform remarkably well under such conditions. Designing a system based on findings about the functional principles of the human auditory system may offer a way to overcome the problems of state-of-the-art systems. It is well known that human speech perception relies to a large extent on formant trajectories. Consequently, we propose a method for extracting formants which might ultimately
be more robust to distortions than common feature extraction
methods. As shown in Fig. 1, the method involves a biologi-
cally inspired preprocessing for the enhancement of formants
in spectrograms and subsequent noise-robust tracking via a
Bayesian framework in order to extract formant trajectories.
The results obtained on a large database with hand-labeled
formant trajectories given in the final part of the paper show
a significant improvement compared to a state-of-the-art approach.
[Fig. 1 block diagram: Speech → Formant enhancement → Spectrogram → Bayesian mixture filtering with adaptive frequency range segmentation → Bayesian smoothing (one instance per formant) → Formants]

Fig. 1. The architecture of the formant estimation system.
2. FORMANT ENHANCEMENT
First, the speech signal is transformed into the spectro-temporal domain by applying a Gammatone filterbank with 128 channels covering the frequency range from 80 Hz to 8 kHz. The envelope of each filter response is then calculated via rectification and low-pass filtering.
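The front end described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the ERB-rate channel spacing, the 4th-order Gammatone, the 50 ms impulse-response length, and the 100 Hz envelope-smoothing cutoff are all assumptions, since the text only specifies 128 channels spanning 80 Hz to 8 kHz and envelope extraction via rectification and low-pass filtering.

```python
import numpy as np
from scipy.signal import butter, lfilter, fftconvolve

def erb_space(f_low, f_high, n):
    """Center frequencies equally spaced on the ERB-rate scale (Glasberg/Moore constants)."""
    ear_q, min_bw = 9.26449, 24.7
    lo = ear_q * min_bw
    return -lo + (f_high + lo) * np.exp(
        np.arange(1, n + 1) * (np.log(f_low + lo) - np.log(f_high + lo)) / n)

def gammatone_ir(fc, fs, order=4, dur=0.05):
    """Impulse response of a 4th-order Gammatone filter at center frequency fc."""
    t = np.arange(int(dur * fs)) / fs
    b = 1.019 * (24.7 + fc / 9.26449)            # bandwidth from the ERB at fc
    return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def envelope_spectrogram(x, fs, n_chan=128, f_low=80.0, f_high=8000.0, lp_cut=100.0):
    """Filter with the Gammatone bank, half-wave rectify, and low-pass each channel."""
    b_lp, a_lp = butter(2, lp_cut / (fs / 2))
    rows = []
    for fc in sorted(erb_space(f_low, f_high, n_chan)):
        y = fftconvolve(x, gammatone_ir(fc, fs), mode="same")
        rows.append(lfilter(b_lp, a_lp, np.maximum(y, 0.0)))  # rectify + smooth
    return np.asarray(rows)                      # shape: (n_chan, len(x))

# Toy usage: a 500 Hz tone should excite the channels centered near 500 Hz most.
fs = 16000
x = np.sin(2 * np.pi * 500 * np.arange(fs // 4) / fs)
S = envelope_spectrogram(x, fs)
```

For a pure 500 Hz tone, the channel with the largest mean envelope should sit near 500 Hz, giving a quick sanity check of the filterbank layout.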
According to Fant’s linear source-filter theory, speech is
produced by a non-linear volume velocity source exciting a
time-varying linear filter as well as radiation components.
Thus, eliminating the spectral influence of excitation and ra-
diation will significantly improve the extraction of formants
from spectrograms.
At least for voiced sounds, the primary source is gener-
ated by the vibrating vocal folds converting the subpharyn-
geal steady airflow into a quasi-periodic train of flow pulses.
In the case of the most common modal or creaky phonation, a second-order low-pass filter can approximate the glottal flow spectrum [1], [2]. Hence, the glottal spectrum exhibits a monotonically decreasing characteristic of -12 dB/oct.
The principal opening from which speech is radiated is the mouth. A first-order high-pass filter approximates the relationship between lip volume velocity and the sound pressure received at some distance [3]. Combining the -12 dB/oct glottal characteristic with this +6 dB/oct radiation characteristic, we therefore model the overall spectral tilt of the voiced excitation and radiation as a drop of -6 dB/oct and correct for it via inverse filtering.
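A -6 dB/oct tilt can be compensated by a standard first-order pre-emphasis filter, y[n] = x[n] - a·x[n-1], whose magnitude response rises by roughly +6 dB per octave. The sketch below illustrates this; the coefficient a = 0.97 and the 16 kHz sampling rate are assumptions, as the paper does not specify its inverse filter.

```python
import numpy as np
from scipy.signal import freqz, lfilter

def pre_emphasize(x, a=0.97):
    """First-order high-pass y[n] = x[n] - a*x[n-1], undoing a -6 dB/oct tilt."""
    return lfilter([1.0, -a], [1.0], x)

# Check the slope over one octave (1 kHz -> 2 kHz at fs = 16 kHz).
fs = 16000
w, h = freqz([1.0, -0.97], 1, worN=[2 * np.pi * 1000 / fs, 2 * np.pi * 2000 / fs])
gain_db = 20 * np.log10(np.abs(h))
octave_slope = gain_db[1] - gain_db[0]  # close to the ideal +6 dB/oct
```

Applying `pre_emphasize` to the speech signal before the filterbank (or, equivalently, tilting the spectrogram) flattens the excitation-plus-radiation slope so that the formant resonances stand out.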
1424407281/07/$20.00 ©2007 IEEE ICASSP 2007