JOINT ESTIMATION OF FORMANT TRAJECTORIES VIA SPECTRO-TEMPORAL SMOOTHING AND BAYESIAN TECHNIQUES

C. Gläser, M. Heckmann, F. Joublin, C. Goerick
Honda Research Institute Europe GmbH
Carl-Legien-Strasse 30, D-63073 Offenbach/Main, Germany
{claudius.glaeser, martin.heckmann, frank.joublin, christian.goerick}@honda-ri.de

H. M. Groß
Technical University of Ilmenau, Neuroinformatics and Cognitive Robotics
PO Box 10 05 65, D-98693 Ilmenau, Germany
horst-michael.gross@tu-ilmenau.de

ABSTRACT

We propose a method for the joint estimation of formant trajectories from spectrograms. Formants are enhanced in the spectrograms obtained from the application of a Gammatone filterbank via a smoothing along the frequency axis. In contrast to previously published approaches, the tracking algorithm relies on the joint distribution of the formants rather than on independent tracker instances. More precisely, Bayesian mixture filtering in conjunction with adaptive frequency range segmentation as well as Bayesian smoothing are used. The algorithm was evaluated on a publicly available database containing hand-labeled formant tracks. Experimental results show a significant performance improvement compared to a state-of-the-art approach.

Index Terms — Speech processing, Bayes procedures, Tracking, Adaptive estimation, Dynamic programming

1. INTRODUCTION

Communication via speech is a key aspect of human-machine interaction. Current speech recognition systems work well in idealized environments, but their performance drops significantly when environments are characterized by variability. Recognition becomes particularly difficult for speech degraded by large speaker-microphone distances and noise, as in the interaction with a humanoid robot like Honda's ASIMO. In contrast, humans perform remarkably well under such conditions.
Designing a system based on findings about the functional principles of the human auditory system may point a way to overcoming the problems of state-of-the-art systems. It is well known that human speech perception relies to a large extent on formant trajectories. Consequently, we propose a method for extracting formants which might ultimately be more robust to distortions than common feature extraction methods. As shown in Fig. 1, the method comprises a biologically inspired preprocessing stage for the enhancement of formants in spectrograms and a subsequent noise-robust tracking via a Bayesian framework in order to extract formant trajectories. The results obtained on a large database with hand-labeled formant trajectories, given in the final part of the paper, show a significant improvement compared to a state-of-the-art approach.

[Fig. 1. The architecture of the formant estimation system: spectrogram computation, formant enhancement, adaptive frequency range segmentation, Bayesian mixture filtering, and Bayesian smoothing.]

2. FORMANT ENHANCEMENT

First, the speech signal is transformed into the spectro-temporal domain by the application of a Gammatone filterbank with 128 channels covering the frequency range from 80 Hz to 8 kHz. The envelope of the filter responses is then calculated via rectification and low-pass filtering.

According to Fant's linear source-filter theory, speech is produced by a non-linear volume velocity source exciting a time-varying linear filter as well as radiation components. Thus, eliminating the spectral influence of excitation and radiation will significantly improve the extraction of formants from spectrograms.

At least for voiced sounds, the primary source is generated by the vibrating vocal folds, which convert the subpharyngeal steady airflow into a quasi-periodic train of flow pulses.
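The envelope computation at the start of this section (rectification followed by low-pass filtering of each Gammatone channel response) can be sketched as follows for a single channel. The paper does not specify the rectification type or the low-pass cutoff, so half-wave rectification and a 50 Hz second-order Butterworth filter are assumptions here:

```python
import numpy as np
from scipy.signal import butter, lfilter

def channel_envelope(x, fs, cutoff=50.0):
    """Envelope of one filterbank channel: half-wave rectification
    followed by low-pass filtering (cutoff is an assumed value)."""
    rectified = np.maximum(x, 0.0)           # half-wave rectification
    b, a = butter(2, cutoff / (fs / 2))      # 2nd-order Butterworth low-pass
    return lfilter(b, a, rectified)

# Toy stand-in for a Gammatone channel output:
# a 500 Hz carrier with slow (4 Hz) amplitude modulation.
fs = 16000
t = np.arange(0, 0.2, 1 / fs)
x = (1 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 500 * t)
env = channel_envelope(x, fs)                # slowly varying, carrier removed
```

In a full front-end this would be applied to each of the 128 filterbank channels, yielding the spectro-temporal representation on which the subsequent formant enhancement operates.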
In the case of the most common modal or creaky phonation, a second-order low-pass filter can approximate the glottal flow spectrum [1], [2]. Hence, the glottal spectrum shows a monotonically decreasing characteristic of -12 dB/oct.

The principal opening from which speech is radiated is the mouth. A first-order high-pass filter approximates the relationship between lip volume velocity and the sound pressure received at some distance [3]. Combining the -12 dB/oct glottal characteristic with the +6 dB/oct of the radiation, we therefore model the joint spectral influence of excitation and radiation as a drop of -6 dB/oct and correct it via inverse filtering.

IV-477   1-4244-0728-1/07/$20.00 ©2007 IEEE   ICASSP 2007
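A common way to realize such a -6 dB/oct tilt correction in discrete time is a first-order pre-emphasis filter, whose magnitude response rises by roughly +6 dB/oct at low frequencies. The paper does not give its implementation, so the filter form and the coefficient alpha = 0.97 below are assumptions, not taken from the source:

```python
import numpy as np

def preemphasis(x, alpha=0.97):
    """First-order pre-emphasis y[n] = x[n] - alpha * x[n-1].
    Approximately compensates the net -6 dB/oct spectral tilt of
    glottal excitation (-12 dB/oct) plus lip radiation (+6 dB/oct).
    alpha = 0.97 is a conventional choice, not from the paper."""
    y = np.empty_like(x, dtype=float)
    y[0] = x[0]
    y[1:] = x[1:] - alpha * x[:-1]
    return y

# Check the slope: magnitude response at two frequencies an octave apart.
fs = 16000
alpha = 0.97
w = 2 * np.pi * np.array([250.0, 500.0]) / fs
H = np.abs(1 - alpha * np.exp(-1j * w))      # |H(e^{jw})| of 1 - alpha*z^-1
gain_db = 20 * np.log10(H[1] / H[0])         # close to +6 dB per octave

y = preemphasis(np.ones(4))                  # a constant is strongly attenuated
```

Applying this filter before the Gammatone analysis (or, equivalently, dividing the spectrogram by the model spectrum) flattens the excitation/radiation tilt so that the remaining spectral peaks reflect the vocal tract resonances.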