JOINT ESTIMATION OF FORMANT TRAJECTORIES VIA SPECTRO-TEMPORAL SMOOTHING AND BAYESIAN TECHNIQUES

C. Gläser, M. Heckmann, F. Joublin, C. Goerick
Honda Research Institute Europe GmbH
Carl-Legien-Strasse 30, D-63073 Offenbach/Main, Germany
{claudius.glaeser, martin.heckmann, frank.joublin, christian.goerick}@honda-ri.de

H. M. Groß
Technical University of Ilmenau, Neuroinformatics and Cognitive Robotics
PO Box 10 05 65, D-98693 Ilmenau, Germany
horst-michael.gross@tu-ilmenau.de

ABSTRACT

We propose a method for the joint estimation of formant trajectories from spectrograms. Formants are enhanced in the spectrograms obtained from the application of a Gammatone filterbank via a smoothing along the frequency axis. In contrast to previously published approaches, the tracking algorithm relies on the joint distribution of the formants rather than on independent tracker instances. More precisely, Bayesian mixture filtering in conjunction with adaptive frequency range segmentation as well as Bayesian smoothing are used. The algorithm was evaluated on a publicly available database containing hand-labeled formant tracks. Experimental results show a significant performance improvement compared to a state-of-the-art approach.

Index Terms — Speech processing, Bayes procedures, Tracking, Adaptive estimation, Dynamic programming

1. INTRODUCTION

Communication via speech is a key aspect of human-machine interaction. Current speech recognition systems work well in idealized environments, but their performance drops significantly when environments are characterized by variability. Recognition becomes particularly difficult for speech degraded by large speaker-microphone distances and noise, as in the interaction with a humanoid robot like Honda's ASIMO. In contrast, humans perform remarkably well under such conditions.
Designing a system based on findings about the functional principles of the human auditory system may point a way to overcoming the problems of state-of-the-art systems. It is well known that human speech perception relies to a large extent on formant trajectories. Consequently, we propose a method for extracting formants which might ultimately be more robust to distortions than common feature extraction methods. As shown in Fig. 1, the method comprises a biologically inspired preprocessing stage for the enhancement of formants in spectrograms and a subsequent noise-robust tracking via a Bayesian framework in order to extract formant trajectories. The results obtained on a large database with hand-labeled formant trajectories, given in the final part of the paper, show a significant improvement compared to a state-of-the-art approach.

[Fig. 1. The architecture of the formant estimation system: spectrogram computation, formant enhancement, adaptive frequency range segmentation, Bayesian mixture filtering, and Bayesian smoothing.]

2. FORMANT ENHANCEMENT

First, the speech signal is transformed into the spectro-temporal domain by the application of a Gammatone filterbank with 128 channels covering the frequency range from 80 Hz to 8 kHz. The envelope of the filter responses is then calculated via rectification and low-pass filtering.

According to Fant's linear source-filter theory, speech is produced by a non-linear volume velocity source exciting a time-varying linear filter as well as radiation components. Thus, eliminating the spectral influence of excitation and radiation will significantly improve the extraction of formants from spectrograms.

At least for voiced sounds, the primary source is generated by the vibrating vocal folds, which convert the subpharyngeal steady airflow into a quasi-periodic train of flow pulses.
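The envelope computation at the start of this section (rectification followed by low-pass filtering of each Gammatone channel response) can be sketched as follows for a single channel. The paper does not specify the rectification type or the low-pass cutoff, so half-wave rectification and a 50 Hz second-order Butterworth filter are assumptions here:

```python
import numpy as np
from scipy.signal import butter, lfilter

def channel_envelope(x, fs, cutoff=50.0):
    """Envelope of one filterbank channel: half-wave rectification
    followed by low-pass filtering (cutoff is an assumed value)."""
    rectified = np.maximum(x, 0.0)           # half-wave rectification
    b, a = butter(2, cutoff / (fs / 2))      # 2nd-order Butterworth low-pass
    return lfilter(b, a, rectified)

# Toy stand-in for a Gammatone channel output:
# a 500 Hz carrier with slow (4 Hz) amplitude modulation.
fs = 16000
t = np.arange(0, 0.2, 1 / fs)
x = (1 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 500 * t)
env = channel_envelope(x, fs)                # slowly varying, carrier removed
```

In a full front-end this would be applied to each of the 128 filterbank channels, yielding the spectro-temporal representation on which the subsequent formant enhancement operates.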
In the case of the most common modal or creaky phonation, a second-order low-pass filter can approximate the glottal flow spectrum [1], [2]. Hence, the glottal spectrum shows a monotonically decreasing characteristic of -12 dB/oct.

The principal opening from which speech is radiated is the mouth. A first-order high-pass filter approximates the relationship between lip volume velocity and the sound pressure received at some distance [3]. Combining the -12 dB/oct glottal characteristic with the +6 dB/oct of the radiation, we therefore model the joint spectral influence of excitation and radiation as a drop of -6 dB/oct and correct it via inverse filtering.

IV-477   1-4244-0728-1/07/$20.00 ©2007 IEEE   ICASSP 2007
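A common way to realize such a -6 dB/oct tilt correction in discrete time is a first-order pre-emphasis filter, whose magnitude response rises by roughly +6 dB/oct at low frequencies. The paper does not give its implementation, so the filter form and the coefficient alpha = 0.97 below are assumptions, not taken from the source:

```python
import numpy as np

def preemphasis(x, alpha=0.97):
    """First-order pre-emphasis y[n] = x[n] - alpha * x[n-1].
    Approximately compensates the net -6 dB/oct spectral tilt of
    glottal excitation (-12 dB/oct) plus lip radiation (+6 dB/oct).
    alpha = 0.97 is a conventional choice, not from the paper."""
    y = np.empty_like(x, dtype=float)
    y[0] = x[0]
    y[1:] = x[1:] - alpha * x[:-1]
    return y

# Check the slope: magnitude response at two frequencies an octave apart.
fs = 16000
alpha = 0.97
w = 2 * np.pi * np.array([250.0, 500.0]) / fs
H = np.abs(1 - alpha * np.exp(-1j * w))      # |H(e^{jw})| of 1 - alpha*z^-1
gain_db = 20 * np.log10(H[1] / H[0])         # close to +6 dB per octave

y = preemphasis(np.ones(4))                  # a constant is strongly attenuated
```

Applying this filter before the Gammatone analysis (or, equivalently, dividing the spectrogram by the model spectrum) flattens the excitation/radiation tilt so that the remaining spectral peaks reflect the vocal tract resonances.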