TOWARDS USING HIERARCHICAL POSTERIORS FOR FLEXIBLE AUTOMATIC SPEECH RECOGNITION SYSTEMS

Hervé Bourlard, Samy Bengio, Mathew Magimai Doss, Qifeng Zhu, Bertrand Mesot, Nelson Morgan

IDIAP Research Institute, P.O. Box 592, 1920 Martigny, Switzerland
International Computer Science Institute, Berkeley, CA 94704, USA
email: {bourlard, bengio, mathew, bmesot}@idiap.ch, qifeng@icsi.berkeley.edu

ABSTRACT

Local state (or phone) posterior probabilities are often investigated as local classifiers (e.g., hybrid HMM/ANN systems) or as transformed acoustic features (e.g., "TANDEM") towards improved speech recognition systems. In this paper, we present initial results towards boosting these approaches by improving the local state, phone, or word posterior estimates, using all possible acoustic information (as available in the whole utterance), as well as possible prior information (such as topological constraints). Furthermore, this approach results in a family of new HMM-based systems, where only (local and global) posterior probabilities are used, while also providing a new, principled approach towards a hierarchical use/integration of these posteriors, from the frame level up to the sentence level. Initial results on several speech (as well as other multimodal) tasks resulted in significant improvements. In this paper, we present recognition results on Numbers'95 and on a reduced-vocabulary version (1000 words) of the DARPA Conversational Telephone Speech-to-text (CTS) task.

1. INTRODUCTION

Over the last 10-15 years, posterior probabilities have been increasingly explored as a possible way to improve automatic speech recognition (ASR) systems, initially with the goal of providing more discriminant training and local HMM probabilities, and more recently as compact features (possibly resulting from the merging of several features).
Both approaches are certainly valid and have shown some success, e.g., in the case of hybrid HMM/ANN systems (where posteriors are used as local classifiers) or in the case of "TANDEM" systems (where posteriors are used as features fed into standard HMMs). However, their efficacy strongly depends on the quality of these posterior estimates, usually based on statistical tools such as multilayer perceptrons (MLP) or normalized Gaussian mixture models (GMM), possibly exploiting some contextual acoustic input.

In this paper, we present the results of some initial investigation towards new ways to improve the estimation of local posteriors (hence the resulting performance) by using the so-called "gamma" recursion (as usually referred to in the HMM formalism) to generate local posteriors taking into account all the acoustic information available in each utterance, possibly complemented by additional prior information. Interestingly, using these "state/phone gammas" not only yields improved recognition performance, as shown here on Numbers'95 and CTS, but also opens up several innovative and principled approaches towards the hierarchical use of posterior probabilities from the frame level up to the sentence level.

Finally, we believe that what is presented in the present paper provides a general framework for a theory of using posteriors as local measures (classifiers) or features in hierarchical systems, with the additional advantage of being able to accommodate, and possibly take advantage of, larger acoustic context, as well as specifically designed prior knowledge such as topological constraints.

(Footnote: This project was jointly funded by the DARPA EARS project, the European AMI project, as well as the IM2 Swiss National Center of Competence in Research, which all made this tight collaboration between IDIAP and ICSI possible.)
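As a reminder, the "gamma" recursion mentioned above is the standard forward-backward result for the per-frame state posterior conditioned on the whole utterance (the alpha/beta notation below is the usual HMM one, stated here for the reader's convenience rather than taken from this paper; q_t denotes the HMM state at time t):

\[
\gamma(i,t) \;=\; P(q_t = i \mid x_1^T) \;=\; \frac{\alpha(i,t)\,\beta(i,t)}{\sum_{j} \alpha(j,t)\,\beta(j,t)},
\]

where \(\alpha(i,t) = p(x_1^t, q_t = i)\) is the forward variable and \(\beta(i,t) = p(x_{t+1}^T \mid q_t = i)\) the backward variable. In contrast to a purely local posterior \(P(q_t = i \mid x_t)\), \(\gamma(i,t)\) conditions on all acoustic evidence in the utterance.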
The notation used in this paper will be the following. Let X = {x_1, ..., x_T} be an acoustic observation sequence. Let q_t be an HMM state at time t, whose value can range from 1 to N_q (total number of possible HMM states); p_t be a phoneme at time t, whose value ranges from 1 to N_p (total number of phones); and w_t be a word at time t, whose value ranges from 1 to N_w (total number of words). Events "q_t = i", "p_t = j" and "w_t = k" will, respectively, often be written as q_t^i, p_t^j and w_t^k.
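To make the gamma recursion from the introduction concrete, the following is a minimal forward-backward sketch in this notation (toy code, not the authors' implementation; the function name, array layout, and per-frame scaling scheme are illustrative assumptions):

```python
import numpy as np

def gamma_posteriors(b, A, pi):
    """Compute gamma(i, t) = P(q_t = i | x_1..x_T) via forward-backward.

    b  : (T, N) emission likelihoods, b[t, i] = p(x_t | q_t = i)
    A  : (N, N) state transition matrix, A[i, j] = P(q_{t+1}=j | q_t=i)
    pi : (N,)   initial state probabilities
    Per-frame scaling of alpha keeps values in range on long utterances.
    """
    T, N = b.shape
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    c = np.zeros(T)  # scaling factors

    # Forward pass (scaled): alpha[t, i] ~ p(x_1..x_t, q_t = i)
    alpha[0] = pi * b[0]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * b[t]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]

    # Backward pass (same scaling): beta[t, i] ~ p(x_{t+1}..x_T | q_t = i)
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (b[t + 1] * beta[t + 1])) / c[t + 1]

    # Gamma: pointwise product, renormalized per frame
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```

Each row of the returned array is a proper posterior distribution over the N states for one frame, conditioned on the entire observation sequence rather than on x_t alone.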