Advanced Acoustic Modeling with the Hybrid HMM/BN Framework

Konstantin Markov, Satoshi Nakamura
Spoken Language Translation Research Labs, Advanced Telecommunications Research Institute International, Kyoto, Japan
{konstantin.markov,satoshi.nakamura}@atr.jp

Abstract

Most current state-of-the-art speech recognition systems are based on HMMs, which usually use a mixture of Gaussian functions as the state probability distribution model. It is common practice to use the EM algorithm for Gaussian mixture parameter learning. In this case, the learning is done in a "blind", data-driven way, without taking into account how the speech signal has been produced and which factors it depends on. In this paper, we describe the hybrid HMM/BN acoustic modeling framework, where, in contrast to the conventional mixture of Gaussians, the HMM state probability distribution is modeled by a Bayesian Network, hence the name HMM/BN. Temporal speech characteristics are still governed by the HMM state transitions, but the state output likelihood is inferred from the BN. This allows for very flexible and consistent models of the state probability distributions, which can easily integrate different speech parameterizations. The BN can represent various speech features and environment conditions and their underlying dependencies. We show that the conventional HMM is a special case of the HMM/BN model, which we therefore regard as a generalization of the HMM. HMM/BN parameter learning is based on the Viterbi training paradigm and consists of two alternating steps: BN training and HMM transition probability update. For recognition, in some cases, BN inference is computationally equivalent to a mixture of Gaussians, which allows the HMM/BN model to be used in existing HMM decoders. We present several examples of HMM/BN model application in speech recognition systems.
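The two alternating training steps mentioned above can be illustrated with a toy numerical sketch. This is not the authors' implementation: it uses a one-dimensional, two-state, left-to-right HMM with single-Gaussian state models standing in for the BN, and all numbers are illustrative assumptions. It performs one Viterbi alignment, then re-trains the state output models on the aligned frames (the "BN training" step) and re-estimates the transition probabilities from the aligned path.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gauss(x, mean, var):
    # Log-density of a 1-D Gaussian.
    return -0.5 * ((x - mean) ** 2 / var + np.log(2 * np.pi * var))

def viterbi(obs, log_trans, means, variances):
    """Best state path for a toy left-to-right HMM starting in state 0."""
    T, S = len(obs), len(means)
    delta = np.full((T, S), -np.inf)
    psi = np.zeros((T, S), dtype=int)
    delta[0, 0] = log_gauss(obs[0], means[0], variances[0])
    for t in range(1, T):
        for j in range(S):
            scores = delta[t - 1] + log_trans[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] + log_gauss(obs[t], means[j], variances[j])
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Synthetic data: 20 frames around 0.0 followed by 20 frames around 3.0.
obs = np.concatenate([rng.normal(0.0, 1.0, 20), rng.normal(3.0, 1.0, 20)])
log_trans = np.log(np.array([[0.9, 0.1], [1e-9, 1.0]]))
means, variances = np.array([-0.5, 2.0]), np.array([1.0, 1.0])

# One alternation of the two training steps.
path = np.array(viterbi(obs, log_trans, means, variances))
for s in (0, 1):  # step 1: re-train state output models on aligned frames
    seg = obs[path == s]
    if len(seg):
        means[s], variances[s] = seg.mean(), max(seg.var(), 1e-3)
counts = np.zeros((2, 2))  # step 2: update transition probabilities from the path
for a, b in zip(path[:-1], path[1:]):
    counts[a, b] += 1
log_trans = np.log(counts / counts.sum(axis=1, keepdims=True) + 1e-12)
```

In the full framework the per-state re-training step would train the BN attached to each state rather than a single Gaussian, but the alternation itself has the same shape.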
Evaluations under various conditions and for different tasks showed that the HMM/BN model gives consistently better performance than the standard mixture-of-Gaussians HMM.

1. Introduction

For many years, since the introduction of the HMM for speech recognition [1, 2], the observation conditional distribution P(x|Q) for each state Q has most often been modeled by a mixture of parametric probability density functions (pdf). Gaussian as well as Laplacian pdfs are commonly used for this purpose. Later, hybrid HMM/NN systems were proposed [3], where Neural Networks are used to estimate HMM state likelihoods given the input observation. In most cases, features extracted from the speech spectrum form these observations. However, research in speech recognition has shown that using only these features is not enough to achieve high system performance. Thus, many researchers have tried to include additional features representing other knowledge in their HMM systems. For example, in [4] a multi-space probability distribution is proposed for modeling additional pitch information. But in almost every case, a different approach is taken depending on the properties of the additional feature. There is no common, sufficiently flexible framework to deal with this problem.

Recently, Bayesian Networks (BN) have attracted researchers' attention as an alternative modeling tool. BNs are well known and studied in the Artificial Intelligence research field, but in speech recognition they are a relatively new research topic. Bayesian Networks can model complex joint probability distributions of many different (discrete and/or continuous) random variables in a well-structured and easy-to-represent way. Especially suitable for modeling temporal speech characteristics are the Dynamic BNs (DBN) [5]. In some of the first reports on DBNs in speech recognition, they were used as word models in isolated word recognition tasks [6, 7].
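The property that makes Bayesian Networks convenient here is that the joint distribution factorizes according to the network structure: each variable contributes one conditional distribution given its parents. A minimal sketch with three hypothetical discrete variables (the structure and probabilities below are illustrative, not from the paper):

```python
# A tiny discrete Bayesian network with edges A -> B and A -> C, so
# P(a, b, c) = P(a) * P(b | a) * P(c | a).
p_a = {0: 0.7, 1: 0.3}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}
p_c_given_a = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.5, 1: 0.5}}

def joint(a, b, c):
    # Product of the conditional probability tables along the DAG.
    return p_a[a] * p_b_given_a[a][b] * p_c_given_a[a][c]

# Inference by marginalization: P(b = 1), summing out a and c.
p_b1 = sum(joint(a, 1, c) for a in (0, 1) for c in (0, 1))
print(p_b1)  # 0.7 * 0.1 + 0.3 * 0.6 = 0.25
```

The same chain-rule factorization scales to mixed discrete/continuous networks, which is what lets a state BN combine spectral features with auxiliary variables.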
In these works, DBNs are regarded as a generalization of the HMM which, in addition to speech spectral information, can easily incorporate additional knowledge, such as articulatory features, sub-band correlation, speaking style, etc. In [8], acoustic features are easily supplemented with pitch information within the framework of DBN. Another advantage of Bayesian Networks is that additional features which are difficult to estimate reliably during recognition may be left hidden, i.e. unobservable. Despite these attractive properties of BNs, their application in speech recognition is still limited to small, isolated word recognition tasks. The reason is that the existing algorithms for BN parameter learning and inference are not practically suitable for continuous speech recognition (CSR), and especially for large-vocabulary CSR tasks. Although an extension of the DBN word model allowing recognition of continuously spoken digits was reported in [9], increasing the task vocabulary even to a few hundred words would make the task intractable.

The method we describe in this paper aims at utilizing the advantages of both HMM and BN while being free from

SPECOM'2004: 9th Conference on Speech and Computer, St. Petersburg, Russia, September 20-22, 2004. ISCA Archive: http://www.isca-speech.org/archive
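The point that unreliable auxiliary features may be left hidden can be sketched numerically. Assume a hypothetical state BN with one auxiliary discrete variable E (say, an environment condition) whose value selects a conditional Gaussian over the acoustic feature; the variable name and all numbers below are illustrative assumptions. When E is observed we use the matching conditional; when it is hidden we marginalize, and the resulting state likelihood is computationally a mixture of Gaussians:

```python
import numpy as np

def gauss(x, mean, var):
    # Diagonal-covariance Gaussian density.
    d = x - mean
    return np.exp(-0.5 * np.sum(d * d / var)) / np.sqrt(np.prod(2 * np.pi * var))

def state_likelihood(x, p_e, means, variances, e=None):
    """Output likelihood of observation x at one HMM state whose BN
    contains an auxiliary discrete variable E.
    E observed: P(x | Q, e) = N(x; mean_e, var_e).
    E hidden:   P(x | Q) = sum_e P(e | Q) N(x; mean_e, var_e),
    i.e. a mixture of Gaussians with P(e | Q) as mixture weights."""
    if e is not None:
        return gauss(x, means[e], variances[e])
    return sum(p_e[k] * gauss(x, means[k], variances[k]) for k in range(len(p_e)))

# Toy 1-D example with two values of E.
x = np.array([0.5])
p_e = [0.7, 0.3]
means = [np.array([0.0]), np.array([2.0])]
variances = [np.array([1.0]), np.array([1.0])]
hidden = state_likelihood(x, p_e, means, variances)         # E unknown at decode time
observed = state_likelihood(x, p_e, means, variances, e=0)  # E known (condition 0)
```

Because the hidden-E case reduces to a weighted sum of Gaussians, such a state model can be evaluated by an unmodified GMM-based HMM decoder.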