State mixture modelling applied to speech recognition

Dat Tran a,*, Michael Wagner a, Tongtao Zheng b

a Human–Computer Communication Laboratory, School of Computing, University of Canberra, Canberra, Australia
b School of Asian Languages & Studies, University of Tasmania, Launceston, TAS 7250, Australia

Abstract

In state mixture modelling (SMM), the temporal structure of the observation sequences is represented by the state joint probability distribution, in which mixtures of states are considered. The technique is developed in an iterative scheme via maximum likelihood estimation. A fuzzy estimation approach is also introduced to complement the SMM model. This new approach not only reduces the number of calculations from 2TN^T (direct calculation in the HMM) or N²T (forward–backward algorithm) to only 2NT, but also achieves a better recognition result. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Maximum likelihood estimation; Fuzzy estimation; Speech recognition

1. Introduction

Let O = (o_1, o_2, ..., o_T) be an observation sequence of a spoken word and let λ denote a model parameter set. The basic problem in speech modelling is how to compute P(O|λ), the probability of the observation sequence O given the model λ, efficiently.

The simplest solution is to assume only statistical independence between the observations, so that P(O|λ) is computed as the product of the probabilities of the individual observations. The computations are simple, and when the observations are continuous vectors, probability density functions can be used to model the speech data. The disadvantage of this approach is that the temporal structure of the observation sequence is not taken into account. An application is Gaussian mixture modelling for speaker recognition (Reynolds, 1992).

To overcome this disadvantage, a better solution, applied in the hidden Markov model (HMM), is to use hidden state variables modelled as Markov chains.
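The independence-assumption likelihood just described can be sketched as follows. This is a minimal illustration, not code from the paper: the Gaussian-mixture frame density and all parameter names are assumptions, and the product over frames is taken in the log domain for numerical stability. Note that because the frames are treated as independent, the result is unchanged by reordering the observation sequence, which is exactly the loss of temporal structure discussed above.

```python
import numpy as np

def independent_sequence_log_likelihood(obs, weights, means, covs):
    """log P(O|lambda) under the frame-independence assumption:
    the sequence likelihood is the product over frames of a
    Gaussian-mixture density, so temporal order is ignored.
    obs: (T, d) array of acoustic vectors."""
    total = 0.0
    for o_t in obs:
        # mixture density p(o_t) = sum_m w_m N(o_t; mu_m, Sigma_m)
        p = 0.0
        for w, mu, cov in zip(weights, means, covs):
            d = o_t - mu
            inv = np.linalg.inv(cov)
            norm = 1.0 / np.sqrt((2 * np.pi) ** len(mu) * np.linalg.det(cov))
            p += w * norm * np.exp(-0.5 * d @ inv @ d)
        total += np.log(p)
    return total
```

Reversing the frame order leaves the value untouched, which makes the modelling limitation concrete.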
Observations are statistically independent of one another but dependent on the states. The temporal structure of the observation sequence is represented by the Markov chain, in which the state variables are restricted to a finite number of values and the state-transition probabilities are assumed to be time invariant (Rabiner and Juang, 1986). The complete parameter set of the HMM is λ = {π, A, B}, where π is the initial state distribution, A is the state-transition probability distribution and B is the observation symbol probability distribution. Although the application of the HMM to speech recognition has been a success, the computations in the HMM are relatively complicated. Consider the computational cost required for evaluating P(O|λ). If an utterance consists of T acoustic vectors, for an N-state

Pattern Recognition Letters 20 (1999) 1449–1456

* Corresponding author. E-mail addresses: dat@hcc1.canberra.edu.au (D. Tran), miw@hcc1.canberra.edu.au (M. Wagner), Tongtao.Zheng@utas.edu.au (T. Zheng).
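The O(N²T) forward recursion mentioned in the abstract, which replaces the O(2TN^T) direct summation over all state paths, can be sketched as follows. This is an illustrative implementation, not code from the paper; it assumes the per-frame observation probabilities b_j(o_t) have already been evaluated into a T×N array.

```python
import numpy as np

def forward_prob(obs_probs, pi, A):
    """P(O|lambda) via the forward algorithm in O(N^2 T) operations.
    obs_probs[t, j] = b_j(o_t); pi is the initial state distribution;
    A[i, j] is the state-transition probability from state i to j."""
    T, N = obs_probs.shape
    alpha = pi * obs_probs[0]               # alpha_1(j) = pi_j * b_j(o_1)
    for t in range(1, T):
        # induction: alpha_t(j) = [sum_i alpha_{t-1}(i) a_ij] * b_j(o_t)
        alpha = (alpha @ A) * obs_probs[t]
    return alpha.sum()                      # P(O|lambda) = sum_j alpha_T(j)
```

For small N and T the result can be checked against direct enumeration of all N^T state sequences, which is the comparison the cost figures above refer to.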