A STUDY OF MINIMUM CLASSIFICATION ERROR TRAINING FOR SEGMENTAL SWITCHING LINEAR GAUSSIAN HIDDEN MARKOV MODELS

Jian Wu, Donglai Zhu and Qiang Huo
Department of Computer Science and Information Systems
The University of Hong Kong, Pokfulam Road, Hong Kong, China
(Email: jwu@csis.hku.hk, dlzhu@csis.hku.hk, qhuo@csis.hku.hk)

ABSTRACT

In our previous work, a Switching Linear Gaussian Hidden Markov Model (SLGHMM) and its segmental derivative, SSLGHMM, were proposed to cast the problem of modelling a noisy speech utterance as a well-designed dynamic Bayesian network. We presented parameter learning procedures for both models under the maximum likelihood (ML) criterion. The effectiveness of such models was confirmed by evaluation experiments on the Aurora2 database. In this paper, we present a study of minimum classification error (MCE) training for SSLGHMMs and discuss its relation to our earlier proposals based on stochastic vector mapping. An important implementation issue of SSLGHMMs, namely the specification of switching states for a given utterance, is also studied. New evaluation results on the Aurora3 database show that MCE-trained SSLGHMMs achieve a relative error reduction of 21% over a baseline system based on ML-trained continuous density HMMs (CDHMMs).

1. INTRODUCTION

A Switching Linear Gaussian Hidden Markov Model (SLGHMM), as shown in Fig. 1(a), was proposed in [5] to compensate for the nonstationary distortion that may exist in a speech utterance to be recognized. It is a hybrid Dynamic Bayesian Network (DBN) with two coupled streams of dynamic models. One stream is a Continuous Density HMM (CDHMM) that models the generic linguistic information of clean speech.
Another stream is a Switching Linear Gaussian model that models the nonstationary distortion mechanism with a set of parallel linear Gaussian dynamic streams (each representing a possible additive stationary distortion in the feature vector space) and a discrete-state Markov chain (controlling the choice of the distortion source at each time step). An SLGHMM with such a mechanism is thus able to model approximately the distribution of speech corrupted by switching-condition distortions. In [7], a variational approach was proposed to solve the approximate maximum likelihood (ML) parameter learning and probabilistic inference problems for SLGHMMs. Unfortunately, it is not computationally feasible for ASR applications that require prompt response. Therefore, a Segmental SLGHMM (SSLGHMM hereinafter), as illustrated in Fig. 1(b), was proposed in [5].

This research was supported by grants from the RGC of the Hong Kong SAR (Project Numbers HKU7022/00E and HKU7039/02E).

Fig. 1. Directed acyclic graph specifying conditional independence relations for (a) SLGHMM and (b) SSLGHMM

In an SSLGHMM, several assumptions are made to simplify the model. Each switch state s_t is assumed to be independent of all switch states at other times, and the s_t's are treated as observations of this DBN. The values of s_t are assigned by an appropriate pre-segmentation procedure. For a particular stream k, all of the biases b_t^(k) are assumed to follow an i.i.d. (independent and identically distributed) Gaussian distribution N(b; μ^(k), Σ^(k)). It is assumed that y_t = x_t + b_t^(s_t) + e_t given s_t, where e_t is a zero-mean Gaussian noise with diagonal covariance matrix R. It is further assumed that each CDHMM of the recognizer is fixed but unknown.
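As a concrete illustration of these assumptions, the following NumPy sketch draws frames from the segmental observation model y_t = x_t + b_t^(s_t) + e_t with pre-assigned switch states, then checks that within each segment the distortion behaves as a stationary additive bias. All dimensions, means and variances here are made-up stand-ins for illustration, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): D-dim features,
# K parallel bias streams, T frames.
D, K, T = 3, 2, 100_000

# Per-stream bias distribution: b_t^(k) ~ N(mu_b[k], diag(var_b[k])), i.i.d. over t.
mu_b = rng.normal(size=(K, D))
var_b = np.full((K, D), 0.01)

# Residual noise e_t ~ N(0, diag(var_e)).
var_e = np.full(D, 0.04)

# Switch states s_t are *observed*, assigned by a pre-segmentation step;
# here we fake one: stream 0 for the first half, stream 1 for the second.
s = np.concatenate([np.zeros(T // 2, int), np.ones(T - T // 2, int)])

# Stand-in "clean speech" frames x_t (a single Gaussian, not a real CDHMM).
x = rng.normal(size=(T, D))

# The SSLGHMM observation equation: y_t = x_t + b_t^(s_t) + e_t.
b = rng.normal(mu_b[s], np.sqrt(var_b[s]))
e = rng.normal(0.0, np.sqrt(var_e), size=(T, D))
y = x + b + e

# Within each segment the distortion is a noisy but stationary additive bias,
# so averaging y - x over a segment recovers that stream's bias mean.
est = np.array([(y - x)[s == k].mean(axis=0) for k in range(K)])
print(np.max(np.abs(est - mu_b)))  # close to 0 for large T
```

The point of the sketch is the role of the pre-segmentation: once s_t is fixed, each segment reduces to a plain additive-bias model, which is what makes the segmental variant tractable compared with inferring the switch sequence jointly.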
Each such CDHMM consists of N states, with transition probability a_{ij} from state i to state j. Each state j has a number of Gaussian mixture components, with D-dimensional mean vectors μ_{jm} and diagonal covariance matrices Σ_{jm}; c_{jm} denotes the weight of the m-th Gaussian component in the j-th state. Then the joint likelihood of the observations and hidden variables, given the switch states S and a particular parameter set Λ of the SSLGHMM, is

  p(Y, X, B, M, Q | S, Λ) = ∏_{t=1}^{T} a_{q_{t-1} q_t} c_{q_t m_t} N(x_t; μ_{q_t m_t}, Σ_{q_t m_t}) N(y_t; x_t + b_t^(s_t), R) ∏_{k=1}^{K} N(b_t^(k); μ^(k), Σ^(k)),   (1)

from which the marginal distribution of Y given the parameters can

INTERSPEECH 2004 - ICSLP, 8th International Conference on Spoken Language Processing, ICC Jeju, Jeju Island, Korea, October 4-8, 2004. ISCA Archive: http://www.isca-speech.org/archive, doi:10.21437/Interspeech.2004-725