A SIMPLIFIED SUBSPACE GAUSSIAN MIXTURE TO COMPACT ACOUSTIC MODELS FOR SPEECH RECOGNITION

Mohamed Bouallegue, Driss Matrouf, Georges Linares
LIA, University of Avignon, France
mohamed.bouallegue@etd.univ-avignon.fr, driss.matrouf@univ-avignon.fr, georges.linares@univ-avignon.fr

ABSTRACT

Speech recognition applications are known to require a significant amount of resources (memory, computing power). However, embedded speech recognition systems, such as those in mobile phones, allow only a few KB of memory and a few MIPS. In the context of HMM-based speech recognizers, each HMM-state distribution is modeled independently of the others and has a large number of parameters. In spite of state-tying techniques, the size of the acoustic models remains large and a certain redundancy remains between states. In this paper, we investigate the capacity of the Subspace Gaussian Mixture approach to reduce the acoustic model size while keeping good performance. We introduce a simplification of the state-specific Gaussian weight estimation, which is a very complex and time-consuming procedure in the original approach. With this approach, we show that the acoustic model size can be reduced by 92% with almost the same performance as standard acoustic modeling.

Index Terms— Compact Acoustic Models, Subspace Gaussian Mixture, Embedded speech recognition, Gaussian Mixture Models, Hidden Markov Models

1. INTRODUCTION

Most state-of-the-art Continuous Speech Recognition systems are based on Hidden Markov Models that represent elementary speech units, typically phonemes or triphones. Usually, the state-level probabilities are estimated using Gaussian Mixture Models (GMM), which offer several advantages: a well-established mathematical formalism, automatic parameter training and good performance. To achieve good performance, the number of states, and hence the number of parameters, becomes larger and larger (several tens of thousands of parameters).
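The GMM-based state emission probability mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation; it simply shows how a frame's log-likelihood under one HMM state's diagonal-covariance GMM is computed (all variable names are illustrative):

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of one feature frame x under a diagonal-covariance
    GMM, as used for HMM state emission probabilities (sketch only).
    weights: (I,), means: (I, D), variances: (I, D)."""
    # Per-component diagonal-Gaussian log-densities
    log_det = np.sum(np.log(2.0 * np.pi * variances), axis=1)
    mahalanobis = np.sum((x - means) ** 2 / variances, axis=1)
    log_gauss = -0.5 * (log_det + mahalanobis)
    # Weighted log-sum-exp over the mixture components
    return np.logaddexp.reduce(np.log(weights) + log_gauss)

# Toy usage: a 4-component GMM over 3-dimensional features
rng = np.random.default_rng(0)
w = np.full(4, 0.25)
mu = rng.normal(size=(4, 3))
var = np.ones((4, 3))
ll = gmm_log_likelihood(rng.normal(size=3), w, mu, var)
```

With many states and many Gaussians per state, storing a separate (weights, means, variances) triple per state is what makes such models large, which motivates the parameter-sharing approaches discussed next.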
In this paper we deal with the problem of reducing the number of parameters while keeping good performance. Previous works have addressed this problem, mainly with the purpose of reducing the memory footprint of acoustic models [1] [2]. Semi-continuous HMMs (SCHMM) are based on a Gaussian codebook shared between all HMM states, state models resulting from a specific weighting of the common Gaussian set [3]. This full Gaussian tying allows a significant reduction of model complexity, but with a significant decrease in accuracy [4]. Some authors extend this modeling by using compact transformation functions that map the Gaussian codebook to state-dependent GMMs [1]. In spite of the use of the state-tying technique, the size of the acoustic models remains large and a certain redundancy remains between states. In this paper we propose to use the Subspace Gaussian Mixture (SGM) approach to allow a further reduction of acoustic model size: all state GMMs are derived from a single GMM, called the GMM-UBM (Universal Background Model), with a very small set of state-specific parameters. This approach has some similarities to Eigenvoice [5] and Cluster Adaptive Training [6]. In the SGM approach, the state-specific weights are estimated using a complex and very time-consuming procedure. We replace this procedure with a simple EM estimation, keeping only the N best weights in each HMM state.

In Section 2, we recall the standard acoustic modeling [7]. In Section 3, we describe the proposed approach for compacting the acoustic models using the SGM approach, sketching the differences from other similar approaches. In Section 4, we present some experimental results. Finally, conclusions are drawn in Section 5.

2. THE STANDARD ACOUSTIC MODEL STRUCTURE FOR SPEECH RECOGNITION

The baseline system adopted in this work uses a left-to-right HMM architecture of 10,002 context-dependent phonemes.
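The two ingredients of the simplified SGM approach described in the introduction — deriving each state's Gaussian parameters from shared subspace quantities, and keeping only the N best mixture weights per state — can be sketched as follows. Matrix shapes and function names here are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def state_means(M, v):
    """Derive one state's GMM means from the shared subspace.
    M: (I, D, S) -- one D x S projection matrix per UBM Gaussian
                    (shared by all states).
    v: (S,)      -- the low-dimensional state vector, i.e. the only
                    state-specific mean parameters.
    Returns the (I, D) state-dependent means."""
    return np.einsum('ids,s->id', M, v)

def prune_weights(weights, n_best):
    """Keep only the N largest mixture weights of a state (the
    simplification proposed in the paper) and renormalise to sum to 1."""
    idx = np.argsort(weights)[::-1][:n_best]
    kept = np.zeros_like(weights)
    kept[idx] = weights[idx]
    return kept / kept.sum()
```

The storage saving is visible in the signatures: instead of I full mean vectors per state, each state keeps only the S-dimensional vector `v` (with S much smaller than I times D) plus a handful of surviving weights.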
To reduce the size of the acoustic models, we used the principle of state clustering, where we replace two or more HMM states having very similar data (similar GMM parameters) with a single state by "tying" these states together [8]. This technique allows us to model 10,002 context-dependent phonemes by 3327 states instead of 30,006, with 64 Gaussians per state and 39 PLP (Perceptual Linear Predictive) coefficients per frame (13 static with first and second deriva-

978-1-4577-0539-7/11/$26.00 ©2011 IEEE    ICASSP 2011