A SIMPLIFIED SUBSPACE GAUSSIAN MIXTURE TO COMPACT ACOUSTIC MODELS FOR
SPEECH RECOGNITION
Mohamed Bouallegue, Driss Matrouf, Georges Linares
LIA, University of Avignon, France
mohamed.bouallegue@etd.univ-avignon.fr
driss.matrouf@univ-avignon.fr, georges.linares@univ-avignon.fr
ABSTRACT
Speech recognition applications are known to require a significant amount of resources (memory, computing power). However, embedded speech recognition systems, such as those in mobile phones, allow only a few KB of memory and a few MIPS.
the context of HMM-based speech recognizers, each HMM-
state distribution is modeled independently from to the other
and has a large amount of parameters. In spite of using state-
tying techniques, the size of the acoustic models stays large
and certain redundancy remains between states. In this paper,
we investigate the capacity of the Subspace Gaussian Mix-
ture approach to reduce the acoustic models size while keep-
ing good performances. We introduce a simplification con-
cerning state specific Gaussians weights estimation, which is
a very complex and time consuming procedure in the origi-
nal approach. With this approach, we show that the acoustic
model size can be reduced by 92% with almost the same per-
formance as the standard acoustic modeling.
Index Terms— Compact Acoustic Models, Subspace
Gaussian Mixture, Embedded speech recognition, Gaussian
Mixture Models, Hidden Markov Models
1. INTRODUCTION
Most of the state-of-the-art Continuous Speech Recognition
Systems are based on Hidden Markov Models that represent
elementary speech units, typically phonemes or triphones.
Usually, the state-level probabilities are estimated using Gaussian Mixture Models (GMMs), which offer several advantages: a well-established mathematical formalism, automatic parameter training, and good performance. To achieve good accuracy, however, the number of states, and hence the number of parameters, becomes very large. In this paper we deal with the problem of reducing the number of parameters while keeping
good performance. Previous works have addressed this problem, mainly with the purpose of reducing the memory footprint of acoustic models [1][2]. Semi-continuous HMMs (SCHMMs) are based on a Gaussian codebook shared among all HMM states, with state models resulting from state-specific weightings of the common Gaussian set [3]. This full Gaussian tying allows a significant reduction of model complexity, but at the cost of a significant accuracy decrease [4]. Some authors extend this modeling by using compact transformation functions that map the Gaussian codebook to state-dependent GMMs [1].
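To make the semi-continuous idea concrete, the following is a minimal numpy sketch (ours, not from [3]): two states share one diagonal-covariance Gaussian codebook and differ only in their mixture weights, so the per-state storage reduces to a weight vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared codebook: M diagonal-covariance Gaussians in D dimensions.
M, D = 8, 3
means = rng.normal(size=(M, D))
variances = np.full((M, D), 1.0)

def log_gauss(x, mu, var):
    """Log-density of a diagonal-covariance Gaussian at x."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def state_log_likelihood(x, weights):
    """SCHMM state log-likelihood: weighted sum over the shared codebook."""
    log_comps = np.array([log_gauss(x, means[m], variances[m]) for m in range(M)])
    # log-sum-exp for numerical stability
    scores = log_comps + np.log(weights)
    a = np.max(scores)
    return a + np.log(np.sum(np.exp(scores - a)))

# Two states share the same Gaussians and differ only in their weights.
w_state1 = rng.dirichlet(np.ones(M))
w_state2 = rng.dirichlet(np.ones(M))
x = rng.normal(size=D)
print(state_log_likelihood(x, w_state1), state_log_likelihood(x, w_state2))
```

In this toy setting the codebook (means and variances) is stored once, and each additional state costs only M weights instead of M full Gaussians.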
In spite of the use of the state-tying technique, the size of the acoustic models remains large and some redundancy persists between states. In this paper we propose to use the Subspace Gaussian Mixture (SGM) approach to achieve a further reduction of the acoustic model size. All state GMMs are derived from a single GMM, called the GMM-UBM (Universal Background Model), plus a very small number of state-specific parameters. This approach has some similarities to Eigenvoices [5] and Cluster Adaptive Training [6]. In the SGM approach, the state-specific weights are estimated using a complex and very time-consuming procedure. We replace this procedure with a simple EM estimation, keeping only the N best weights in each HMM state.
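As an illustrative sketch (the function name and toy values are ours, not the paper's), the N-best weight truncation can be written in a few lines of numpy: keep the N largest EM-estimated weights of a state and renormalize them so they again sum to one.

```python
import numpy as np

def prune_to_n_best(weights, n_best):
    """Keep only the n_best largest mixture weights and renormalize.

    Illustrative stand-in for the simplification described above: the
    weights obtained by plain EM are truncated to their N best entries
    instead of using the full state-specific estimation of the original
    SGM approach.
    """
    weights = np.asarray(weights, dtype=float)
    keep = np.argsort(weights)[-n_best:]   # indices of the N largest weights
    pruned = np.zeros_like(weights)
    pruned[keep] = weights[keep]
    return pruned / pruned.sum()           # renormalize to sum to 1

w = np.array([0.40, 0.25, 0.05, 0.20, 0.10])
print(prune_to_n_best(w, 3))               # only the 3 largest weights survive
```

Only the surviving N weights (and their indices into the shared Gaussian set) need to be stored per state, which is where the memory saving comes from.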
In Section 2, we recall the standard acoustic modeling [7]. In Section 3, we describe the proposed approach for compacting the acoustic models using the SGM approach, sketching the differences from other similar approaches. In Section 4, we present some experimental results. Finally, conclusions are drawn in Section 5.
2. THE STANDARD ACOUSTIC MODEL
STRUCTURE FOR SPEECH RECOGNITION
The baseline system adopted in this work uses a left-to-right HMM architecture modeling 10,002 context-dependent phonemes. To reduce the size of the acoustic models, we used the principle of state clustering, where two or more HMM states with very similar data (similar GMM parameters) are replaced by a single state by "tying" these states together [8]. This technique allows us to model the 10,002 context-dependent phonemes with 3327 states instead of 30,006, with 64 Gaussians per state and 39 PLP (Perceptual Linear Predictive) coefficients per frame (13 static coefficients with their first and second derivatives).

978-1-4577-0539-7/11/$26.00 ©2011 IEEE ICASSP 2011