IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 21, NO. 6, JUNE 2013 1285
Eigentriphones for Context-Dependent
Acoustic Modeling
Tom Ko, Student Member, IEEE, and Brian Mak, Senior Member, IEEE
Abstract—Most automatic speech recognizers employ tied-state
triphone hidden Markov models (HMM), in which the corre-
sponding triphone states of the same base phone are tied. State
tying is commonly performed with the use of a phonetic regression
class tree which renders robust context-dependent modeling pos-
sible by carefully balancing the amount of training data with the
degree of tying. However, tying inevitably introduces quantization
error: triphones tied to the same state are not distinguishable in
that state. Recently, we proposed a new triphone modeling approach
called eigentriphone modeling, in which all triphone models
are, in general, distinct. The idea is to create an eigenbasis for
each base phone (or phone state) and to represent all its triphones
(or triphone states) as distinct points in the space spanned
by the basis. We have shown that triphone HMMs trained using
model-based or state-based eigentriphones perform at least as
well as conventional tied-state HMMs. In this paper, we further
generalize the definition of eigentriphones over clusters of acoustic
units. Our experiments on TIMIT phone recognition and the Wall
Street Journal 5K-vocabulary continuous speech recognition show
that eigentriphones estimated from state clusters defined by the
nodes in the same phonetic regression class tree used in state tying
result in further performance gains.
Index Terms—Eigentriphone, tied state, context dependency,
regularization, weighted PCA.
I. INTRODUCTION
A critical issue in context-dependent (CD) acoustic
modeling is how to robustly estimate the model parameters
of rarely occurring acoustic units. For instance, it has been
found that the distribution of triphones in the HUB2 training set
of the Wall Street Journal corpus [1] obeys the Pareto Principle
(or the 80/20 Rule) [2]: roughly 80% of triphone occurrences
in the corpus come from 20% of all the distinct triphones in
the corpus [3]. Naive maximum-likelihood (ML) estimation
of the hidden Markov model (HMM) parameters of these
infrequent context-dependent acoustic units will produce poor
triphone models, which will affect the overall performance of
an automatic speech recognition (ASR) system. Past solutions
for robust estimation of CD acoustic models may be roughly
classified into three categories: triphone-by-composition [4],
parameter tying [5], and a basis approach.
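The skew described above can be checked on any table of triphone counts. The following is a minimal sketch in Python; the function name `coverage_of_top_types` and the counts are made up for illustration and are not statistics from the Wall Street Journal corpus.

```python
from collections import Counter


def coverage_of_top_types(counts, top_fraction=0.2):
    """Fraction of all occurrences covered by the most frequent
    `top_fraction` of distinct types."""
    ordered = sorted(counts.values(), reverse=True)
    n_top = max(1, round(top_fraction * len(ordered)))
    return sum(ordered[:n_top]) / sum(ordered)


# Toy, invented triphone counts (left-base+right notation).
toy_counts = Counter({
    "s-ih+t": 500, "t-ah+n": 300, "k-ae+t": 60, "d-ao+g": 40, "b-iy+p": 30,
    "m-uw+n": 25, "f-eh+l": 20, "z-ay+k": 15, "g-ow+th": 7, "p-oy+z": 3,
})

coverage = coverage_of_top_types(toy_counts)
print(f"top 20% of types cover {coverage:.0%} of occurrences")
```

In this toy table, the two most frequent of the ten triphones account for 800 of the 1000 occurrences, while the rarest triphones have too few samples for reliable ML estimation.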
Manuscript received August 18, 2012; revised December 24, 2012; accepted
February 07, 2013. Date of publication February 25, 2013; date of current ver-
sion March 13, 2013. This work was supported in part by the Research Grants
Council of the Hong Kong SAR under Grants SRFI11EG15, FSGRF12EG31,
and FSGRF13EG20. The associate editor coordinating the review of this man-
uscript and approving it for publication was Prof. Thomas Fang Zheng.
The authors are with the Department of Computer Science and Engineering,
the Hong Kong University of Science and Technology, Clear Water Bay, Hong
Kong (e-mail: tomko@cse.ust.hk; mak@cse.ust.hk).
Digital Object Identifier 10.1109/TASL.2013.2248722
Model interpolation [6] and quasi-triphones [7] are typical
examples of the triphone-by-composition method. In both
examples, CD models are constructed by combining triphone
models that may not be well trained with robustly trained
acoustic models that capture weaker contextual information.
For instance, in [6], a triphone state distribution is generated
by a linear combination of its ML estimate and the state distri-
butions from its corresponding left-context-dependent model,
right-context-dependent model, and/or context-independent
model using deleted interpolation. In [7], it is assumed that the
left context of a phone influences mostly its beginning whereas
its right context influences mostly its ending. Thus, a three-state
triphone model is generated in such a way that the first and the
last states are conditioned only on its left and right contexts
respectively, whereas the middle state is context-independent.
Recently, another example of triphone-by-composition called
back-off acoustic modeling [8] was proposed. The new method
combines the score of a triphone with scores from triphones
that are estimated under broad phonetic class contexts of its left
and right phones.
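The linear combination used in model interpolation can be illustrated with discrete state output distributions. The sketch below is only an assumption-laden illustration: the distributions are invented, and the weights are hand-picked rather than estimated by deleted interpolation as in [6].

```python
def interpolate(distributions, weights):
    """Convex combination of discrete distributions defined over the
    same set of symbols."""
    if len(distributions) != len(weights):
        raise ValueError("need one weight per distribution")
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    n_symbols = len(distributions[0])
    return [
        sum(w * d[i] for w, d in zip(weights, distributions))
        for i in range(n_symbols)
    ]


# Hypothetical output distributions over three codebook symbols.
triphone_ml = [0.90, 0.10, 0.00]  # ML estimate from very few samples
left_cd = [0.60, 0.30, 0.10]      # left-context-dependent model
right_cd = [0.50, 0.30, 0.20]     # right-context-dependent model
ci = [0.40, 0.30, 0.30]           # context-independent model

# Hand-picked weights (an assumption of this sketch).
weights = [0.4, 0.3, 0.2, 0.1]

smoothed = interpolate([triphone_ml, left_cd, right_cd, ci], weights)
print(smoothed)
```

The zero probability in the poorly trained ML estimate is replaced by a small positive value borrowed from the more robustly trained, less specific models.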
Parameter tying is another solution that is widely used in
ASR systems because of its proven effectiveness in simulta-
neously reducing model size and enhancing recognition speed.
Tying has been applied successfully at various levels of the HMM, for
example, in generalized triphones [9], tied states [10], shared
distributions or senones [11], and tied subspace Gaussian
distributions [12]. Among these parameter tying methods, state tying
[10] is probably the most popular approach in context-dependent
acoustic modeling. The degree of state tying, that is, the
number of tied states, can be well managed by a (binary)
regression class tree, using questions that are based on acoustics
[13] or phonetic knowledge [14]. The use of a phonetic
regression class tree offers the additional benefit of synthesizing
unseen triphones in the test lexicon.
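How a phonetic tree maps triphone states to tied states, including those of unseen triphones, can be sketched as follows. The tree, the phone classes, and the tied-state labels are all hypothetical and hand-built for illustration; in practice such a tree is grown from training data rather than written by hand.

```python
# Hypothetical phone classes used by the questions.
NASALS = {"m", "n", "ng"}
VOWELS = {"aa", "ae", "ah", "iy", "uw"}

# Toy tree for one state of base phone "ae". An internal node is a tuple
# (question, yes_subtree, no_subtree); a leaf is a tied-state label.
TREE = (
    lambda left, right: left in NASALS,       # "Is the left phone a nasal?"
    "ae_s2_tied0",
    (
        lambda left, right: right in VOWELS,  # "Is the right phone a vowel?"
        "ae_s2_tied1",
        "ae_s2_tied2",
    ),
)


def tied_state(tree, left, right):
    """Walk the tree by answering its questions until a leaf is reached."""
    while not isinstance(tree, str):
        question, yes_branch, no_branch = tree
        tree = yes_branch if question(left, right) else no_branch
    return tree


# Two different triphones fall into the same leaf and share that state.
print(tied_state(TREE, "m", "t"), tied_state(TREE, "n", "k"))
# A triphone that never occurred in training still reaches some leaf.
print(tied_state(TREE, "k", "iy"))
```

Because every context answers every question, any triphone in the test lexicon is assigned a tied state, and triphones that share a leaf become indistinguishable in that state.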
Recently, a basis approach has been emerging. In the basis approach,
one or more bases are constructed so that model parameters
may be derived from the basis vectors or functions. For ex-
ample, semi-continuous hidden Markov model (SCHMM) [15],
[16] and subspace Gaussian mixture model (SGMM) [17] both
employ a basis of Gaussians, whereas Bayesian sensing HMM
[18] uses sets of state-dependent basis vectors. Similarly, in the
canonical state model (CSM) [19] framework, there exists a fi-
nite set of canonical states from which every context-dependent
state in an ASR system is transformed. The set of canonical
states captures the relationship between the context-dependent
states through some transformation functions. It has been shown
that both SCHMM and SGMM can be derived from the CSM
framework.
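The common idea behind these basis methods, deriving each model's parameters from a small shared basis, can be sketched generically. The example below is an illustrative assumption and not the parameterization of SCHMM, SGMM, Bayesian sensing HMM, or CSM: it simply forms a parameter vector as a weighted sum of shared basis vectors, with invented numbers.

```python
def from_basis(basis, weights, offset=None):
    """Return offset + sum_k weights[k] * basis[k] as a list of floats."""
    if len(basis) != len(weights):
        raise ValueError("need one weight per basis vector")
    dim = len(basis[0])
    vec = list(offset) if offset is not None else [0.0] * dim
    for w, b in zip(weights, basis):
        for i in range(dim):
            vec[i] += w * b[i]
    return vec


# Two shared basis vectors in a 3-dimensional parameter space (invented).
basis = [[1.0, 0.0, 1.0],
         [0.0, 2.0, -1.0]]

# Each state stores only its own weights, yet the states remain distinct.
state_a = from_basis(basis, [0.5, 1.0])
state_b = from_basis(basis, [2.0, -0.5])
print(state_a, state_b)
```

Each state is described by as many numbers as there are basis vectors, which is the appeal of a basis when data for individual states are scarce.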