IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 21, NO. 6, JUNE 2013

Eigentriphones for Context-Dependent Acoustic Modeling

Tom Ko, Student Member, IEEE, and Brian Mak, Senior Member, IEEE

Abstract—Most automatic speech recognizers employ tied-state triphone hidden Markov models (HMMs), in which the corresponding triphone states of the same base phone are tied. State tying is commonly performed with a phonetic regression class tree, which makes robust context-dependent modeling possible by carefully balancing the amount of training data against the degree of tying. However, tying inevitably introduces quantization error: triphones tied to the same state are not distinguishable in that state. Recently we proposed a new triphone modeling approach called eigentriphone modeling, in which all triphone models are, in general, distinct. The idea is to create an eigenbasis for each base phone (or phone state) so that all its triphones (or triphone states) are represented as distinct points in the space spanned by the basis. We have shown that triphone HMMs trained using model-based or state-based eigentriphones perform at least as well as conventional tied-state HMMs. In this paper, we further generalize the definition of eigentriphones over clusters of acoustic units. Our experiments on TIMIT phone recognition and on the Wall Street Journal 5K-vocabulary continuous speech recognition task show that eigentriphones estimated from state clusters defined by the nodes of the same phonetic regression class tree used in state tying yield a further performance gain.

Index Terms—Eigentriphone, tied state, context dependency, regularization, weighted PCA.

I. INTRODUCTION

A critical issue in context-dependent (CD) acoustic modeling is how to robustly estimate the model parameters of rarely occurring acoustic units.
For instance, the distribution of triphones in the HUB2 training set of the Wall Street Journal corpus [1] has been found to obey the Pareto Principle (or the 80/20 rule) [2]: roughly 80% of triphone occurrences in the corpus come from 20% of the distinct triphones in the corpus [3]. Naive maximum-likelihood (ML) estimation of the hidden Markov model (HMM) parameters of these infrequent context-dependent acoustic units produces poor triphone models, which in turn degrades the overall performance of an automatic speech recognition (ASR) system. Past solutions for the robust estimation of CD acoustic models may be roughly classified into three categories: triphone-by-composition [4], parameter tying [5], and basis approaches.

Manuscript received August 18, 2012; revised December 24, 2012; accepted February 07, 2013. Date of publication February 25, 2013; date of current version March 13, 2013. This work was supported in part by the Research Grants Council of the Hong Kong SAR under Grants SRFI11EG15, FSGRF12EG31, and FSGRF13EG20. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Thomas Fang Zheng. The authors are with the Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong (e-mail: tomko@cse.ust.hk; mak@cse.ust.hk). Digital Object Identifier 10.1109/TASL.2013.2248722

Model interpolation [6] and quasi-triphones [7] are typical examples of the triphone-by-composition method. In both cases, CD models are constructed by combining triphone models that may not be well trained with robustly trained acoustic models that capture weaker contextual information. For instance, in [6], a triphone state distribution is generated as a linear combination of its ML estimate and the state distributions of its corresponding left-context-dependent model, right-context-dependent model, and/or context-independent model, with the combination weights estimated by deleted interpolation.
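The interpolation scheme of [6] can be illustrated with a minimal sketch. The distributions, weights, and helper function below are hypothetical toy values chosen for illustration; in practice the weights are estimated by deleted interpolation on held-out data rather than fixed by hand.

```python
import numpy as np

def interpolate(dists, lambdas):
    """Convex combination of distributions; weights must sum to one."""
    lambdas = np.asarray(lambdas, dtype=float)
    assert np.isclose(lambdas.sum(), 1.0)
    return sum(w * d for w, d in zip(lambdas, np.asarray(dists, dtype=float)))

# Toy discrete output distributions over 4 observation symbols.
# A rare triphone's ML estimate is sparse (unseen symbols get zero mass);
# broader-context models provide smoother backoff distributions.
ml_triphone = np.array([0.70, 0.30, 0.00, 0.00])  # sparse ML estimate
left_cd     = np.array([0.50, 0.30, 0.15, 0.05])  # left-context-dependent
ci_phone    = np.array([0.40, 0.30, 0.20, 0.10])  # context-independent

smoothed = interpolate([ml_triphone, left_cd, ci_phone], [0.5, 0.3, 0.2])
# The interpolated distribution assigns nonzero mass to symbols the
# rare triphone never emitted in training.
```

Note how the combination repairs the zero-probability entries of the sparse ML estimate while still being dominated by it where data exist.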
In [7], it is assumed that the left context of a phone influences mostly its beginning, whereas its right context influences mostly its ending. Thus, a three-state triphone model is generated in such a way that the first and last states are conditioned only on the left and right contexts respectively, whereas the middle state is context-independent. Recently, another triphone-by-composition example called back-off acoustic modeling [8] was proposed. The method combines the score of a triphone with the scores of triphones estimated under broad phonetic class contexts of its left and right phones.

Parameter tying is another solution that is widely used in ASR systems because of its proven effectiveness in simultaneously reducing model size and enhancing recognition speed. Various HMM parameters have been tied successfully, for example, generalized triphones [9], tied states [10], shared distributions or senones [11], and tied subspace Gaussian distributions [12]. Among these parameter tying methods, state tying [10] is probably the most popular approach in context-dependent acoustic modeling. The degree of state tying—that is, the number of tied states—can be well managed by a (binary) regression class tree, using questions that are based on acoustics [13] or phonetic knowledge [14]. The use of a phonetic regression class tree offers the additional benefit of synthesizing unseen triphones in the test lexicon.

Recently, a basis approach has been emerging. In the basis approach, one or more bases are constructed so that model parameters may be derived from the basis vectors or functions. For example, the semi-continuous hidden Markov model (SCHMM) [15], [16] and the subspace Gaussian mixture model (SGMM) [17] both employ a basis of Gaussians, whereas the Bayesian sensing HMM [18] uses sets of state-dependent basis vectors.
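The tree-based state tying of [10] described above can be sketched in miniature: states pooled at a tree node are split by whichever yes/no phonetic question yields the largest log-likelihood gain, with each cluster modeled by a single Gaussian. The one-dimensional statistics, question names, and variance floor below are toy assumptions for illustration only; real systems use full acoustic sufficient statistics and large question sets.

```python
import numpy as np

def cluster_loglik(means, counts):
    """Approximate log-likelihood of pooled 1-D Gaussian statistics."""
    means, counts = np.asarray(means, float), np.asarray(counts, float)
    n = counts.sum()
    mu = (counts * means).sum() / n
    var = (counts * (means - mu) ** 2).sum() / n + 1e-3  # variance floor
    return -0.5 * n * (np.log(2 * np.pi * var) + 1)

def best_split(means, counts, questions):
    """Pick the question maximizing the likelihood gain of a binary split.

    questions: dict mapping question name -> boolean mask over the states.
    """
    base = cluster_loglik(means, counts)
    gains = {}
    for name, mask in questions.items():
        mask = np.asarray(mask)
        if mask.all() or (~mask).all():       # degenerate split: skip
            continue
        gains[name] = (cluster_loglik(means[mask], counts[mask])
                       + cluster_loglik(means[~mask], counts[~mask]) - base)
    return max(gains, key=gains.get)

# Toy triphone states of one base phone: their means separate cleanly
# by whether the left context is a nasal, so that question should win.
means = np.array([1.0, 1.1, 5.0, 5.2])
counts = np.array([30, 25, 40, 35])
questions = {
    "L-nasal?": np.array([True, True, False, False]),
    "R-vowel?": np.array([True, False, True, False]),
}
chosen = best_split(means, counts, questions)
```

Splitting continues recursively until the likelihood gain or the data count at a node falls below a threshold; the resulting leaves are the tied states.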
Similarly, in the canonical state model (CSM) [19] framework, there exists a finite set of canonical states from which every context-dependent state in an ASR system is transformed. The set of canonical states captures the relationship between the context-dependent states through some transformation functions. It has been shown that both the SCHMM and the SGMM can be derived from the CSM framework.

1558-7916/$31.00 © 2013 IEEE
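Before the formal development, the eigentriphone idea summarized in the abstract can be made concrete with a minimal sketch: stack parameter supervectors of all triphones of one base phone, build an eigenbasis over them, and represent each triphone as a distinct point in the spanned subspace. The random toy supervectors and plain (unweighted) PCA below are illustrative assumptions; the paper's actual method uses a weighted PCA with regularized estimation of the coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
n_triphones, dim = 12, 6        # toy sizes; real supervectors are far larger
supervectors = rng.normal(size=(n_triphones, dim))

ref = supervectors.mean(axis=0)         # reference vector of the base phone
centered = supervectors - ref

# Eigenbasis from an SVD of the centered supervectors (plain PCA here;
# the weighted PCA of the paper would weight rows by triphone counts).
_, _, vt = np.linalg.svd(centered, full_matrices=False)
k = 3                                    # keep the k leading eigenvectors
basis = vt[:k]                           # shape (k, dim), orthonormal rows

# Each triphone is now a distinct point (coordinate vector) in the subspace,
# so no two triphones need collapse onto the same model as in state tying.
coords = centered @ basis.T              # shape (n_triphones, k)
reconstructed = ref + coords @ basis
```

Because the basis rows are orthonormal, the reconstruction residual is orthogonal to the subspace; shrinking k trades modeling detail for robustness, which is the lever the eigentriphone framework exploits for rare triphones.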