DRAFT ACCEPTED BY IEEE TRANS. ON AUDIO, SPEECH, AND LANGUAGE PROCESSING

Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition

George E. Dahl, Student Member, IEEE, Dong Yu, Senior Member, IEEE, Li Deng, Fellow, IEEE, and Alex Acero, Fellow, IEEE

Abstract—We propose a novel context-dependent (CD) model for large vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pre-training algorithm is a robust and often helpful way to initialize deep neural networks generatively; such initialization can aid optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CD-DNN-HMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform conventional context-dependent Gaussian mixture model (GMM)-HMMs, with absolute sentence accuracy improvements of 5.8% and 9.2% (relative error reductions of 16.0% and 23.2%) over CD-GMM-HMMs trained using the minimum phone error rate (MPE) and maximum likelihood (ML) criteria, respectively.

Index Terms—Speech recognition, deep belief network, context-dependent phone, LVSR, DNN-HMM, ANN-HMM

I. INTRODUCTION

EVEN after decades of research and many successfully deployed commercial products, the performance of automatic speech recognition (ASR) systems in real usage scenarios lags behind human-level performance (e.g., [2], [3]).
There have been some notable recent advances in discriminative training (see an overview in [4]; e.g., maximum mutual information (MMI) estimation [5], minimum classification error (MCE) training [6], [7], and minimum phone error (MPE) training [8], [9]), in large-margin techniques (such as large margin estimation [10], [11], large margin hidden Markov models (HMMs) [12], large-margin MCE [13]–[16], and boosted MMI [17]), as well as in novel acoustic models (such as conditional random fields (CRFs) [18]–[20], hidden CRFs [21], [22], and segmental CRFs [23]). Despite these advances, the elusive goal of human-level accuracy in real-world conditions requires continued, vibrant research.

Recently, a major advance has been made in training densely connected, directed belief nets with many hidden layers. The resulting deep belief nets learn a hierarchy of nonlinear feature detectors that can capture complex statistical patterns in data.

Copyright (c) 2010 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.
Manuscript received September 5, 2010. This manuscript greatly extends the work presented at ICASSP 2011 [1].
G. E. Dahl is affiliated with the University of Toronto. He contributed to this work while working as an intern at Microsoft Research (e-mail: gdahl@cs.toronto.edu).
D. Yu is with the Speech Research Group, Microsoft Research, One Microsoft Way, Redmond, WA 98034 USA (corresponding author; phone: +1-425-707-9282; fax: +1-425-936-7329; e-mail: dongyu@microsoft.com).
L. Deng is with the Speech Research Group, Microsoft Research, One Microsoft Way, Redmond, WA 98034 USA (e-mail: deng@microsoft.com).
A. Acero is with the Speech Research Group, Microsoft Research, One Microsoft Way, Redmond, WA 98034 USA (e-mail: alexac@microsoft.com).
The deep belief net training algorithm suggested in [24] first initializes the weights of each layer individually in a purely unsupervised¹ way and then fine-tunes the entire network using labeled data. This semi-supervised approach using deep models has proved effective in a number of applications, including coding and classification for speech, audio, text, and image data ([25]–[29]). These advances triggered interest in developing acoustic models based on pre-trained neural networks and other deep learning techniques for ASR. For example, context-independent pre-trained, deep neural network HMM hybrid architectures have recently been proposed for phone recognition [30]–[32] and have achieved very competitive performance.

Using pre-training to initialize the weights of a deep neural network has two main potential benefits that have been discussed in the literature. In [33], evidence was presented that is consistent with viewing pre-training as a peculiar sort of data-dependent regularizer whose effect on generalization error does not diminish with more data, even when the dataset is so vast that training cases are never repeated. The regularization effect from using information in the distribution of inputs can allow highly expressive models to be trained on comparably small quantities of labeled data. Additionally, [34], [33], and others have also reported experimental evidence consistent with pre-training aiding the subsequent optimization, typically performed by stochastic gradient descent. Thus, pre-trained neural networks often also achieve lower training error than neural networks that are not pre-trained (although this effect can often be confounded by the use of early stopping). These effects are especially pronounced in deep autoencoders. Deep belief network pre-training was the first pre-training method to be widely studied, although many other techniques now exist in the literature (e.g., [35]).
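The greedy layer-wise procedure described above can be sketched in a few lines. The following is a minimal illustrative NumPy implementation, assuming binary units and one-step contrastive divergence (CD-1) as the per-layer learning rule; it is not the authors' code, and all function names, layer sizes, and hyperparameters are illustrative. Each trained RBM's hidden activations serve as the training data for the next RBM, and the resulting weights would then initialize a DNN for supervised fine-tuning with backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.1):
    """Train a single binary RBM with 1-step contrastive divergence (CD-1)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_v = np.zeros(n_visible)  # visible biases
    b_h = np.zeros(n_hidden)   # hidden biases
    for _ in range(epochs):
        v0 = data
        p_h0 = sigmoid(v0 @ W + b_h)                    # hidden probabilities
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)  # sampled hidden states
        p_v1 = sigmoid(h0 @ W.T + b_v)                  # one-step reconstruction
        p_h1 = sigmoid(p_v1 @ W + b_h)
        # CD-1 approximation to the log-likelihood gradient
        W += lr * ((v0.T @ p_h0) - (p_v1.T @ p_h1)) / len(data)
        b_v += lr * (v0 - p_v1).mean(axis=0)
        b_h += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_h

def pretrain_stack(data, layer_sizes):
    """Greedy layer-wise pre-training: each RBM's hidden activations
    become the training data for the next RBM in the stack."""
    weights = []
    x = data
    for n_hidden in layer_sizes:
        W, b_h = train_rbm(x, n_hidden)
        weights.append((W, b_h))
        x = sigmoid(x @ W + b_h)  # deterministic up-pass to the next layer
    return weights

# Toy unsupervised data: 200 binary vectors of dimension 20.
data = (rng.random((200, 20)) < 0.3).astype(float)
stack = pretrain_stack(data, [16, 8])
# The weights in `stack` would initialize the hidden layers of a
# 20-16-8-... DNN prior to supervised fine-tuning on labeled data.
```

In the full recipe, an output softmax layer (over phone states or senones) is added on top with randomly initialized weights before fine-tuning.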
After [34] showed that deep autoencoders could be trained effectively using deep belief net pre-training, there was a resurgence of interest in using deeper neural networks for applications. Although less pathological deep architectures than deep autoencoders can in

¹ In the context of ASR, we use the term "unsupervised" to mean acoustic data with no transcriptions of any kind.