Context-Dependent Deep Neural Networks for Commercial Mandarin Speech Recognition Applications

Jianwei Niu*†, Lei Xie*, Lei Jia and Na Hu
*Shaanxi Provincial Key Laboratory of Speech and Image Information Processing, School of Computer Science, Northwestern Polytechnical University, Xi'an, China
†Baidu Inc., Beijing, China
Emails: {niujianwei, jialei, huna}@baidu.com, lxie@nwpu.edu.cn

Abstract—Recently, context-dependent deep neural network hidden Markov models (CD-DNN-HMMs) have been successfully used in several commercial large-vocabulary English speech recognition systems. It has been shown that CD-DNN-HMMs significantly outperform conventional context-dependent Gaussian mixture model (GMM)-HMMs (CD-GMM-HMMs). In this paper, we report our latest progress on CD-DNN-HMMs for commercial Mandarin speech recognition applications at Baidu. Experiments demonstrate that CD-DNN-HMMs achieve a relative 26% word error reduction and a relative 16% sentence error reduction in Baidu's short message (SMS) voice input and voice search applications, respectively, compared with state-of-the-art CD-GMM-HMMs trained using fMPE. To the best of our knowledge, this is the first time the performance of CD-DNN-HMMs has been reported for commercial Mandarin speech recognition applications. We also propose a GPU on-chip speed-up training approach that achieves a speed-up ratio of nearly two for DNN training.

I. INTRODUCTION

Mainstream traditional automatic speech recognition (ASR) systems typically use hidden Markov models (HMMs) to model the evolution of speech units (e.g., phonemes) and Gaussian mixture models (GMMs) to represent the relationship between acoustic inputs and speech units. Speech co-articulation is modeled by context-dependent (CD) units, such as triphones. This is the well-known generative CD-GMM-HMM architecture in the literature.
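In the CD-GMM-HMM architecture described above, each HMM state owns a GMM that scores how well an acoustic feature vector matches that state. The sketch below illustrates this emission density for a diagonal-covariance GMM; it is not code from the paper, and all numbers are illustrative.

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance GMM.

    In a CD-GMM-HMM, each context-dependent HMM state owns one such
    mixture. This is a minimal illustrative sketch, not the paper's code.
    """
    log_probs = []
    for w, mu, var in zip(weights, means, variances):
        # Log of one weighted diagonal Gaussian: sum over feature dimensions.
        ll = math.log(w)
        for xd, md, vd in zip(x, mu, var):
            ll += -0.5 * (math.log(2 * math.pi * vd) + (xd - md) ** 2 / vd)
        log_probs.append(ll)
    # Log-sum-exp over mixture components for numerical stability.
    m = max(log_probs)
    return m + math.log(sum(math.exp(p - m) for p in log_probs))

# A single-component "mixture" reduces to one Gaussian.
ll = gmm_log_likelihood([0.0], [1.0], [[0.0]], [[1.0]])
```

During recognition, the decoder combines these per-state emission scores with HMM transition probabilities to search for the best state sequence.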
The expectation-maximization (EM) algorithm is usually used for HMM training, while further recognition accuracy improvement can be achieved using discriminative training algorithms such as MMI, MCE and MPE [1-3]. About two decades ago, artificial neural networks (ANNs), a kind of discriminative model, were also investigated for speech recognition [4-6] with some limited success. In a typical ANN approach, instead of using GMMs, an ANN with a single layer of nonlinear hidden units is used to predict HMM states from acoustic observations. However, due to the limitations of computation power and learning algorithms at the time, such a single-hidden-layer ANN approach was not sufficiently powerful to seriously challenge GMMs. As a result, the main practical contribution of ANNs was to provide useful features, namely tandem or bottleneck features [7], in which the posterior probability of each phone was estimated using an ANN.

With the rapid development of machine learning theory and computer hardware in recent years, it has become feasible to train a much deeper ANN that contains many layers of nonlinear hidden units and a very large output layer. In [8], a new context-dependent deep neural network hidden Markov model (CD-DNN-HMM) was proposed for speech recognition, achieving a significant performance improvement over the traditional CD-GMM-HMM. Unlike previous ANN work in the ASR area, the posterior probability of a context-dependent triphone state given the acoustic input is estimated directly by a DNN with many layers of hidden units. It has been shown that CD-DNN-HMMs can achieve a 33% relative word error reduction over discriminatively trained CD-GMM-HMMs on the Switchboard benchmark task [9].
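Because the DNN outputs state posteriors p(s|x) while the HMM decoder expects emission likelihoods p(x|s), hybrid systems of this kind conventionally divide each posterior by the state prior p(s) before decoding. The snippet below sketches that standard conversion; the prior and posterior values are made up for illustration, and in practice the priors would be state frequencies counted from force-aligned training data.

```python
import math

def posterior_to_scaled_loglik(posteriors, priors, floor=1e-10):
    """Convert DNN state posteriors p(s|x) into scaled log-likelihoods.

    By Bayes' rule, p(x|s) is proportional to p(s|x) / p(s), so the
    decoder can use log p(s|x) - log p(s) as the emission score.
    Flooring guards against log(0) for states the DNN rules out.
    """
    return [math.log(max(p, floor)) - math.log(max(q, floor))
            for p, q in zip(posteriors, priors)]

# Three hypothetical triphone states: posteriors from the DNN softmax,
# priors from (imaginary) training-data state counts.
scores = posterior_to_scaled_loglik([0.7, 0.2, 0.1], [0.5, 0.3, 0.2])
```

The constant p(x) term is dropped since it is shared by all states at a given frame and does not affect the decoder's comparison between paths.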
Moreover, CD-DNN-HMMs have been successfully used in several commercial large-vocabulary English speech recognition applications, such as the Bing mobile voice search application [10][11], the Google voice input speech recognition task [12][11] and the YouTube speech recognition task [12]. Notably, the Google voice input recognition task used about 5870 hours of training data and achieved a 23% relative word error reduction compared to the best GMM-based system for that task [12][13].

In this paper, we report our latest progress on CD-DNN-HMMs for commercial Mandarin speech recognition applications at Baidu. First, we demonstrate that CD-DNN-HMMs can be effectively used in large-scale Mandarin speech recognition tasks, with accuracy improvements over CD-GMM-HMMs similar to those observed in English speech recognition tasks. To the best of our knowledge, this is the first time the performance of CD-DNN-HMMs has been reported for commercial Mandarin speech recognition applications. Second, a new efficient DNN training approach is proposed, in which multiple GPU cards in a single server are utilized in parallel. A speed-up ratio of 1.95 is achieved without loss of recognition accuracy.

II. TRAINING CD-DNN-HMMS

A. CD-DNN-HMM

A DNN is essentially a conventional multi-layer perceptron (MLP) with more than one layer of hidden units between the input layer and the output layer. In each hidden unit of a