IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 13, NO. 6, NOVEMBER 2005

Automatic Transcription of Conversational Telephone Speech

Thomas Hain, Member, IEEE, Philip C. Woodland, Member, IEEE, Gunnar Evermann, Student Member, IEEE, Mark J. F. Gales, Member, IEEE, Xunying Liu, Student Member, IEEE, Gareth L. Moore, Dan Povey, and Lan Wang, Student Member, IEEE

Abstract—This paper discusses the Cambridge University HTK (CU-HTK) system for the automatic transcription of conversational telephone speech. The most important techniques in front-end processing, acoustic modeling and model training, and language and pronunciation modeling are presented in detail. These include conversation-side-based cepstral normalization, vocal tract length normalization, heteroscedastic linear discriminant analysis for feature projection, minimum phone error training and speaker adaptive training, lattice-based model adaptation, confusion-network-based decoding and confidence score estimation, pronunciation selection, language model interpolation, and class-based language models. The transcription system developed for participation in the 2002 NIST Rich Transcription evaluations of English conversational telephone speech data is presented in detail. In this evaluation the CU-HTK system gave an overall word error rate of 23.9%, which was the best performance by a statistically significant margin. The derivation of faster systems with moderate performance degradation is discussed in the context of the 2002 CU-HTK 10×RT conversational speech transcription system.

Index Terms—Large-vocabulary conversational speech recognition, telephone speech recognition.

I. INTRODUCTION

The transcription of conversational telephone speech is one of the most challenging tasks for speech recognition technology. State-of-the-art systems still yield high word error rates, typically in the range of 20%–30%.
Work on this task has been aided by extensive data collection, namely the Switchboard-1 corpus [10]. Originally designed as a resource to train and evaluate speaker identification systems, the corpus now serves as the primary source of data for work on automatic transcription of conversational telephone speech in English.

The first reported assessment of word recognition performance on the Switchboard-1 corpus was presented in [9], with an absolute word error rate of around 78%.¹ In that experiment only a small portion of the Switchboard-1 corpus was used in training. Over the years the performance of systems on this task has gradually improved. Progress is assessed in the yearly "Hub5E" evaluations conducted by the U.S. National Institute of Standards and Technology (NIST). The Cambridge University HTK group first entered these evaluations in 1997, using speech recognition technology based on the Hidden Markov Model Toolkit (HTK) [37], and has participated in evaluations on this task ever since. This paper describes the CU-HTK system for participation in the 2002 NIST Rich Transcription (RT-02) evaluation.

Manuscript received December 9, 2003; revised August 9, 2004. This work was supported by GCHQ and by DARPA under Grant MDA972-02-1-0013. This paper does not necessarily reflect the position or the policy of the U.S. Government and no official endorsement should be inferred. The Associate Editor coordinating the review of this manuscript and approving it for publication was Dr. Geoffrey Zweig.

The authors are with the Department of Computer Science, University of Sheffield, Sheffield S1 4DP, U.K. (e-mail: t.hai@dcs.shef.ac.uk).

Digital Object Identifier 10.1109/TSA.2005.852999

¹The focus of the work was topic and speaker identification rather than word recognition.
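The word error rate figures quoted above are obtained, as in the NIST evaluations, by a minimum-edit-distance alignment of hypothesis words against a reference transcript. A minimal sketch of that computation (a plain Levenshtein alignment over word sequences; the example sentences are invented for illustration):

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / number of reference
    words, computed by dynamic-programming alignment of word sequences."""
    r, h = ref.split(), hyp.split()
    # d[i][j]: edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deleting i reference words
    for j in range(len(h) + 1):
        d[0][j] = j                      # inserting j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(r)

# one substitution ("the"->"a") and one deletion ("the"): 2 errors / 6 words
print(word_error_rate("the cat sat on the mat", "a cat sat on mat"))
```

Note that WER can exceed 100% when the hypothesis contains many insertions, since the denominator counts reference words only.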
We focus on two test conditions: the unlimited-compute transcription task, where the only design objective is the word error rate (WER), and the less-than-10-times-real-time (10×RT) transcription task, where the system processing time is not allowed to exceed 10 times the duration of the speech signal.

This paper is organized as follows. We first briefly review basic aspects of the HTK Large Vocabulary Recognition (LVR) system, followed by a detailed description of the data used in experiments. In Section IV we present the acoustic modeling techniques essential to our system and discuss particular data modeling aspects. Section V outlines the pronunciation modeling, followed in Section VI by a description of the language models used in our systems. In Section VII we discuss issues in decoding and system combination. The structure of the full transcription system is presented in Section VIII, including a detailed analysis of its performance on large development and evaluation test sets. This system served as the basis for the 10×RT system described in Section IX.

II. HTK LVR SYSTEMS

The HTK large vocabulary speech recognition systems are built using the Hidden Markov Model Toolkit [37] and are based on context-dependent, state-clustered HMM sets with Gaussian mixture output distributions. The same basic model training methodology is used for a variety of tasks. The acoustic data is normally represented by a stream of 39-dimensional feature vectors with a frame spacing of 10 ms, based on 12 Mel-frequency perceptual linear prediction (MF-PLP) coefficients [33] and the zeroth cepstral coefficient representing the signal energy. The first- and second-order derivatives of each coefficient are appended to form the full feature vector. Words are mapped into phoneme strings using dictionaries based on a modified and regularly updated version of the LIMSI 1993 WSJ pronunciation dictionary [8]. The dictionaries contain multiple pronunciations per word.
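The assembly of the 39-dimensional vector from the 13 static coefficients (12 MF-PLP plus c0) can be sketched as follows. The deltas are computed with the standard HTK-style regression formula; the regression window width of 2 frames and the edge padding by frame replication are assumed defaults here, not details stated in this paper:

```python
import numpy as np

def deltas(feats, window=2):
    """Regression-based delta coefficients over a +/-window frame span,
    d_t = sum_th th*(c_{t+th} - c_{t-th}) / (2 * sum_th th^2).
    Edges are handled by replicating the first/last frame."""
    T, _ = feats.shape
    denom = 2 * sum(th * th for th in range(1, window + 1))
    padded = np.vstack([feats[:1]] * window + [feats] + [feats[-1:]] * window)
    out = np.zeros_like(feats)
    for th in range(1, window + 1):
        out += th * (padded[window + th : window + th + T]
                     - padded[window - th : window - th + T])
    return out / denom

# 13 static coefficients per 10 ms frame (12 MF-PLP + c0 energy term)
statics = np.random.randn(100, 13)
d = deltas(statics)            # first-order derivatives
dd = deltas(d)                 # second-order derivatives
full = np.hstack([statics, d, dd])   # shape (100, 39)
```

Appending derivatives in this way gives the HMM observation vector some sensitivity to the local spectral trajectory despite the frame-independence assumption of the output distributions.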
Cross-word context-dependent phone models, with a context of ±1 phone in the case of triphones or ±2 for quinphones, are used as the acoustic models. In addition to models for speech,

1063-6676/$20.00 © 2005 IEEE
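The cross-word expansion above can be illustrated for the triphone case. The sketch below uses HTK's `l-c+r` label notation; the example phone sequence and the choice to leave utterance-initial/final phones without left/right context are illustrative assumptions, not details from the paper:

```python
def cross_word_triphones(phones):
    """Expand a flat phone sequence into cross-word triphone labels in
    HTK's l-c+r notation. Because contexts span word boundaries, only the
    utterance edges lack a left or right neighbor."""
    labels = []
    for i, center in enumerate(phones):
        left = phones[i - 1] if i > 0 else None
        right = phones[i + 1] if i < len(phones) - 1 else None
        label = center
        if left:
            label = f"{left}-{label}"
        if right:
            label = f"{label}+{right}"
        labels.append(label)
    return labels

# "the cat" as dh ax | k ae t: note that k takes ax as its left
# context across the word boundary.
print(cross_word_triphones(["dh", "ax", "k", "ae", "t"]))
# -> ['dh+ax', 'dh-ax+k', 'ax-k+ae', 'k-ae+t', 'ae-t']
```

Because the number of distinct triphones (and especially quinphones) far exceeds what the training data can support, the resulting models share parameters through the state clustering mentioned above.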