IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 13, NO. 6, NOVEMBER 2005 1173
Automatic Transcription of Conversational
Telephone Speech
Thomas Hain, Member, IEEE, Philip C. Woodland, Member, IEEE, Gunnar Evermann, Student Member, IEEE,
Mark J. F. Gales, Member, IEEE, Xunying Liu, Student Member, IEEE, Gareth L. Moore, Dan Povey, and
Lan Wang, Student Member, IEEE
Abstract—This paper discusses the Cambridge University
HTK (CU-HTK) system for the automatic transcription of conversational telephone speech. The most important techniques in front-end processing, acoustic modeling and model training, and language and pronunciation modeling are discussed in detail. These include conversation-side-based cepstral
normalization, vocal tract length normalization, heteroscedastic
linear discriminant analysis for feature projection, minimum
phone error training and speaker adaptive training, lattice-based
model adaptation, confusion network based decoding and confi-
dence score estimation, pronunciation selection, language model
interpolation, and class based language models.
The transcription system developed for participation in the 2002
NIST Rich Transcription evaluations of English conversational
telephone speech data is presented in detail. In this evaluation the
CU-HTK system gave an overall word error rate of 23.9%, which
was the best performance by a statistically significant margin.
Further details on the derivation of faster systems with moderate
performance degradation are discussed in the context of the 2002
CU-HTK 10×RT conversational speech transcription system.
Index Terms—Large-vocabulary conversational speech recogni-
tion, telephone speech recognition.
I. INTRODUCTION
The transcription of conversational telephone speech is
one of the most challenging tasks for speech recognition
technology. State-of-the-art systems still yield high word error
rates typically within a range of 20%–30%. Work on this
task has been aided by extensive data collection, namely the
Switchboard-1 corpus [10]. Originally designed as a resource
to train and evaluate speaker identification systems, the corpus
now serves as the primary source of data for work on automatic
transcription of conversational telephone speech in English.
The first reported assessment of word recognition perfor-
mance on the Switchboard-1 corpus was presented in [9] with
an absolute word error rate of around 78%.¹

Manuscript received December 9, 2003; revised August 9, 2004. This work was supported by GCHQ and by DARPA under Grant MDA972-02-1-0013. This paper does not necessarily reflect the position or the policy of the U.S. Government and no official endorsement should be inferred. The Associate Editor coordinating the review of this manuscript and approving it for publication was Dr. Geoffrey Zweig.
The authors are with the Department of Computer Science, University of Sheffield, Sheffield S1 4DP, U.K. (e-mail: t.hai@dcs.shef.ac.uk).
Digital Object Identifier 10.1109/TSA.2005.852999
¹The focus of the work was topic and speaker identification rather than word recognition.

In this experiment only a small portion of the Switchboard-1 corpus was used in training. Over the years the performance of systems on this task has gradually improved. Progress is assessed in the yearly
“Hub5E” evaluations conducted by the U.S. National Institute of Standards and Technology (NIST). The Cambridge University
HTK group first entered these evaluations in 1997 using speech
recognition technology based on the Hidden Markov Model
Toolkit (HTK) [37] and has participated in evaluations on this
task ever since. This paper describes the CU-HTK system for
participation in the 2002 NIST Rich Transcription (RT-02)
evaluation. We focus on two test conditions: the unlimited
compute transcription task where the only design objective is
the word error rate (WER); and the less than 10 times real-time
(10×RT) transcription task where the system processing time
is not allowed to exceed 10 times the duration of the speech
signal.
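Both test conditions are scored by word error rate, the word-level Levenshtein distance between hypothesis and reference normalized by the reference length. A minimal sketch of the computation (the function name and normalization by reference word count are standard conventions, not details taken from the NIST scoring tools):

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + deletions + insertions) divided by
    the number of reference words, via dynamic programming over words."""
    r, h = ref.split(), hyp.split()
    # d[i][j]: edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                           # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                           # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[len(r)][len(h)] / len(r)
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why the 20%–30% range quoted above is measured against the reference word count rather than the hypothesis length.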
This paper is organized as follows: Section II briefly reviews basic aspects of the HTK Large Vocabulary Recognition (LVR) system, and Section III describes the data used in the experiments. In Section IV we present the acoustic
modeling techniques essential to our system and discuss par-
ticular data modeling aspects. Section V outlines the pronunci-
ation modeling, followed in Section VI by a description of the
language models used in our systems. In Section VII we discuss
issues in decoding and system combination. The structure of the
full transcription system is presented in Section VIII, including
a detailed analysis of the performance on large development and
evaluation test sets. This system served as the basis for the 10×RT system described in Section IX.
II. HTK LVR SYSTEMS
The HTK large vocabulary speech recognition systems
are built using the Hidden Markov Model Toolkit [37] and
are based on context dependent state clustered HMM sets
with Gaussian mixture output distributions. The same basic
model training methodology is used for a variety of tasks.
The acoustic data is normally represented by a stream of
39 dimensional feature vectors with a frame spacing of 10
ms, based on 12 Mel-frequency perceptual linear prediction
(MF-PLP) coefficients [33] and the zeroth cepstral coefficient
representing the signal energy. The first and second order
derivatives of each coefficient are appended to form the full
feature vector. The words are mapped into phoneme strings
using dictionaries based on a modified and regularly updated
version of the LIMSI 1993 WSJ pronunciation dictionary [8].
The dictionaries contain multiple pronunciations per word.
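The front-end described above can be sketched as follows: 13 static coefficients per 10-ms frame (12 MF-PLP cepstra plus the zeroth, energy-like coefficient), with first- and second-order regression coefficients appended to give 39 dimensions. The ±2-frame regression window and edge padding below are illustrative assumptions, not parameters quoted in this paper:

```python
import numpy as np

def append_derivatives(static, w=2):
    """Append first- and second-order time derivatives ("deltas") to a
    (T, 13) array of static features, one row per 10 ms frame, producing
    (T, 39) feature vectors. Uses a least-squares regression over a
    window of +/-w frames; w=2 is an assumed, illustrative choice."""
    def deltas(x):
        T = x.shape[0]
        # repeat edge frames so the window is defined at utterance boundaries
        padded = np.pad(x, ((w, w), (0, 0)), mode="edge")
        num = np.zeros_like(x)
        for t in range(1, w + 1):
            # weighted difference of frames t steps ahead and t steps behind
            num += t * (padded[w + t:w + t + T] - padded[w - t:w - t + T])
        return num / (2 * sum(t * t for t in range(1, w + 1)))

    d1 = deltas(static)   # first-order derivatives
    d2 = deltas(d1)       # second-order derivatives
    return np.hstack([static, d1, d2])
```

With this scheme a constant signal yields zero derivatives, and each frame's full vector depends only on a small neighborhood of surrounding frames, which keeps the front-end compatible with streamed processing.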
Cross-word context-dependent phone models, with a context of ±1 in the case of triphones or ±2 for quinphones, are used as the acoustic models. In addition to models for speech,