Statistical Analysis of Minimum Classification
Error Learning for Gaussian and
Hidden Markov Model Classifiers
Mohamed Afify, Xinwei Li, and Hui Jiang, Member, IEEE
Abstract—Minimum classification error learning realized via generalized probabilistic descent, usually referred to as MCE/GPD, is a very popular and powerful framework for building classifiers. This paper first presents a theoretical analysis of MCE/GPD. The focus is on a simple classification problem: estimating the means of two Gaussian classes. For this simple setting, we derive difference equations for the class means and the decision threshold during learning, and develop closed-form expressions for the evolution of both the smoothed and the true error. In addition, we show that the decision threshold converges to its optimal value, and provide an estimate of the number of iterations needed to approach convergence. After convergence, the class means continue to drift apart without bound, without contributing to any further decrease of the classification error. This behavior, referred to as mean drift, is then related to an increase of the variance of the classifier. The theoretical results agree perfectly with simulations carried out for a two-class Gaussian classification problem. In addition to these theoretical results, we verify in speech recognition experiments that MCE/GPD learning of Gaussian mixture hidden Markov models qualitatively follows the pattern suggested by the theoretical analysis. We also discuss links between MCE/GPD learning and both batch gradient descent and extended Baum–Welch re-estimation, two approaches that are popular in large-scale implementations of discriminative training. Hence, the proposed analysis can serve, at least as a rough guideline, for a better understanding of the properties of discriminative training algorithms for speech recognition.
Index Terms—Convergence analysis, discriminative learning,
generalized probabilistic descent, hidden Markov models, minimum classification error, speech recognition.
I. INTRODUCTION
A general paradigm for the design of classifiers, based on the idea of minimizing the classification error, was proposed in [12] and [16]. In this framework, a smoothed estimate of the classification error is first formulated and then minimized, with respect to the parameters of interest, using gradient descent. This approach is thus often referred to as minimum classification error/generalized probabilistic descent (MCE/GPD).
Manuscript received October 29, 2006; revised May 27, 2007. The associate
editor coordinating the review of this manuscript and approving it for publica-
tion was Dr. Bill Byrne.
M. Afify is with the IBM T. J. Watson Research Center, Yorktown Heights,
NY 10598 USA.
X. Li is with Nuance, Inc., Burlington, MA 01803 USA (e-mail: xwli@cse.yorku.ca).
H. Jiang is with York University, Toronto, ON M3J 1P3, Canada.
Digital Object Identifier 10.1109/TASL.2007.903304
Since its introduction, this learning approach has found great success in many practical, and often large-scale, classification problems. Many of these applications have focused on speech recognition and other natural language processing tasks. Interesting reviews of the MCE/GPD framework, covering both theory and applications, can be found in [5] and [15].
Owing to the success of MCE/GPD in many practical classifier design problems, there have been attempts to theoretically analyze its performance; we mention here the works in [4], [17], and [28]. The relationship between some of these works and the current work will be highlighted in the paper. Beyond these works, the main theoretical justification for the use of the MCE/GPD paradigm is the generalized probabilistic descent (GPD) theorem (see, e.g., [12]). This theorem states that, under some regularity assumptions, the update equations decrease the expected value of the smoothed error function and converge to one of its local minima.
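For reference, the regularity assumptions of the GPD theorem include, among others, a decreasing step-size sequence of the classical stochastic-approximation type,
\[
\sum_{t=1}^{\infty}\varepsilon_t = \infty,
\qquad
\sum_{t=1}^{\infty}\varepsilon_t^{2} < \infty,
\]
satisfied, for example, by $\varepsilon_t = \varepsilon_0/t$. The analysis in this paper, in contrast, considers a small constant step size, for which the second condition does not hold.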
The goal of this paper is a more detailed study of the evolution of the classifier parameters and the objective function during learning.
In order to address these theoretical questions in more detail, we first focus on a simple learning scenario: the MCE/GPD algorithm is used to learn the means of a Gaussian classifier for a two-class problem. This setting leads to a relatively simple learning algorithm that is amenable to detailed theoretical study; a simulation sketch of this setting is given after the list below. Our main theoretical contributions are as follows.
• Detailed difference equations for the evolution of the class
means and decision threshold during learning. These equa-
tions are used to prove that the threshold converges to its
optimal value for a sufficiently small constant step size, and
to obtain an estimate of the number of iterations needed to
approach the optimal value. This convergence result is con-
trasted with GPD convergence [12] in the paper.
• Expressions for the smoothed and the true error. Using these expressions, it is shown that the true error converges to its optimal value and that additional iterations after convergence only reduce the smoothed error, increasing the distance between the class means without reducing the true error. This is referred to as mean drift in the paper.
• An expression for the classifier variance during learning.
This expression is used to establish that further iterations
after threshold convergence will increase the classifier variance due to the mean drift. This is clearly a negative effect
that needs to be avoided in practice.
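As a concrete illustration of this setting, the following is a minimal simulation sketch of online MCE/GPD learning of the means of two one-dimensional Gaussian classes with a known common variance; all parameter values are assumptions chosen for illustration, not the configuration of the experiments reported later. Under these assumptions, one should observe the estimated decision threshold (m1 + m2)/2 settling near the optimal boundary while the separation between the estimated means keeps creeping upward, i.e., the mean drift.

import numpy as np

# Minimal sketch: online MCE/GPD on two 1-D Gaussian classes.
# All constants below are illustrative assumptions.
rng = np.random.default_rng(0)

mu = np.array([-1.0, 1.0])     # true class means; optimal threshold is 0.0
sigma = 1.0                    # common, known standard deviation
m = np.array([0.0, 2.0])       # initial mean estimates (threshold starts at 1.0)
eps, gamma = 0.05, 2.0         # constant step size and sigmoid slope

for t in range(100000):
    i = t % 2                               # alternate the true class label
    j = 1 - i
    x = rng.normal(mu[i], sigma)            # draw one training sample from class i
    g = -(x - m) ** 2 / (2 * sigma**2)      # Gaussian log-likelihood discriminants
    d = g[j] - g[i]                         # misclassification measure (>0 if wrong)
    ell = 1.0 / (1.0 + np.exp(-gamma * d))  # smoothed (sigmoid) zero-one loss
    s = eps * gamma * ell * (1.0 - ell)     # common factor of the gradient step
    m[i] += s * (x - m[i]) / sigma**2       # pull the correct mean toward x
    m[j] -= s * (x - m[j]) / sigma**2       # push the competing mean away from x

print(f"threshold  {m.mean():+.3f}  (optimal {mu.mean():+.3f})")
print(f"separation {abs(m[1] - m[0]):.3f}  (true separation {mu[1] - mu[0]:.3f})")

Lengthening the run mainly increases the separation rather than the accuracy, which is precisely the mean-drift effect the analysis quantifies.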
The proposed statistical analysis of the algorithm is based on
a framework for the analysis of adaptive algorithms with nonlinearities, which was initially proposed in [3] and since then has