IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 8, NOVEMBER 2007

Statistical Analysis of Minimum Classification Error Learning for Gaussian and Hidden Markov Model Classifiers

Mohamed Afify, Xinwei Li, and Hui Jiang, Member, IEEE

Abstract—Minimum classification error learning realized via generalized probabilistic descent, usually referred to as MCE/GPD, is a very popular and powerful framework for building classifiers. This paper first presents a theoretical analysis of MCE/GPD, focusing on a simple classification problem: estimating the means of two Gaussian classes. For this simple algorithm, we derive difference equations for the class means and decision threshold during learning, and develop closed-form expressions for the evolution of both the smoothed and true error. In addition, we show that the decision threshold converges to its optimal value, and provide an estimate of the number of iterations needed to approach convergence. After convergence, the class means continue to drift apart, increasing their distance without bound while contributing nothing further to the decrease of the classification error. This behavior, referred to as mean drift, is then related to the increase of the classifier variance. The theoretical results agree perfectly with simulations carried out for a two-class Gaussian classification problem. Beyond these theoretical results, we verify experimentally, in speech recognition experiments, that MCE/GPD learning of Gaussian mixture hidden Markov models qualitatively follows the pattern suggested by the theoretical analysis. We also discuss links between MCE/GPD learning and both batch gradient descent and extended Baum–Welch re-estimation, two approaches that are popular in large-scale implementations of discriminative training. Hence, the proposed analysis can serve, at least as a rough guideline, for better understanding the properties of discriminative training algorithms for speech recognition.

Index Terms—Convergence analysis, discriminative learning, generalized probabilistic descent, hidden Markov models, minimum classification error, speech recognition.

Manuscript received October 29, 2006; revised May 27, 2007. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Bill Byrne. M. Afify is with the IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 USA. X. Li is with Nuance, Inc., Burlington, MA 01803 USA (e-mail: xwli@cse.yorku.ca). H. Jiang is with York University, Toronto, ON M3J 1P3, Canada. Digital Object Identifier 10.1109/TASL.2007.903304

I. INTRODUCTION

A general paradigm for the design of classifiers, based on the idea of minimizing the classification error, was proposed in [12] and [16]. In this framework, a smoothed estimate of the classification error is first formulated and then minimized, with respect to the parameters of interest, using gradient descent. This approach is therefore often referred to as minimum classification error/generalized probabilistic descent (MCE/GPD). Since its introduction, this learning approach has found great success in many practical, and often large-scale, classification problems. A large part of these applications has focused on speech recognition and other natural language processing tasks. Interesting reviews of the MCE/GPD framework, covering both theory and applications, can be found in [5] and [15].
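To make the learning rule concrete, the following is a minimal sketch of one MCE/GPD pass for the two-class Gaussian setting studied in this paper, assuming scalar observations, known equal variances, a sigmoid smoothing of the 0/1 loss with slope gamma, and a small constant step size eps. The names (mce_gpd_pass, mu, eps, gamma) and the specific parameterization are illustrative choices for this sketch, not the paper's notation.

```python
import numpy as np

def mce_gpd_pass(xs, labels, mu, eps=0.01, gamma=1.0):
    """One sequential (sample-by-sample) GPD pass over the data.

    xs     : 1-D array of scalar observations
    labels : array of class indices in {0, 1}
    mu     : length-2 array holding the current class means
    """
    mu = np.asarray(mu, dtype=float).copy()
    for x, i in zip(xs, labels):
        j = 1 - i  # the single competing class
        # Misclassification measure: positive when x lies closer to the
        # wrong mean, negative when x is correctly classified.
        d = (x - mu[i]) ** 2 - (x - mu[j]) ** 2
        # Sigmoid-smoothed 0/1 loss; the tanh form avoids overflow.
        ell = 0.5 * (1.0 + np.tanh(0.5 * gamma * d))
        dl_dd = gamma * ell * (1.0 - ell)  # derivative of the loss w.r.t. d
        # Chain rule: gradient-descent step on both class means.
        mu[i] += eps * dl_dd * 2.0 * (x - mu[i])
        mu[j] -= eps * dl_dd * 2.0 * (x - mu[j])
    return mu
```

Each update pulls the correct-class mean toward the sample and pushes the competing mean away, weighted by gamma * ell * (1 - ell), which is largest for samples near the current decision boundary and vanishes for samples classified with a large margin.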
Due to the success of MCE/GPD in many practical classifier design problems, there have been attempts to theoretically analyze its performance; we mention here the works in [4], [17], and [28]. The relationship between some of these works and the current work will be highlighted in the paper. Beyond these works, the main theoretical justification for the MCE/GPD paradigm is the generalized probabilistic descent (GPD) theorem (see, e.g., [12]). This theorem states that, under some regularity assumptions, the update equations lead to a decrease of the expected value of the smoothed error function and converge to one of its local minima. The goal of this paper is a more detailed study of the evolution of the classifier parameters and the objective function during learning.

In order to address these theoretical questions in more detail, we first focus on a simple learning scenario: the MCE/GPD algorithm is used to learn the means of a Gaussian classifier for a two-class problem. This setting leads to a relatively simple learning algorithm that is amenable to detailed theoretical study. Our main theoretical contributions are as follows.

• Detailed difference equations for the evolution of the class means and decision threshold during learning. These equations are used to prove that the threshold converges to its optimal value for a sufficiently small constant step size, and to obtain an estimate of the number of iterations needed to approach the optimal value. This convergence result is contrasted with GPD convergence [12] in the paper.

• Expressions for the smoothed and true error. Using these expressions, it is shown that the true error converges to its optimal value, and that additional iterations after convergence only reduce the smoothed error and increase the distance between the class means without reducing the true error. This is referred to as mean drift in the paper; a small simulation illustrating this effect is sketched below.

• An expression for the classifier variance during learning. This expression is used to establish that further iterations after threshold convergence increase the classifier variance due to the mean drift. This is clearly a negative effect that needs to be avoided in practice.

The proposed statistical analysis of the algorithm is based on a framework for the analysis of adaptive algorithms with nonlinearities, which was initially proposed in [3] and has since been widely used.
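The mean-drift effect can be reproduced with a small simulation built on the mce_gpd_pass sketch given earlier (which also supplies the numpy import). The class configuration, initialization, step size, and smoothing slope below are arbitrary illustrative choices, not values from the paper.

```python
rng = np.random.default_rng(0)

# Two scalar Gaussian classes with equal unit variance (illustrative choice).
n = 2000
xs = np.concatenate([rng.normal(-1.0, 1.0, n), rng.normal(2.0, 1.0, n)])
labels = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])
perm = rng.permutation(2 * n)
xs, labels = xs[perm], labels[perm]

mu = np.array([0.0, 3.0])  # deliberately biased initial means
for epoch in range(51):
    mu = mce_gpd_pass(xs, labels, mu, eps=0.05, gamma=1.0)
    if epoch % 10 == 0:
        threshold = mu.mean()            # decision boundary for equal variances
        separation = abs(mu[1] - mu[0])  # inter-class mean distance
        print(f"epoch {epoch:2d}  threshold {threshold:+.3f}  "
              f"separation {separation:.3f}")
```

In runs of this kind one would expect the behavior the analysis predicts: the printed threshold (the midpoint of the two means, which is the decision boundary for equal variances and priors) settles near the midpoint of the true class means within relatively few epochs, while the separation keeps growing slowly afterwards. At that point the true error has stopped improving even though the smoothed error keeps decreasing.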