1072 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, VOL. SMC-17, NO. 6, NOVEMBER/DECEMBER 1987

Learning with Mislabeled Training Samples Using Stochastic Approximation

AMITA PATHAK-PAL AND SANKAR K. PAL, SENIOR MEMBER, IEEE

Abstract—For the problem of parameter learning in pattern recognition, the convergence of stochastic approximation-based learning algorithms has been investigated for the situation in which mislabeled training samples are present. In the cases considered, it is found that the estimates converge to nontrue values in the presence of labeling errors. The general m-class N-feature pattern recognition problem is considered. A possible solution to the problem is also discussed. Some simulation results are provided to support the conclusions drawn.

I. INTRODUCTION

The learning of unknown parameters of classifiers is an indispensable part of pattern recognition problems. If a sufficiently large set of correctly labeled training samples is available, then "reasonably good" estimates of the parameters can generally be obtained. In many real-life situations, however, it is either difficult or expensive to obtain labels, so that mislabeling of training samples can become one of the specters with which a pattern recognition scientist has to contend. It is, therefore, useful to know how this problem can affect the learning procedure.

A reasonable amount of work has been done for the two-class classification problem. The effects of random training errors on Fisher's discriminant function have been studied by Lachenbruch [1], [2], McLachlan [3], Michalek and Tripathi [4], O'Neill [5], Krishnan [6], and Katre and Krishnan [7].
They concluded that the effect is to underestimate distance, overestimate error rate, introduce bias into estimates of the discriminant function, make the maximum likelihood estimates of the discriminant function converge to nontrue values, and change the asymptotic relative efficiency (ARE) relative to a completely correctly classified sample of the same size.

In the context of recursive learning of parameters, the usefulness of stochastic approximation procedures cannot be overemphasized [8].^1 Briefly, a stochastic approximation procedure for recursively estimating a parameter \theta by \theta_n (at the nth stage) with the help of an unbiased statistic T is

\theta_{n+1} = \theta_n - a_n (\theta_n - T_{n+1})

where \theta_1 is either a constant or \theta_1 = T_1, and \{a_n\} is a suitably chosen sequence of positive numbers. For instance, a recursive procedure for estimating the population mean \mu of a variable X utilizing the sample mean \bar{x}_n is

\bar{x}_{n+1} = \bar{x}_n - \frac{1}{n+1} (\bar{x}_n - X_{n+1}),

X_{n+1} being the (n+1)th observation on X.

In this correspondence, the particular case in which errors occur in the labeling of training samples is studied for an m-class N-feature pattern recognition problem. The effect of mislabeling is to cause "wrong" samples to be used in the recursive learning of the estimates, for any given class.

Manuscript received July 12, 1986; revised July 15, 1987.
A. Pathak-Pal is with the Electronics and Communications Sciences Unit, Indian Statistical Institute, 203 B.T. Road, Calcutta 700035, India.
S. K. Pal is at present with the Centre for Automation Research, University of Maryland, College Park, MD 20742, on leave from the Indian Statistical Institute, Calcutta 700035, India.
IEEE Log Number 8716929.
^1 For instance, there are a number of works [9]–[13] by Fu and others in which stochastic approximation techniques, as applied to learning in pattern recognition systems, are discussed. (It may be added, however, that these are not related to the present investigation.)
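The recursive sample-mean procedure above can be illustrated with a short sketch. This is not part of the original correspondence; the Gaussian data stream, its parameters, and the seed are assumptions chosen for the example.

```python
import random

def recursive_mean(observations):
    """Stochastic approximation estimate of a population mean:
    x_{n+1} = x_n - (1/(n+1)) * (x_n - X_{n+1}), with x_1 = X_1."""
    x = observations[0]                  # theta_1 = T_1 (first observation)
    for n, x_next in enumerate(observations[1:], start=1):
        a_n = 1.0 / (n + 1)              # gain sequence a_n = 1/(n+1)
        x = x - a_n * (x - x_next)
    return x

# With a_n = 1/(n+1) this recursion reproduces the ordinary sample mean,
# so with correctly labeled data it converges to the true mean.
random.seed(0)
stream = [random.gauss(5.0, 2.0) for _ in range(10000)]
estimate = recursive_mean(stream)        # close to the true mean 5.0
```

Unrolling the recursion shows why: each step replaces the running value by the average of all observations seen so far, which is exactly the sample mean used as the unbiased statistic T.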
A simple but realistic model [14] is adopted to describe this sort of situation. Under this model, the authors have investigated the convergence of recursive learning procedures of the type mentioned above. It is found that, under certain conditions, these estimates do converge strongly, that is, with probability one, but to nontrue values; more specifically, to convex linear combinations of the true parameters of all m classes. This conclusion is reached using some results on multidimensional stochastic approximation [15].

This result, in itself, is not surprising, because the presence of mislabeled samples in the training set is sure to affect the behavior of the training process in some way. This work merely provides a mathematical description of the effect on its convergence.

As this work would seem incomplete without a solution to the problem considered, we have also discussed in Section V a possible way of countering the effect of the presence of mislabeled samples in the training set. The solution consists of modifying the stochastic approximation procedure in such a fashion that it becomes restrictive, that is, it does not allow all training samples to be used for updating. At any given step in the training process, a sample is used for updating only if it is closer to the preceding estimate of the mean value than some specified threshold; otherwise, it is excluded from the training set. Some results on the asymptotic behavior of such algorithms are stated. It is found that under certain conditions these algorithms are indeed better than the ones considered earlier. Some simulation results are provided to illustrate the conclusions arrived at in this work.

II. STATEMENT OF THE PROBLEM

Let us consider a general m-class (C_i, i = 1,···,m) pattern recognition problem for which an N-dimensional feature vector X has been specified.
Let us assume that

A1) the distribution of X in each class is continuous;
A2) the probability densities p(·|C_i) of X for the classes C_i, i = 1,···,m, are of the same family, and they differ only in the values of their parameters;
A3) an unbiased statistic exists for the q-dimensional parameter vector \varphi_{q \times 1} with respect to the probability density function p.

Let us suppose that for the purpose of learning we have been given a set of independent samples X_1^{(k)}, X_2^{(k)},···, X_n^{(k)}, k = 1,···,m, where the superscript (k) denotes the label given to the respective samples. For the learning itself, let us utilize a stochastic approximation algorithm as defined below. Let \varphi_i^{(k)} denote the estimate obtained at the ith step for the class C_k. Then

\varphi_1^{(k)} = f(X_1^{(k)})    (1a)

and for i \geq 1,

\varphi_{i+1}^{(k)} = \varphi_i^{(k)} - a_i (\varphi_i^{(k)} - f(X_{i+1}^{(k)})),    k = 1,···,m    (1b)

where \{a_i\} is a sequence of positive real numbers such that a_i < 1 for all i, and f: R^N \to R^q is an unbiased statistic for \varphi. This algorithm is a generalization of the usual stochastic approximation procedures used for recursive parameter estimation.

III. A MODEL FOR LABELING ERRORS

The model to be used for this purpose was developed by Chittineni [14]. It can be specified as follows. Let w and \hat{w} denote, respectively, the true and the given labels. Clearly, w, \hat{w} \in \{1, 2,···, m\}.

0018-9472/87/1100-1072$01.00 © 1987 IEEE
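As a numerical sketch of how mislabeling shifts algorithm (1a)-(1b), the fragment below runs the recursion for a single class in a hypothetical two-class, one-feature case with f(X) = X (unbiased for the mean), and also tries the restrictive variant previewed above, in which a sample updates the estimate only if it lies close to the preceding estimate. The class means, the mislabeling rate beta, the rejection threshold, and the seed are all assumptions made for illustration, not values from the correspondence.

```python
import random

def sa_estimate(samples):
    """Algorithm (1a)-(1b) with f(X) = X and gains a_i = 1/(i+1)."""
    phi = samples[0]                              # (1a): phi_1 = f(X_1)
    for i, x in enumerate(samples[1:], start=1):
        phi = phi - (1.0 / (i + 1)) * (phi - x)   # (1b), with a_i < 1 for all i
    return phi

def sa_estimate_restrictive(samples, threshold):
    """Restrictive variant: a sample is used for updating only if it is
    within `threshold` of the preceding estimate; otherwise it is skipped."""
    phi = samples[0]
    n_used = 1
    for x in samples[1:]:
        if abs(x - phi) <= threshold:
            phi = phi - (1.0 / (n_used + 1)) * (phi - x)
            n_used += 1
    return phi

random.seed(0)
mu = {1: 0.0, 2: 10.0}   # true class means (assumed for the example)
beta = 0.2               # chance a class-2 sample is labeled class 1 (assumed)
# Stream of samples *labeled* class 1, a fraction beta of which are mislabeled.
labeled_1 = [random.gauss(mu[2] if random.random() < beta else mu[1], 1.0)
             for _ in range(20000)]

plain = sa_estimate(labeled_1)
robust = sa_estimate_restrictive(labeled_1, threshold=3.0)
```

Here `plain` drifts toward the convex combination (1 - beta)·mu_1 + beta·mu_2 = 2.0 rather than the true mean mu_1 = 0.0, illustrating the convergence-to-nontrue-values result, while `robust` stays near mu_1 because the distant mislabeled samples are rejected.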