1072 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, VOL. SMC-17, NO. 6, NOVEMBER/DECEMBER 1987
Learning with Mislabeled Training Samples Using
Stochastic Approximation
AMITA PATHAK-PAL AND SANKAR K. PAL,
SENIOR MEMBER, IEEE
Abstract —For the problem of parameter learning in pattern recognition,
the convergence of stochastic approximation-based learning algorithms
has been investigated for the situation in which mislabeled training
samples are present. In the cases considered, it is found that estimates
converge to nontrue values in the presence of labeling errors. The general
m-class N-feature pattern recognition problem is considered. A possible
solution to the problem is also discussed. Some simulation results are
provided to support the conclusions drawn.
I. INTRODUCTION
The learning of unknown parameters of classifiers is an indispensable
part of pattern recognition problems. If a sufficiently
large set of correctly labeled training samples is available, then
" reasonably good" estimates of the parameters can generally be
obtained. In many real-life situations, however, it is either diffi-
cult or expensive to obtain labels, so that mislabeling of training
samples can become one of the specters with which a pattern
recognition scientist has to contend. It is, therefore, useful to
know how this problem can affect the learning procedure. A
reasonable amount of work has been done for the two-class
classification problem. The effects of random training errors on
Fisher's discriminant function have been studied by Lachenbruch
[1], [2], McLachlan [3], Michalek and Tripathi [4], O'Neill [5],
Krishnan [6], and Katre and Krishnan [7]. They concluded that
the effect is to underestimate distance, overestimate error rate,
introduce bias into estimates of the discriminant function, make
the maximum likelihood estimates of the discriminant function
converge to nontrue values, and change the asymptotic relative
efficiency (ARE) relative to a completely correctly classified
sample of the same size.
In the context of recursive learning of parameters, the useful-
ness of stochastic approximation procedures cannot be overem-
phasized [8].¹ Briefly, a stochastic approximation procedure for
recursively estimating a parameter θ by θ̂_n (at the nth stage) with
the help of an unbiased statistic T is

θ̂_{n+1} = θ̂_n − a_n(θ̂_n − T_{n+1}),

where θ̂_1 is either a constant or θ̂_1 = T_1, and {a_n} is a suitably
chosen sequence of positive numbers. For instance, a recursive
procedure for estimating the population mean μ of a variable X
utilizing the sample mean x̄_n is

x̄_{n+1} = x̄_n − (1/(n+1))(x̄_n − X_{n+1}),

X_{n+1} being the (n+1)th observation on X.
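As a concrete illustration, the recursive mean update above can be sketched in Python; the distribution and sample size here are illustrative assumptions, not taken from the paper:

```python
import random

def recursive_mean(samples):
    """Recursive estimate of the population mean via the update
    xbar_{n+1} = xbar_n - (1/(n+1)) * (xbar_n - X_{n+1})."""
    xbar = samples[0]  # initial estimate: the first observation
    for n, x in enumerate(samples[1:], start=1):
        xbar = xbar - (1.0 / (n + 1)) * (xbar - x)
    return xbar

random.seed(0)
data = [random.gauss(5.0, 2.0) for _ in range(10000)]
print(recursive_mean(data))  # close to the true mean 5.0
```

With the gain sequence a_n = 1/(n+1), the recursion reproduces the ordinary running sample mean exactly, which is why it converges to the population mean when all samples are correctly drawn.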
In this correspondence, the particular case in which errors
occur in the labeling of training samples is studied for an m-class
N-feature pattern recognition problem. The effect of mislabeling
is to cause "wrong" samples to be used in the recursive learning
Manuscript received July 12, 1986; revised July 15, 1987.
A. Pathak-Pal is with the Electronics and Communications Sciences Unit,
Indian Statistical Institute, 203 B.T. Road, Calcutta 700035, India.
At present, S. K. Pal is with the Center for Automation Research, University
of Maryland, College Park, MD 20742, on leave from the Indian Statistical
Institute, Calcutta 700035, India.
IEEE Log Number 8716929.
¹For instance, there are a number of works [9]–[13] by Fu and others in which
stochastic approximation techniques, as applied to learning in pattern recogni-
tion systems, are discussed. (It may be added, however, that these are not
related to the present investigation.)
of the estimates, for any given class. A simple but realistic model
[14] is adopted to describe this sort of situation. Under this
model, the authors have investigated the convergence of recursive
learning procedures of the type mentioned above. It is found
that, under certain conditions, these estimates do converge
strongly, that is, with probability one, but to nontrue values,
more specifically, to convex linear combinations of true parame-
ters of all m classes. This conclusion is reached using some
results on multidimensional stochastic approximation [15].
This result, in itself, is not surprising, because the presence of
mislabeled samples in the training set is sure to affect the
behavior of the training process in some way. This work merely
provides a mathematical description of the effect on its conver-
gence.
As this work will seem incomplete without a solution to the
problem considered, we have also discussed in Section V a
possible way of countering the effect of the presence of misla-
beled samples in the training set. The solution consists of
modifying the stochastic approximation procedure in such a
fashion that it becomes restrictive, that is, it does not allow all
training samples to be used for updating. At any given step in the
training process, a sample is used for updating only if it is closer
to the preceding estimate of the mean value than some specified
threshold. Otherwise, it is excluded from the training set. Some
results on the asymptotic behavior of such algorithms are stated.
It is found that under certain conditions these algorithms are
indeed better than the ones considered earlier. Some simulation
results are provided to illustrate the conclusions arrived at in this
work.
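As an illustrative sketch of this restrictive idea (not the exact algorithm of Section V: the one-dimensional setting, distance rule, threshold value, and gain sequence are all assumptions), consider a mean estimator that excludes any sample farther than a fixed threshold from the preceding estimate:

```python
import random

def restricted_sa_mean(samples, threshold):
    """A sample updates the running estimate only if it lies within
    `threshold` of the preceding estimate; otherwise it is excluded
    from the training set."""
    est = samples[0]
    n_used = 1
    for x in samples[1:]:
        if abs(x - est) <= threshold:
            n_used += 1
            est = est - (1.0 / n_used) * (est - x)
    return est

random.seed(2)
# True class: mean 0; every tenth sample is mislabeled (mean 5).
mixed = [random.gauss(5.0, 1.0) if i % 10 == 9 else random.gauss(0.0, 1.0)
         for i in range(10000)]
naive = sum(mixed) / len(mixed)  # drifts toward 0.9*0 + 0.1*5 = 0.5
print(naive, restricted_sa_mean(mixed, 2.0))
```

The unrestricted average converges to a convex combination of the two class means, while the thresholded estimate stays near the true mean 0, illustrating the benefit claimed for the restrictive algorithms.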
II. STATEMENT OF THE PROBLEM
Let us consider a general m-class (C_i, i = 1, ···, m) pattern
recognition problem for which an N-dimensional feature vector X
has been specified. Let us assume that
Al) the distribution of X in each class is continuous;
A2) the probability densities p(·|C_i) of X for the classes C_i,
i = 1, ···, m, are of the same family, and they differ only
in respect of the values of their parameters;
A3) an unbiased statistic exists for the q-dimensional
parameter vector φ, with respect to the probability
density function p.
Let us suppose that for the purpose of learning we have been
given a set of independent samples X_1^{(k)}, X_2^{(k)}, ···, k =
1, ···, m, where the superscript (k) denotes the labels given to the
respective samples. For the learning itself, let us utilize a stochastic
approximation algorithm as defined below.
Let φ̂_i^{(k)} denote the estimate obtained at the ith step for the
class C_k. Then

φ̂_1^{(k)} = f(X_1^{(k)})    (1a)

and for i > 1,

φ̂_i^{(k)} = φ̂_{i-1}^{(k)} − a_i(φ̂_{i-1}^{(k)} − f(X_i^{(k)})),  k = 1, ···, m    (1b)

where {a_i} is a sequence of positive real numbers such that
a_i < 1 for all i, and f: R^N → R^q is an unbiased statistic for φ. This
algorithm is a generalization of the usual stochastic appro-
ximation procedures used for recursive parameter estimation.
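A minimal Python sketch of algorithm (1), taking f as the identity (an unbiased statistic for the class-mean parameter) and a_i = 1/i; the two-class, two-feature data below are illustrative assumptions:

```python
import random

def sa_learn(samples_by_class, a=lambda i: 1.0 / i):
    """Algorithm (1) with f = identity.
    (1a): phi_1^(k) = f(X_1^(k))
    (1b): phi_i^(k) = phi_{i-1}^(k) - a_i (phi_{i-1}^(k) - f(X_i^(k)))."""
    estimates = {}
    for k, samples in samples_by_class.items():
        phi = list(samples[0])                                    # (1a)
        for i, x in enumerate(samples[1:], start=2):
            phi = [p - a(i) * (p - fx) for p, fx in zip(phi, x)]  # (1b)
        estimates[k] = phi
    return estimates

random.seed(1)
data = {k: [(random.gauss(mu, 1.0), random.gauss(-mu, 1.0))
            for _ in range(4000)]
        for k, mu in [(1, 0.0), (2, 3.0)]}
print(sa_learn(data))  # estimates near (0, 0) and (3, -3)
```

Note that the loop starts at i = 2, so the requirement a_i < 1 holds for the gain a(i) = 1/i; with this gain the recursion again reduces to the per-class running sample mean.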
III. A MODEL FOR LABELING ERRORS
The model to be used for this purpose was developed by
Chittineni [14]. It can be specified as follows. Let w and ŵ
denote, respectively, the true and the given labels. Clearly,
w, ŵ ∈ {1, 2, ···, m}.
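This excerpt introduces only the label pair; as a hedged illustration of how labels might be corrupted under such a model, the sketch below draws a given label from an assumed confusion distribution. The probabilities `beta` are illustrative assumptions, not values from [14]:

```python
import random

def mislabel(w, beta):
    """Draw the given label w_hat for a sample with true label w,
    where beta[w][w_hat] = P(given label = w_hat | true label = w).
    `beta` itself is an illustrative assumption."""
    r, cum = random.random(), 0.0
    for w_hat, p in sorted(beta[w].items()):
        cum += p
        if r < cum:
            return w_hat
    return max(beta[w])  # guard against floating-point rounding

random.seed(3)
beta = {1: {1: 0.9, 2: 0.1}, 2: {1: 0.2, 2: 0.8}}
rate = sum(mislabel(1, beta) != 1 for _ in range(20000)) / 20000
print(rate)  # close to the assumed flip probability 0.1
```

Feeding such corrupted labels into algorithm (1) is exactly the situation analyzed in the following sections, where the estimates converge to convex linear combinations of the true class parameters.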
0018-9472/87/1100-1072$01.00 ©1987 IEEE