PREDICTING CLASSIFIER PERFORMANCE WITH A SMALL TRAINING SET:
APPLICATIONS TO COMPUTER-AIDED DIAGNOSIS AND PROGNOSIS
Ajay Basavanhally, Scott Doyle, Anant Madabhushi∗
Rutgers, The State University of New Jersey
Department of Biomedical Engineering
Piscataway, NJ, USA
ABSTRACT
Selection of an appropriate classifier for computer-aided diagnosis
(CAD) applications has typically been an ad hoc process. It is difficult
to know a priori which classifier will yield high accuracy for a
specific application, especially when well-annotated data for classifier
training is scarce. In this study, we utilize an inverse power-law
model of statistical learning to predict classifier performance when
only limited amounts of annotated training data are available. The
objectives of this study are to (a) predict classifier error in the
context of different CAD problems when larger data cohorts become
available, and (b) compare classifier performance and trends (both
at the sample/patient level and at the pixel level) as additional data
is accrued (such as in a clinical trial). In this paper we utilize a
power-law model to evaluate and compare various classifiers (Support
Vector Machine (SVM), C4.5 decision tree, k-nearest neighbor)
for four distinct CAD problems. The first two datasets deal
with sample/patient-level classification for distinguishing
(1) high from low grade breast cancers and (2) high from low levels
of lymphocytic infiltration in breast cancer specimens. The other
two datasets are pixel-level classification problems for discriminating
cancerous and non-cancerous regions on prostate (3) MRI and
(4) histopathology. Our empirical results suggest that, given sufficient
training data, SVMs tend to be the best classifiers. This was
true for datasets (1), (2), and (3), while the C4.5 decision tree was
the best classifier for dataset (4). Our results also suggest that
conclusions drawn from classifier comparisons on small data cohorts
should not be generalized as holding true when large amounts of data
become available.
1. INTRODUCTION
Most computer-aided diagnosis (CAD) systems involve a
supervised classifier that must be trained on a set of annotated
examples. These training samples are usually provided by a medical
expert, who labels each sample according to its class. Unfortunately,
in many biomedical applications, training data is not abundant,
either due to the cost involved in obtaining expert annotations
or because of overall data scarcity. Classifier choices are therefore
often made based on classification results from a small number of
training samples, which relies on the assumption that the selected
classifier will exhibit the same performance when exposed to larger
datasets. We aim to demonstrate in this study that classifier trends
observed on small cohorts may not necessarily hold true when larger
amounts of data become available.

∗This work was made possible via grants from the Wallace H. Coulter
Foundation, the New Jersey Commission on Cancer Research, the National
Cancer Institute (R01CA136535-01, R21CA127186-01, R03CA128081-01), the
Cancer Institute of New Jersey, and Bioimagene Inc.

If evaluating a CAD classifier
in the context of a clinical trial (where data becomes available
sequentially), one needs to be wary of choosing a classifier based on
performance on limited training data. The optimal classifier could
change as more data becomes available mid-way through the trial,
at which point one is saddled with the initial classifier. Furthermore,
the selection of an optimal classifier for a specific dataset usually
requires large amounts of annotated training data [1] since the error
rate of a supervised classifier tends to decrease as training set size
increases [2].
The objectives of this work are to address certain key issues that
arise early in the development of a CAD system. These include:
1. Predicting the error rate associated with a classifier, assuming
that a larger data cohort will become available in the future, and
2. Comparing the performance of classifiers, at both the sample
and pixel levels, for large data cohorts based on accuracy
predictions from smaller, limited cohorts.
The specific translational implications of this study will be relevant
in (a) better design of clinical trials (especially pertaining to CAD
systems), and (b) enabling a power analysis of classifiers operating
at the pixel level (as opposed to the patient/sample level), which
currently cannot be done via standard sample-size power calculators.
The methodology employed in this paper is based on the work
in [3], where an inverse power law was used to model the change
in classification accuracy for microarray data as a function of
training set size. In our approach, a subsampling procedure is used to
create multiple training sets at various sizes. The error rates
resulting from evaluation of these training sets are used to determine
the three parameters of the power-law model (rate of learning, decay
rate, and Bayes error) that characterize the behavior of the error rate
as a function of training set size. By calculating these parameters
for various classifiers, we can intelligently choose the classifier that
will yield the optimal accuracy for large training sets. This approach
also allows us to determine whether conclusions derived from
classifier comparison studies involving small datasets are valid in
circumstances where larger data cohorts become available. In this
work, we apply this method to predict the performance of four
classifiers (Support Vector Machine (SVM) with a radial basis function
kernel, SVM with a linear kernel, k-nearest neighbor, and C4.5 decision
tree) on four different CAD tasks (see Table 1 and Section 3).
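As a concrete illustration, the three-parameter model described above can be fit to measured error rates with an off-the-shelf curve fitter. The sketch below assumes the common inverse power-law form e(n) = a·n^(−α) + b (a: rate of learning, α: decay rate, b: Bayes error); the synthetic error measurements, initial guess, and parameter bounds are illustrative assumptions, not the paper's actual data or fitting code.

```python
# Sketch: fit the inverse power-law learning curve e(n) = a * n**(-alpha) + b
# to mean error rates observed at several training-set sizes, then extrapolate.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, alpha, b):
    # Expected error rate for a training set of size n; b is the Bayes error.
    return a * np.power(n, -alpha) + b

# Hypothetical mean error rates at five training-set sizes (illustrative only).
sizes = np.array([10.0, 20.0, 40.0, 80.0, 160.0])
errors = np.array([0.35, 0.28, 0.22, 0.18, 0.16])

# Constrain all three parameters to be non-negative; b is an error rate <= 1.
(a, alpha, b), _ = curve_fit(
    power_law, sizes, errors,
    p0=[1.0, 0.5, 0.1],
    bounds=([0.0, 0.0, 0.0], [np.inf, np.inf, 1.0]),
)

# Predict the error rate for a cohort far larger than was actually annotated.
predicted = power_law(1000.0, a, alpha, b)
```

Comparing the fitted b (the asymptotic error) across classifiers is what lets one rank them for large cohorts before those cohorts exist.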
2. EXPERIMENTAL DESIGN
2.1. Overview of Prediction Methodology
The general procedure for estimating performance comprises the
following steps: (1) Generate training sets that will be used to calculate
978-1-4244-4126-6/10/$25.00 ©2010 IEEE, ISBI 2010
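The subsampling step that opens this procedure can be sketched as follows. The classifier (k-nearest neighbor), the synthetic dataset, and the number of repeats per size are illustrative assumptions; the idea is simply to draw repeated random training subsets at each size and record the mean held-out error, which then supplies the points for the power-law fit.

```python
# Sketch: estimate mean error rate at several training-set sizes by repeated
# random subsampling from a labeled pool, evaluating on a fixed test set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Stand-in labeled data (the paper's datasets are histopathology/MRI features).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=200, random_state=0)

sizes = [10, 20, 40, 80, 160]
mean_errors = []
for n in sizes:
    errs = []
    for _ in range(20):  # repeated random subsamples at each size
        idx = rng.choice(len(X_pool), size=n, replace=False)
        clf = KNeighborsClassifier(n_neighbors=3).fit(X_pool[idx], y_pool[idx])
        errs.append(1.0 - clf.score(X_test, y_test))
    mean_errors.append(float(np.mean(errs)))
```

Each (size, mean error) pair produced here corresponds to one point on the learning curve that the three-parameter model is fit to.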