PREDICTING CLASSIFIER PERFORMANCE WITH A SMALL TRAINING SET: APPLICATIONS TO COMPUTER-AIDED DIAGNOSIS AND PROGNOSIS

Ajay Basavanhally, Scott Doyle, Anant Madabhushi

Rutgers, The State University of New Jersey
Department of Biomedical Engineering
Piscataway, NJ, USA

ABSTRACT

Selection of an appropriate classifier for computer-aided diagnosis (CAD) applications has typically been an ad hoc process. It is difficult to know a priori which classifier will yield high accuracy for a specific application, especially when well-annotated data for classifier training are scarce. In this study, we utilize an inverse power-law model of statistical learning to predict classifier performance when only limited amounts of annotated training data are available. The objectives of this study are to (a) predict classifier error in the context of different CAD problems when larger data cohorts become available, and (b) compare classifier performance and trends (both at the sample/patient level and at the pixel level) as additional data are accrued (such as in a clinical trial). In this paper we utilize a power-law model to evaluate and compare several classifiers (Support Vector Machine (SVM), C4.5 decision tree, k-nearest neighbor) on four distinct CAD problems. The first two datasets deal with sample/patient-level classification for distinguishing (1) high- from low-grade breast cancers and (2) high from low levels of lymphocytic infiltration in breast cancer specimens. The other two datasets are pixel-level classification problems for discriminating cancerous and non-cancerous regions on prostate (3) MRI and (4) histopathology. Our empirical results suggest that, given sufficient training data, SVMs tend to be the best classifiers. This was true for datasets (1), (2), and (3), while the C4.5 decision tree was the best classifier for dataset (4).
Our results also suggest that conclusions from classifier comparisons made on small data cohorts should not be assumed to hold true when large amounts of data become available.

1. INTRODUCTION

Most computer-aided diagnosis (CAD) systems involve a supervised classifier that must be trained on a set of annotated examples. These training samples are usually provided by a medical expert, who labels the samples according to their class. Unfortunately, in many biomedical applications training data are not abundant, either due to the cost involved in obtaining expert annotations or because of overall data scarcity. Classifier choices are therefore often made based on classification results from a small number of training samples, relying on the assumption that the selected classifier will exhibit the same performance when exposed to larger datasets. We aim to demonstrate in this study that classifier trends observed on small cohorts may not necessarily hold true when larger amounts of data become available. When evaluating a CAD classifier in the context of a clinical trial (where data become available sequentially), one needs to be wary of choosing a classifier based on performance on limited training data: the optimal classifier could change as more data become available mid-way through the trial, at which point one is saddled with the initial classifier. Furthermore, the selection of an optimal classifier for a specific dataset usually requires large amounts of annotated training data [1], since the error rate of a supervised classifier tends to decrease as training set size increases [2]. The objectives of this work are to address certain key issues that arise early in the development of a CAD system.

This work was made possible via grants from the Wallace H. Coulter Foundation, New Jersey Commission on Cancer Research, National Cancer Institute (R01CA136535-01, R21CA127186-01, R03CA128081-01), the Cancer Institute of New Jersey, and Bioimagene Inc.
These include:

1. Predicting the error rate associated with a classifier, assuming that a larger data cohort will become available in the future; and

2. Comparing the performance of classifiers, at both the sample and pixel levels, for large data cohorts based on accuracy predictions from smaller, limited cohorts.

The specific translational implications of this study are relevant to (a) better design of clinical trials (especially those pertaining to CAD systems), and (b) enabling a power analysis of classifiers operating at the pixel level (as opposed to the patient/sample level), which cannot currently be done via standard sample power calculators.

The methodology employed in this paper is based on the work in [3], where an inverse power law was used to model the change in classification accuracy for microarray data as a function of training set size. In our approach, a subsampling procedure is used to create multiple training sets of various sizes. The error rates resulting from evaluation of these training sets are used to determine the three parameters of the power-law model (rate of learning, decay rate, and Bayes error) that characterize the behavior of the error rate as a function of training set size. By calculating these parameters for various classifiers, we can intelligently choose the classifier that will yield the best accuracy for large training sets. This approach also allows us to determine whether conclusions derived from classifier comparison studies involving small datasets remain valid when larger data cohorts become available. In this work, we apply this method to predict the performance of four classifiers: SVM with a radial basis function kernel, SVM with a linear kernel, k-nearest neighbor, and C4.5 decision tree, on four different CAD tasks (see Table 1 and Section 3).
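As a concrete illustration of the model described above, the learning curve can be written as err(n) = a * n**(-alpha) + b, where a is the rate of learning, alpha the decay rate, and b the Bayes error. The sketch below is a minimal stand-alone fit of these three parameters, not the authors' implementation: the grid search over b and the log-linear least-squares step for a and alpha are our assumptions about one plausible way to fit the curve.

```python
import math

def fit_inverse_power_law(sizes, errors):
    """Fit err(n) = a * n**(-alpha) + b to observed error rates.

    b (the Bayes error) is found by grid search; for each candidate b,
    log(err - b) = log(a) - alpha * log(n) is solved by ordinary least
    squares to recover a (rate of learning) and alpha (decay rate).
    """
    best = None  # (sse, a, alpha, b)
    n_steps = 200
    b_max = min(errors)
    for i in range(n_steps):
        b = b_max * i / n_steps
        if any(e - b <= 0 for e in errors):
            continue  # the log transform requires err - b > 0
        xs = [math.log(n) for n in sizes]
        ys = [math.log(e - b) for e in errors]
        m = float(len(xs))
        xbar, ybar = sum(xs) / m, sum(ys) / m
        sxx = sum((x - xbar) ** 2 for x in xs)
        sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
        slope = sxy / sxx
        a = math.exp(ybar - slope * xbar)
        alpha = -slope
        sse = sum((a * n ** (-alpha) + b - e) ** 2
                  for n, e in zip(sizes, errors))
        if best is None or sse < best[0]:
            best = (sse, a, alpha, b)
    _, a, alpha, b = best
    return a, alpha, b

def predict_error(a, alpha, b, n):
    """Extrapolate the fitted learning curve to a larger cohort of size n."""
    return a * n ** (-alpha) + b
```

Under this scheme, the error estimates obtained at each training-set size would be fitted per classifier, and the classifier whose curve predicts the lowest error at the anticipated cohort size would be preferred.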
2. EXPERIMENTAL DESIGN

2.1. Overview of Prediction Methodology

The general procedure for estimating performance comprises the following steps: (1) Generate training sets that will be used to calculate

978-1-4244-4126-6/10/$25.00 ©2010 IEEE. ISBI 2010.
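The training-set generation in step (1), drawing random subsamples of increasing size from the available cohort, might be sketched as follows. This is a hypothetical stand-alone illustration, not the authors' code: the function name, the repetition count, and the use of the remaining samples as a held-out testing set are our assumptions.

```python
import random

def subsample_training_sets(n_total, train_sizes, n_reps, seed=0):
    """For each candidate training-set size, draw n_reps random
    partitions of the available sample indices into a training set of
    that size and a held-out testing set (the remaining indices)."""
    rng = random.Random(seed)
    splits = {}
    for size in train_sizes:
        reps = []
        for _ in range(n_reps):
            idx = list(range(n_total))
            rng.shuffle(idx)
            reps.append((idx[:size], idx[size:]))
        splits[size] = reps
    return splits
```

Averaging a classifier's error over the repetitions at each size would then yield the empirical learning curve to which the power-law model is fitted.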