PREDICTING CLASSIFIER PERFORMANCE WITH A SMALL TRAINING SET: APPLICATIONS TO COMPUTER-AIDED DIAGNOSIS AND PROGNOSIS

Ajay Basavanhally, Scott Doyle, Anant Madabhushi

Rutgers, The State University of New Jersey
Department of Biomedical Engineering
Piscataway, NJ, USA

ABSTRACT

Selection of an appropriate classifier for computer-aided diagnosis (CAD) applications has typically been an ad hoc process. It is difficult to know a priori which classifier will yield high accuracy for a specific application, especially when well-annotated data for classifier training are scarce. In this study, we utilize an inverse power-law model of statistical learning to predict classifier performance when only limited amounts of annotated training data are available. The objectives of this study are to (a) predict classifier error in the context of different CAD problems when larger data cohorts become available, and (b) compare classifier performance and trends (both at the sample/patient level and at the pixel level) as additional data are accrued (such as in a clinical trial). In this paper we utilize a power-law model to evaluate and compare several classifiers (Support Vector Machine (SVM), C4.5 decision tree, k-nearest neighbor) on four distinct CAD problems. The first two datasets deal with sample/patient-level classification for distinguishing (1) high- from low-grade breast cancers and (2) high from low levels of lymphocytic infiltration in breast cancer specimens. The other two datasets are pixel-level classification problems for discriminating cancerous and non-cancerous regions on prostate (3) MRI and (4) histopathology. Our empirical results suggest that, given sufficient training data, SVMs tend to be the best classifiers. This was true for datasets (1), (2), and (3), while the C4.5 decision tree was the best classifier for dataset (4).
Our results also suggest that conclusions from classifier comparisons made on small data cohorts should not be assumed to hold true when large amounts of data become available.

1. INTRODUCTION

Most computer-aided diagnosis (CAD) systems involve a supervised classifier that must be trained on a set of annotated examples. These training samples are usually provided by a medical expert, who labels the samples according to their class. Unfortunately, in many biomedical applications training data are not abundant, either due to the cost involved in obtaining expert annotations or because of overall data scarcity. Classifier choices are therefore often made based on classification results from a small number of training samples, relying on the assumption that the selected classifier will exhibit the same performance when exposed to larger datasets. We aim to demonstrate in this study that classifier trends observed on small cohorts may not necessarily hold true when larger amounts of data become available. When evaluating a CAD classifier in the context of a clinical trial (where data become available sequentially), one needs to be wary of choosing a classifier based on performance on limited training data: the optimal classifier could change as more data become available mid-way through the trial, at which point one is saddled with the initial classifier. Furthermore, the selection of an optimal classifier for a specific dataset usually requires large amounts of annotated training data [1], since the error rate of a supervised classifier tends to decrease as training set size increases [2]. The objectives of this work are to address certain key issues that arise early in the development of a CAD system.

This work was made possible via grants from the Wallace H. Coulter Foundation, New Jersey Commission on Cancer Research, National Cancer Institute (R01CA136535-01, R21CA127186-01, R03CA128081-01), the Cancer Institute of New Jersey, and Bioimagene Inc.
These include:

1. Predicting the error rate associated with a classifier, assuming that a larger data cohort will become available in the future; and

2. Comparing the performance of classifiers, at both the sample and pixel levels, for large data cohorts based on accuracy predictions from smaller, limited cohorts.

The specific translational implications of this study are relevant to (a) better design of clinical trials (especially those pertaining to CAD systems), and (b) enabling a power analysis of classifiers operating at the pixel level (as opposed to the patient/sample level), which cannot currently be done via standard sample power calculators.

The methodology employed in this paper is based on the work in [3], where an inverse power law was used to model the change in classification accuracy for microarray data as a function of training set size. In our approach, a subsampling procedure is used to create multiple training sets of various sizes. The error rates resulting from evaluation of these training sets are used to determine the three parameters of the power-law model (rate of learning, decay rate, and Bayes error) that characterize the behavior of the error rate as a function of training set size. By calculating these parameters for various classifiers, we can intelligently choose the classifier that will yield the best accuracy for large training sets. This approach also allows us to determine whether conclusions derived from classifier comparison studies involving small datasets remain valid when larger data cohorts become available. In this work, we apply this method to predict the performance of four classifiers: SVM with a radial basis function kernel, SVM with a linear kernel, k-nearest neighbor, and C4.5 decision tree, on four different CAD tasks (see Table 1 and Section 3).
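As a concrete illustration of the model described above, the learning curve can be written as err(n) = a * n**(-alpha) + b, where a is the rate of learning, alpha the decay rate, and b the Bayes error. The sketch below is a minimal stand-alone fit of these three parameters, not the authors' implementation: the grid search over b and the log-linear least-squares step for a and alpha are our assumptions about one plausible way to fit the curve.

```python
import math

def fit_inverse_power_law(sizes, errors):
    """Fit err(n) = a * n**(-alpha) + b to observed error rates.

    b (the Bayes error) is found by grid search; for each candidate b,
    log(err - b) = log(a) - alpha * log(n) is solved by ordinary least
    squares to recover a (rate of learning) and alpha (decay rate).
    """
    best = None  # (sse, a, alpha, b)
    n_steps = 200
    b_max = min(errors)
    for i in range(n_steps):
        b = b_max * i / n_steps
        if any(e - b <= 0 for e in errors):
            continue  # the log transform requires err - b > 0
        xs = [math.log(n) for n in sizes]
        ys = [math.log(e - b) for e in errors]
        m = float(len(xs))
        xbar, ybar = sum(xs) / m, sum(ys) / m
        sxx = sum((x - xbar) ** 2 for x in xs)
        sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
        slope = sxy / sxx
        a = math.exp(ybar - slope * xbar)
        alpha = -slope
        sse = sum((a * n ** (-alpha) + b - e) ** 2
                  for n, e in zip(sizes, errors))
        if best is None or sse < best[0]:
            best = (sse, a, alpha, b)
    _, a, alpha, b = best
    return a, alpha, b

def predict_error(a, alpha, b, n):
    """Extrapolate the fitted learning curve to a larger cohort of size n."""
    return a * n ** (-alpha) + b
```

Under this scheme, the error estimates obtained at each training-set size would be fitted per classifier, and the classifier whose curve predicts the lowest error at the anticipated cohort size would be preferred.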
2. EXPERIMENTAL DESIGN

2.1. Overview of Prediction Methodology

The general procedure for estimating performance comprises the following steps: (1) Generate training sets that will be used to calculate

978-1-4244-4126-6/10/$25.00 ©2010 IEEE. ISBI 2010.
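The training-set generation in step (1), drawing random subsamples of increasing size from the available cohort, might be sketched as follows. This is a hypothetical stand-alone illustration, not the authors' code: the function name, the repetition count, and the use of the remaining samples as a held-out testing set are our assumptions.

```python
import random

def subsample_training_sets(n_total, train_sizes, n_reps, seed=0):
    """For each candidate training-set size, draw n_reps random
    partitions of the available sample indices into a training set of
    that size and a held-out testing set (the remaining indices)."""
    rng = random.Random(seed)
    splits = {}
    for size in train_sizes:
        reps = []
        for _ in range(n_reps):
            idx = list(range(n_total))
            rng.shuffle(idx)
            reps.append((idx[:size], idx[size:]))
        splits[size] = reps
    return splits
```

Averaging a classifier's error over the repetitions at each size would then yield the empirical learning curve to which the power-law model is fitted.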