IEEE TRANSACTIONS ON MEDICAL IMAGING, VOL. 99, NO. 99, NOVEMBER 9999 1

Hyper-spectral microscopic discrimination between normal and cancerous colon biopsies

Franco Woolfe*, Mauro Maggioni, Gustave Davis, Frederick Warner, Ronald Coifman, and Steven Zucker

Abstract— The spectral study of cancer dates back 50 years, but it is still not known whether spectral measurements suffice to distinguish cancerous from normal tissue. An objective approach to that question is to design automatic classifiers that discriminate between these two classes and then to estimate their generalization error rates. Previous studies have not estimated errors adequately: it is not a priori clear whether unseen spectra from patients in the algorithm’s test set are sufficiently independent of the training data to provide a fair evaluation. We show experimentally that to obtain accurate error estimates, spectra from unseen patients are necessary. Our results suggest that although spectra do not suffice to distinguish fully between cancerous and normal tissue, a high degree of discrimination is possible. This leads us to ask how discriminatory spectral features should be selected. The features in previous work on cancer spectroscopy have been chosen according to heuristics. We use the “best basis” algorithm to select a Haar wavelet packet basis that is optimal for the discrimination task at hand. This basis provides interpretable spectral features consisting of contiguous wavelength bands. However, these features are outperformed by features that use information from all parts of the spectrum, combined linearly at random.

I. INTRODUCTION

HYPER-SPECTRAL imaging for the characterization of cancer dates back more than 50 years [1]. A natural question is whether the information in the spectrum is sufficient to distinguish cancerous from normal tissue. To answer it we can design automatic classifiers that are blind to all other information.
The error rate of the classifier with respect to the entire population then quantifies the spectral information that this classifier is able to access and that is useful for discriminating between cancerous and normal tissue. Two problems arise: (1) there might be information available that a particular classifier misses, and (2) we can never evaluate a classifier on the entire population to find its true error rate. The first problem is that of building an optimal classifier, which is the subject of statistical learning (see, for example, [2]) and has not been solved in general. However, using a particular classifier does yield a lower bound on the available information, provided that its error rate is estimated reliably. The second problem, estimating the error rate of a classifier given only a finite data sample, has been well studied. The standard solution is cross validation, introduced by Stone in [3]: one partitions the data at random into a training set (used to build the classifier) and a testing set (used to obtain one estimate of its error rate). By partitioning many times one can calculate reliable estimates of the true error rate. Unfortunately, for the general problem of estimating the performance of a statistical model, “analytical results are difficult, if not impossible” according to [4]. On the other hand, extensive simulation studies [4], [5] have demonstrated the reliability of cross validation empirically. The rationale behind testing a classifier on unseen data is that the unseen data should be independent of the data used for training.

Manuscript received January 99, 9999; revised November 99, 9999. Mauro Maggioni is with the Duke University Mathematics Department. The other authors are with the Yale University Program in Applied Mathematics. *Franco Woolfe is the corresponding author; email: Franco.Woolfe@Yale.edu.
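The repeated random-partition procedure described above can be sketched as follows. This is a minimal illustration on synthetic one-dimensional data; the nearest-centroid classifier and the synthetic data here are hypothetical stand-ins for exposition, not the classifier or spectra used in this paper.

```python
import random

def nearest_centroid_error(train, test):
    """Train a nearest-centroid classifier and return its test error rate.

    (Hypothetical stand-in classifier, chosen only for simplicity.)
    """
    # Compute one centroid (mean feature value) per class from the training set.
    centroids = {}
    for label in {y for _, y in train}:
        vals = [x for x, y in train if y == label]
        centroids[label] = sum(vals) / len(vals)
    # Classify each test point by its closest centroid; count mistakes.
    errors = sum(
        1 for x, y in test
        if min(centroids, key=lambda c: abs(x - centroids[c])) != y
    )
    return errors / len(test)

def cross_validate(data, n_splits=100, test_frac=0.3, seed=0):
    """Partition the data at random many times and average the error estimates."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_splits):
        shuffled = data[:]
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_frac)
        test, train = shuffled[:n_test], shuffled[n_test:]
        estimates.append(nearest_centroid_error(train, test))
    return sum(estimates) / len(estimates)

# Synthetic "spectra": class 0 centered at 0.0, class 1 at 1.0, with noise.
rng = random.Random(42)
data = [(rng.gauss(float(label), 0.5), label)
        for label in (0, 1) for _ in range(200)]
print(f"estimated error rate: {cross_validate(data):.3f}")
```

Averaging over many random partitions reduces the variance of the single-split estimate, which is the point of repeating the partitioning.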
For example, suppose a classifier is trained on spectra from a certain patient and evaluated on different spectra from the same patient. We call that approach weak cross validation. By contrast, we use the term strong cross validation to refer to training on data from some patients and testing on data from different patients. Molckovsky et al. in [6] introduce their use of weak cross validation by saying that “although multiple spectra could be obtained from a large polyp, each Raman spectrum was considered independent”. Other uses of weak cross validation in the literature on cancer recognition algorithms include [7]–[11]. On the other hand, some works [12]–[14] make use of strong cross validation, and yet others do not specify [15]–[18]. This suggests that some researchers may be unaware that the distinction between weak and strong cross validation can be an issue. The question is: are the success rates reported by weak cross validation studies believable estimates of the error rate on the total population? In this paper we answer that question by evaluating our algorithms under both the strong and the weak cross validation frameworks, for comparison. Our results indicate that weak cross validation does not suffice to estimate reliably the out-of-sample error rate of a classification algorithm: it tends to overestimate success rates. Thus, in order to obtain a true lower bound on the useful spectral information content of tissue for the task of cancer recognition, we must use strong cross validation. Our classification results under that framework do suggest that some information relevant to discrimination between normal and cancerous colon tissue is available in visible light spectra. We proceed to ask whether that information is confined to certain wavelength bands. Is the entire spectrum needed for classification, or do a smaller number of spectral features suffice to extract the available information?
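The distinction between the two splitting schemes can be made concrete with a short sketch. The per-spectrum patient-id data layout below is a hypothetical illustration, not the data format used in this study; the essential point is only which unit (spectrum or patient) is shuffled.

```python
import random

def weak_split(spectra, test_frac=0.3, seed=0):
    """Weak cross validation: split individual spectra at random, so the
    same patient can contribute spectra to both train and test sets."""
    rng = random.Random(seed)
    shuffled = spectra[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

def strong_split(spectra, test_frac=0.3, seed=0):
    """Strong cross validation: hold out whole patients, so no patient
    appears in both the train and the test set."""
    rng = random.Random(seed)
    patients = sorted({s["patient"] for s in spectra})
    rng.shuffle(patients)
    held_out = set(patients[: int(len(patients) * test_frac)])
    train = [s for s in spectra if s["patient"] not in held_out]
    test = [s for s in spectra if s["patient"] in held_out]
    return train, test

# Hypothetical layout: 10 patients, 5 spectra each, with a class label.
spectra = [{"patient": p, "label": p % 2, "values": [0.0] * 16}
           for p in range(10) for _ in range(5)]

train_s, test_s = strong_split(spectra)
# Under the strong split, train and test share no patients.
assert not ({s["patient"] for s in train_s} & {s["patient"] for s in test_s})
```

Under the weak split the disjointness assertion above would generally fail, since spectra from one patient are scattered across both sets; that within-patient correlation is exactly what inflates weakly cross-validated success rates.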
Identifying a smaller number of relevant features has the additional advantage that we can decrease computational and image acquisition times, since the data volume will be lower. What is more, the discrimination will be more straightforward with only a few features, owing to the lower dimensionality of the problem (alleviating the “curse of dimensionality” [2]). Previous studies on how to select