Biometrics 60, 199–206
March 2004

Partially Supervised Learning Using an EM-Boosting Algorithm

Yutaka Yasui,1,* Margaret Pepe,1,2 Li Hsu,1 Bao-Ling Adam,3 and Ziding Feng1

1 Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109-1024, U.S.A.
2 Department of Biostatistics, University of Washington, Seattle, Washington 98195-7232, U.S.A.
3 Center for Biotechnology and Genomic Medicine, Medical College of Georgia, Augusta, Georgia 30912, U.S.A.
* email: yyasui@fhcrc.org

Summary. Training data in a supervised learning problem consist of the class label and its potential predictors for a set of observations. Constructing effective classifiers from training data is the goal of supervised learning. In biomedical sciences and other scientific applications, class labels may be subject to errors. We consider a setting where there are two classes, but observations with labels corresponding to one of the classes may in fact be mislabeled. The application concerns the use of protein mass-spectrometry data to discriminate between serum samples from cancer and noncancer patients. The patients in the training set are classified on the basis of tissue biopsy. Although biopsy is 100% specific in the sense that tissue shown to have malignant cells is certainly cancer, it is less than 100% sensitive. Reference gold standards that are subject to this special type of misclassification, due to imperfect diagnostic certainty, arise in many fields. We consider the development of a supervised learning algorithm under these conditions and refer to it as partially supervised learning. Boosting is a supervised learning algorithm geared toward high-dimensional predictor data, such as those generated in protein mass-spectrometry. We propose a modification of the boosting algorithm for partially supervised learning.
The proposal is to view the true class membership of the samples that are labeled with the error-prone class label as missing data, and to apply an algorithm related to the EM algorithm for minimization of a loss function. To assess the usefulness of the proposed method, we artificially mislabeled a subset of samples and applied the original and EM-modified boosting (EM-Boost) algorithms for comparison. Notable improvements in misclassification rates are observed with EM-Boost.

Key words: High-dimensional data; Misclassification; Proteomics.

1. Introduction

New biomedical technologies such as gene expression arrays and protein mass-spectrometry profiles promise new approaches to the diagnosis of diseases such as cancer. They also promise to provide avenues for disease screening and for predicting the prognosis of patients with disease. Algorithms will need to be developed for using the data generated by these technologies in order to classify patients, for example as having disease or not, as likely to have a good or poor prognosis, and so forth. The classification problem has a long history in the field of statistics (Fisher, 1936; Green and Swets, 1966; McLachlan, 1992) and includes classical methods such as discriminant analysis, logistic regression, and Bayesian decision theory. A key feature of the newer technologies that is not easily accommodated by classical methods, however, is that the data generated by them are of high dimension. Supervised learning algorithms that have been developed recently to deal with high-dimensional data for classification are summarized in the book by Hastie, Tibshirani, and Friedman (2001). Boosting is one algorithm that has had great success in applications. It is described by Hastie, Tibshirani, and Friedman as "one of the most powerful learning ideas that has been introduced in the last 10 years," although there is controversy about how and why it works.
We have applied boosting to protein mass-spectrometry data from serum samples of men with prostate cancer and with normal prostate glands. The resulting classification algorithm was almost perfectly accurate when tested on an independent set of samples.

In this article we extend boosting in an important way: to accommodate settings where the class label in the training set, Y, is subject to error. This problem arose for us when we tried to apply boosting to discriminate between prostate cancer cases and abnormal (but apparently noncancer) controls. Men in this latter group were defined by having benign hyperplasia of the prostate (BPH). Boosting yielded a classification algorithm that did not distinguish very well between the BPH controls and the cancer cases. However, there is evidence from the literature that up to 30% of BPH controls do in fact have cancer (Djavan et al., 2001), and we therefore suspected that the poor performance of boosting was due in part to the fact that it does not accommodate such mislabeling of Y. We denote the training data used for learning by {(Y_i, X_i), i = 1, ..., N}, where X_i is the set of predictor variables for subject i and Y_i is the dichotomous observed label, Y_i = 1 for a case and Y_i = -1 for a control. Although our approach
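To make the setup concrete, the following is a minimal sketch of standard (fully supervised) boosting in the paper's notation: training data {(Y_i, X_i)} with Y_i in {-1, +1}, combined into a weighted vote over decision stumps, as in the classical AdaBoost algorithm. This is a toy illustration of ordinary boosting only, not the EM-Boost modification proposed in the article; all function names are hypothetical.

```python
# A toy AdaBoost with decision stumps on labels Y_i in {-1, +1}.
# Illustrative sketch only -- not the authors' EM-Boost algorithm.
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """Predict +1/-1 from a single-feature threshold rule."""
    return polarity * np.where(X[:, feature] > threshold, 1, -1)

def fit_stump(X, y, w):
    """Exhaustively find the stump minimizing weighted training error."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for pol in (1, -1):
                err = np.sum(w * (stump_predict(X, j, t, pol) != y))
                if best is None or err < best[0]:
                    best = (err, j, t, pol)
    return best

def adaboost(X, y, n_rounds=10):
    """Fit an ensemble of weighted stumps by reweighting the data."""
    n = len(y)
    w = np.full(n, 1.0 / n)          # uniform initial observation weights
    ensemble = []
    for _ in range(n_rounds):
        err, j, t, pol = fit_stump(X, y, w)
        err = max(err, 1e-10)        # guard against log(0)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = stump_predict(X, j, t, pol)
        w *= np.exp(-alpha * y * pred)   # upweight misclassified points
        w /= w.sum()
        ensemble.append((alpha, j, t, pol))
    return ensemble

def predict(ensemble, X):
    """Classify by the sign of the weighted vote of the stumps."""
    score = sum(a * stump_predict(X, j, t, pol) for a, j, t, pol in ensemble)
    return np.sign(score)
```

The reweighting step (`w *= np.exp(-alpha * y * pred)`) is where boosting implicitly trusts every observed label Y_i; when some labels in one class are wrong, as in the BPH-control setting above, mislabeled points are repeatedly upweighted, which motivates treating the true class membership as missing data instead.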