Biometrics 60, 199–206
March 2004

Partially Supervised Learning Using an EM-Boosting Algorithm

Yutaka Yasui,1,* Margaret Pepe,1,2 Li Hsu,1 Bao-Ling Adam,3 and Ziding Feng1

1 Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109-1024, U.S.A.
2 Department of Biostatistics, University of Washington, Seattle, Washington 98195-7232, U.S.A.
3 Center for Biotechnology and Genomic Medicine, Medical College of Georgia, Augusta, Georgia 30912, U.S.A.
* email: yyasui@fhcrc.org

Summary. Training data in a supervised learning problem consist of the class label and its potential predictors for a set of observations. Constructing effective classifiers from training data is the goal of supervised learning. In biomedical sciences and other scientific applications, class labels may be subject to errors. We consider a setting where there are two classes, but observations with labels corresponding to one of the classes may in fact be mislabeled. The application concerns the use of protein mass-spectrometry data to discriminate between serum samples from cancer and noncancer patients. The patients in the training set are classified on the basis of tissue biopsy. Although biopsy is 100% specific in the sense that tissue shown to have malignant cells is certainly cancer, it is less than 100% sensitive. Reference gold standards that are subject to this special type of misclassification, due to imperfect diagnostic certainty, arise in many fields. We consider the development of a supervised learning algorithm under these conditions and refer to it as partially supervised learning. Boosting is a supervised learning algorithm geared toward high-dimensional predictor data, such as those generated in protein mass-spectrometry. We propose a modification of the boosting algorithm for partially supervised learning.
The proposal is to view the true class membership of the samples that are labeled with the error-prone class label as missing data, and to apply an algorithm related to the EM algorithm for minimization of a loss function. To assess the usefulness of the proposed method, we artificially mislabeled a subset of samples and applied the original and EM-modified boosting (EM-Boost) algorithms for comparison. Notable improvements in misclassification rates are observed with EM-Boost.

Key words: High-dimensional data; Misclassification; Proteomics.

1. Introduction

New biomedical technologies such as gene expression arrays and protein mass-spectrometry profiles promise new approaches to the diagnosis of diseases such as cancer. They also promise to provide avenues for disease screening and for predicting the prognosis of patients with disease. Algorithms will need to be developed for using the data generated by these technologies in order to classify patients, for example as having disease or not, as likely to have a good or poor prognosis, and so forth. The classification problem has a long history in the field of statistics (Fisher, 1936; Green and Swets, 1966; McLachlan, 1992) and includes classical methods such as discriminant analysis, logistic regression, and Bayesian decision theory. A key feature of the newer technologies that is not easily accommodated by classical methods, however, is that the data generated by them are of high dimension. Supervised learning algorithms that have been developed recently to deal with high-dimensional data for classification are summarized in the book by Hastie, Tibshirani, and Friedman (2001). Boosting is one algorithm that has had great success in applications. It is described by Hastie, Tibshirani, and Friedman as "one of the most powerful learning ideas that has been introduced in the last 10 years," although there is controversy about how and why it works.
We have applied boosting to protein mass-spectrometry data from serum samples of men with prostate cancer and with normal prostate glands. The resulting classification algorithm was almost perfectly accurate when tested on an independent set of samples.

In this article we extend boosting in an important way: to accommodate settings where the class label in the training set, Y, is subject to error. This problem arose for us when we tried to apply boosting to discriminate between prostate cancer cases and abnormal (but apparently noncancer) controls. Men in this latter group were defined by having benign hyperplasia of the prostate (BPH). Boosting yielded a classification algorithm that did not distinguish very well between the BPH controls and the cancer cases. However, there is evidence from the literature that up to 30% of BPH controls do in fact have cancer (Djavan et al., 2001), and we therefore suspected that the poor performance of boosting was due in part to the fact that it does not accommodate such mislabeling of Y. We denote the training data used for learning by {(Y_i, X_i), i = 1, ..., N}, where X_i is the set of predictor variables for subject i and Y_i is the dichotomous observed label, Y_i = 1 for a case and Y_i = -1 for a control. Although our approach
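To make the setup concrete, the following is a minimal sketch of standard (fully supervised) boosting in the paper's notation: training data {(Y_i, X_i)} with Y_i in {-1, +1}, combined into a weighted vote over decision stumps, as in the classical AdaBoost algorithm. This is a toy illustration of ordinary boosting only, not the EM-Boost modification proposed in the article; all function names are hypothetical.

```python
# A toy AdaBoost with decision stumps on labels Y_i in {-1, +1}.
# Illustrative sketch only -- not the authors' EM-Boost algorithm.
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """Predict +1/-1 from a single-feature threshold rule."""
    return polarity * np.where(X[:, feature] > threshold, 1, -1)

def fit_stump(X, y, w):
    """Exhaustively find the stump minimizing weighted training error."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for pol in (1, -1):
                err = np.sum(w * (stump_predict(X, j, t, pol) != y))
                if best is None or err < best[0]:
                    best = (err, j, t, pol)
    return best

def adaboost(X, y, n_rounds=10):
    """Fit an ensemble of weighted stumps by reweighting the data."""
    n = len(y)
    w = np.full(n, 1.0 / n)          # uniform initial observation weights
    ensemble = []
    for _ in range(n_rounds):
        err, j, t, pol = fit_stump(X, y, w)
        err = max(err, 1e-10)        # guard against log(0)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = stump_predict(X, j, t, pol)
        w *= np.exp(-alpha * y * pred)   # upweight misclassified points
        w /= w.sum()
        ensemble.append((alpha, j, t, pol))
    return ensemble

def predict(ensemble, X):
    """Classify by the sign of the weighted vote of the stumps."""
    score = sum(a * stump_predict(X, j, t, pol) for a, j, t, pol in ensemble)
    return np.sign(score)
```

The reweighting step (`w *= np.exp(-alpha * y * pred)`) is where boosting implicitly trusts every observed label Y_i; when some labels in one class are wrong, as in the BPH-control setting above, mislabeled points are repeatedly upweighted, which motivates treating the true class membership as missing data instead.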