FOCUS

Semi-supervised model applied to the prediction of the response to preoperative chemotherapy for breast cancer

Frederico Coelho • Antônio de Pádua Braga • René Natowicz • Roman Rouzier

Published online: 12 March 2010
© Springer-Verlag 2010

Abstract Breast cancer is the second most frequent cancer, and the most frequent one affecting women. The standard treatment has three main stages: preoperative chemotherapy, followed by surgery, and then postoperative chemotherapy. Because a response to preoperative chemotherapy is correlated with a good prognosis, and because clinical and biological information does not yield accurate predictions of the response, much research effort is being devoted to the design of predictors relying on the measurement of gene expression levels. In the present paper, we report our work on designing genomic predictors of the response to preoperative chemotherapy, making use of a semi-supervised machine learning approach. The method is based on geometric margin information of patterns in low-density areas, computed on a labeled dataset and on an unlabeled one.

1 Introduction

Predicting the response of a patient to preoperative chemotherapy from the measurement of gene expression has been a main issue in clinical cancer research since DNA microarrays became available, about 10 years ago. The importance of developing such predictors stems from the fact that only 30% of patients have a positive response to the treatment and, in the absence of efficient predictors of the response, most patients are allocated to the standard treatment. Many statistical and machine learning models have been developed to address the problem (Cooper 2001; Glas et al. 2006; Ancona et al. 2006; Michiels et al. 2005), but no genomic predictor is yet accurate enough to be used in clinical routine.
Among the main issues in the development of such models are: (a) selecting the relevant genes to enter the predictors among the thousands of genes whose expression levels are measured by DNA microarrays (the vast majority of them not being involved in the response to chemotherapy treatments), (b) the small number of cases compared to the number of features (gene expressions), and (c) the representativeness of the data. These difficulties make the development and validation of prediction models challenging.

F. Coelho (✉) · A. de Pádua Braga
PPGEE, CPDEE, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
e-mail: fredgfc@cpdee.ufmg.br
A. de Pádua Braga
e-mail: apbraga@ufmg.br
R. Natowicz
ESIEE-Paris, Département d'informatique, Université Paris-Est, Paris, France
e-mail: r.natowicz@esiee.fr
R. Rouzier
Department of Gynecology, Hôpital Tenon, Paris, France
e-mail: roman.rouzier@tnn.aphp.fr
Soft Comput (2011) 15:1137–1144
DOI 10.1007/s00500-010-0589-8

In the particular case of the application reported in this article, the dataset consists, for each patient case, of the expression levels of a set of genes considered relevant markers of the response to chemotherapy, together with the outcome of the treatment. The data were collected in a clinical trial in which 133 patients were enrolled. The clinical trial was jointly conducted at the Institut Gustave Roussy (Villejuif, France), where 51 patients were treated, and at the MD Anderson Cancer Center (Houston, USA), where 82 patients were treated. All the patients were allocated to a preoperative chemotherapy treatment, to which