Using Kaplan–Meier analysis together with decision tree methods (C&RT, CHAID, QUEST, C4.5 and ID3) in determining recurrence-free survival of breast cancer patients

Mevlut Ture a,*, Fusun Tokatli b, Imran Kurt c

a Trakya University, Medical Faculty, Department of Biostatistics, 22030 Edirne, Turkey
b Trakya University, Medical Faculty, Department of Radiation Oncology, Edirne, Turkey
c Eskisehir Osmangazi University, Medical Faculty, Department of Biostatistics, Eskisehir, Turkey

Abstract

Current evidence supports a clear association between clinical and pathologic factors and recurrence-free survival (RFS) in breast cancer patients. The Cox regression model is the most common tool for simultaneously investigating the influence of several factors on patients' survival time, but it gives no estimate of the degree of separation between subgroups. We propose to analyze different decision tree methods (C&RT, CHAID, QUEST, C4.5 and ID3) and to use them in addition to the well-known Kaplan–Meier estimates to investigate the predictive power of these methods. Five hundred patients were included in the study. Two hundred and seventy-nine of them had complete data for prognostic factors, and the median follow-up was about 40.5 months. First, the decision tree methods were applied to the prognostic factors. Then, according to the multidimensional scaling method, C4.5 (error rate 0.2258 for the training set and 0.3259 for cross-validation) performed slightly better than the other methods in predicting risk factors for recurrence. Tumor size, age at menarche, hormonal therapy, histological grade and axillary nodal status were found to be important risk factors for recurrence. Eight terminal nodes were found and stratified by Kaplan–Meier survival curves. Larger tumor size (≥4.4 cm) and receiving no hormonal therapy in a small subgroup of patients were associated with worse prognosis. The five-year RFS was 71.3% in the whole patient population.
The sensitivity, specificity and predictive rate of the C4.5 method were 43.8%, 91% and 77.4%, respectively. In this study, C4.5 showed a better degree of separation. We therefore recommend using decision tree methods together with Kaplan–Meier analysis to determine risk factors and the effect of these factors on survival.

© 2008 Elsevier Ltd. All rights reserved.

Keywords: Decision tree; C&RT; CHAID; QUEST; C4.5; ID3; Kaplan–Meier; Breast cancer; Recurrence-free survival

Expert Systems with Applications 36 (2009) 2017–2026
Available online at www.sciencedirect.com
www.elsevier.com/locate/eswa
0957-4174/$ - see front matter. doi:10.1016/j.eswa.2007.12.002
* Corresponding author. Tel.: +90 284 2357641/1631; fax: +90 284 2357652. E-mail address: ture@trakya.edu.tr (M. Ture).

1. Introduction

The clinicopathologic characteristics of breast cancer patients are heterogeneous. Consequently, survival times differ among subgroups of patients. In general, five-year recurrence-free survival ranges from 65% to 80% across all breast cancer patients (Buchholz, Strom, & McNeese, 2003). The purposes of this study were to apply a novel analytical method to breast cancer patients to identify prognostic factors, and to explore the interactions between clinical variables and their impact on survival. Decision tree algorithms allow for non-linear relations between predictive factors and outcomes and for mixed data types (numerical and categorical), isolate outliers, and incorporate a pruning process using cross-validation as an alternative to testing for unbiasedness with a second data set (Faderl et al., 2002). In the literature, there are several reports on the separation of patients into subgroups with different survival prognoses (Aligayer et al., 2002; Kenneth, Abbruzzese,