Research Article Open Access
Volume 4 • Issue 2 • 1000124
J Health Med Inform
ISSN: 2157-7420 JHMI, an open access journal
Keywords:
Classification; Decision tree; Machine learning; Support vector
machine; 10-Fold cross-validation
Introduction
Breast cancer (BC) is the most common cancer in women, affecting
about 10% of all women at some stage of their lives. In recent years,
the incidence rate has kept increasing, and data show that the survival
rate is 88% five years after diagnosis and 80% 10 years after
diagnosis [1]. Early prediction of breast cancer is one of the most
crucial tasks in the follow-up process. Data mining methods can help
to reduce the number of false positive and false negative decisions [2,3].
Consequently, new methods such as knowledge discovery in databases
(KDD) have become popular research tools for medical researchers
who try to identify and exploit patterns and relationships among
large numbers of variables, and to predict the outcome of a disease using
historical cases stored in datasets [4].
In this paper, we used data mining techniques to develop
models that predict the recurrence of breast cancer by analyzing
data collected from the Iranian Center for Breast Cancer (ICBC)
registry. The next sections of this paper review related work,
describe the background of this study, evaluate three classification
models (C4.5 DT, SVM, and ANN), explain the methodology used to
conduct the prediction, present experimental results, and conclude
the paper. To validate the models, accuracy, sensitivity, and
specificity were used as criteria and compared across the models.
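The three criteria named above can be computed from the four outcome counts of a binary classifier. The following is a minimal sketch, assuming that recurrence is treated as the positive class; the names `ConfusionCounts`, `confusion_counts`, and `evaluate`, and the sample labels, are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # recurrence cases predicted as recurrence
    fn: int  # recurrence cases predicted as non-recurrence
    tn: int  # non-recurrence cases predicted as non-recurrence
    fp: int  # non-recurrence cases predicted as recurrence


def confusion_counts(y_true, y_pred, positive=1):
    """Tally the four outcome counts for a binary classifier."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return ConfusionCounts(tp=tp, fn=fn, tn=tn, fp=fp)


def _ratio(numerator, denominator):
    # A ratio is undefined when its denominator is zero (for example,
    # no positive cases at all); report NaN instead of raising.
    return numerator / denominator if denominator else float("nan")


def evaluate(counts):
    """Return accuracy, sensitivity, and specificity as a dict."""
    total = counts.tp + counts.fn + counts.tn + counts.fp
    return {
        "accuracy": _ratio(counts.tp + counts.tn, total),
        "sensitivity": _ratio(counts.tp, counts.tp + counts.fn),
        "specificity": _ratio(counts.tn, counts.tn + counts.fp),
    }


# Made-up labels for illustration only (1 = recurrence, 0 = no recurrence).
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
metrics = evaluate(confusion_counts(y_true, y_pred))
print(metrics)
```

Sensitivity measures how many true recurrence cases a model catches, while specificity measures how many non-recurrence cases it correctly clears, so reporting both alongside accuracy guards against a model that looks accurate only because one class dominates.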
Literature review and previous works
models to predict 5-, 10-, and 15-year breast cancer survival. They
studied 951 breast cancer patients and used tumor size, axillary nodal
status, histological type, mitotic count, nuclear pleomorphism, tubule
formation, tumor necrosis, and age as input variables [7]. Pendharker
patterns in breast cancer. In this study, they showed that data mining
could be a valuable tool in identifying similarities (patterns) in breast
cancer cases, which can be used for diagnosis, prognosis, and treatment
purposes [4]. These studies are some examples of research that applies
data mining to medical fields for the prediction of diseases.
Materials and Methods
In order to predict the 2-year recurrence rate of breast cancer, we
used the ICBC dataset at the National Cancer Institute of Tehran for the
years 1997-2008. The ICBC is responsible for collecting incidence and
survival data from the participating registries and disseminating these
datasets for the purpose of conducting analytical research projects.
This dataset contained population characteristics and included 22
input variables. Our cases were drawn from a total of
1189 women who were diagnosed with breast cancer. We preprocessed
the data to remove unsuitable cases. After data cleansing and
data preparation, the final dataset was constructed:
642 records were excluded because of missing data, leaving 547 cases
for analysis. Patients with breast cancer recurrence were followed-up
*Corresponding author: Leila Ghasem Ahmad, Department of Management
Information Systems, Science and Research Branch, Islamic Azad University of
Tehran-Iran, Iran, E-mail: lga_77@yahoo.com
Received January 28, 2013; Accepted April 18, 2013; Published April 24, 2013
Abstract
Objective: The number and size of medical databases are increasing rapidly, but most of these data are not
analyzed to find valuable hidden knowledge. Advanced data mining techniques can be used to discover hidden
patterns and relationships. Models developed with these techniques can help medical practitioners make the right
decisions. The present research studied the application of data mining techniques to develop predictive models for
breast cancer recurrence in patients who were followed up for two years.
Method: The patients were registered in the Iranian Center for Breast Cancer (ICBC) program from 1997 to 2008.
The dataset contained 1189 records, 22 predictor variables, and one outcome variable. We implemented machine
learning techniques, i.e., Decision Tree (C4.5), Support Vector Machine (SVM), and Artificial Neural Network (ANN), to
develop the predictive models. The main goal of this paper is to compare the performance of these three well-known
algorithms on our data in terms of sensitivity, specificity, and accuracy.
Results and Conclusion: Our analysis shows that the accuracies of DT, ANN, and SVM are 0.936, 0.947, and 0.957,
respectively. The SVM classification model predicts breast cancer recurrence with the lowest error rate and highest accuracy.
The predicted accuracy of the DT model is the lowest of all. The results were obtained using 10-fold cross-validation to
measure the unbiased prediction accuracy of each model.
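The 10-fold cross-validation mentioned above can be sketched as follows. This is an illustrative outline, not the authors' implementation: it assumes a plain shuffled split without stratification, pools correct predictions across folds, and uses a majority-class baseline as a stand-in for a real classifier. The names `k_fold_indices`, `cross_validated_accuracy`, `fit_majority`, and `predict_majority`, and the toy labels, are hypothetical; only the sample size of 547 comes from the text.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_validated_accuracy(fit, predict, X, y, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, pool the hits."""
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        predictions = predict(model, [X[i] for i in test_idx])
        correct += sum(p == y[i] for p, i in zip(predictions, test_idx))
    return correct / len(X)


# Stand-in classifier (assumption): always predict the majority training label.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, X_test):
    return [model] * len(X_test)


# Toy data for illustration only; 547 matches the number of analyzed cases.
X = [[i] for i in range(547)]
y = [1 if i % 5 == 0 else 0 for i in range(547)]
baseline_accuracy = cross_validated_accuracy(fit_majority, predict_majority, X, y)
print(round(baseline_accuracy, 3))
```

Because every case is held out exactly once, the pooled estimate uses all 547 cases for testing without ever scoring a model on data it was trained on. A real comparison would replace the baseline with the decision tree, SVM, and ANN learners and would likely stratify the folds by outcome.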
Using Three Machine Learning Techniques for Predicting Breast Cancer
Recurrence
Ahmad et al., J Health Med Inform 2013, 4:2
http://dx.doi.org/10.4172/2157-7420.1000124
Ahmad LG*, Eshlaghy AT, Poorebrahimi A, Ebrahimi M and Razavi AR
Department of Management Information Systems, Science and Research Branch, Islamic Azad University of Tehran-Iran, Iran
Citation: Ahmad LG, Eshlaghy AT, Poorebrahimi A, Ebrahimi M, Razavi AR
(2013) Using Three Machine Learning Techniques for Predicting Breast Cancer
Recurrence. J Health Med Inform 4: 124. doi:10.4172/2157-7420.1000124
Copyright: © 2013 Ahmad LG, et al. This is an open-access article distributed
under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the
original author and source are credited.
A literature review showed that there have been several studies on the
survival prediction problem using statistical approaches and artificial
neural networks. However, we could only find a few studies related
to medical diagnosis and recurrence using data mining approaches
such as decision trees [5,6]. Delen et al. used artificial neural networks,
decision trees and logistic regression to develop prediction models for
breast cancer survival by analyzing a large dataset, the SEER cancer
incidence database [6]. Lundin et al. used ANN and logistic regression