Cancer Disease Prediction with Support Vector
Machine and Random Forest Classification
Techniques
Ashfaq Ahmed K and Sultan Aljahdali
College of Computers and Information Technology
Taif University,
Taif, Saudi Arabia
ashfaqme@gmail.com, aljahdali@tu.edu.sa
Nisar Hundewale and Ishthaq Ahmed K
Department of Computer Science
nisar@computer.org, ishthaq@gmail.com
Abstract—The concepts of classification and learning are well
suited to medical applications, especially those that need complex
diagnostic measurements. Classification techniques can therefore
be used for cancer disease prediction. This approach is of
particular interest because it is part of a growing demand for
predictive diagnosis. The available studies show that
classification and learning methods can be used effectively to
improve the accuracy of predicting a disease and its recurrence.
In the present work, two classification techniques, Support
Vector Machine (SVM) and Random Forest (RF), are used to
learn, classify and compare cancer disease data with varying
kernels and kernel parameters. Results with Support Vector
Machines and Random Forest are compared for different data
sets. The results with different kernels are tuned through proper
parameter selection. Results are analyzed with confusion matrices.
Keywords- Support Vector Machine, Random Forest, Radial
Basis Function, Sigmoid.
I. INTRODUCTION
Cancer disease prediction involves more than one physician
from different specializations. It requires different biomedical
markers and multiple clinical factors such as the age and general
health of the patient, and the location, type, grade and
size of the tumor. Cell-based, patient-based and population-based
information must all be carefully considered by the
attending medical practitioner to arrive at a reasonable
prediction. This is not easy even for the most skilled practitioner.
Both physicians and patients face the same
challenges when it comes to cancer prevention
and cancer prediction. Family history, age, diet, weight, habits
such as smoking and heavy drinking, and exposure to ultraviolet
radiation, radon and asbestos play a major role in predicting an
individual’s risk of developing cancer. These conventional
clinical, behavioral and environmental parameters may not be
sufficient to make better predictions. To predict the disease we
need specific molecular details about either the tumor or
the patient’s genetic status. With the rapid development of
proteomic, genomic and imaging technologies, this
molecular-scale information about patients or tumors can now
be readily acquired.
Many approaches to cancer disease prediction exist, such as
decision trees, expert systems, neural networks and genetic
algorithms. However, little significant work has been
performed to optimize the parameters of the SVM model and
compare the results with another powerful learning
technique.
In [1], several novel algorithms are presented for cancer
disease prediction. The paper establishes that the support
vector approach yields better predictions.
In [2], microarray cancer data sets are used for predicting
the cancer disease with random forest and support vector
machine. It establishes that these techniques yield better results
with a smaller number of genes.
In [3], support vector machines are used to predict the
different levels of cancer growth. It proposes an optimum size
for training sets.
The past studies indicate that support vector
machine and random forest are among the better learning techniques
for cancer disease diagnosis. They also yield much better
results with proper parameter selection and a suitable size of training
data sets.
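To make the role of kernel parameters concrete, the following is a minimal sketch of the two kernels named in the keywords, the radial basis function and the sigmoid, in their standard textbook forms. It is not the implementation used in this work. The parameter names gamma and coef0 are assumptions borrowed from common library conventions, and the two feature vectors are made up for illustration.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(x, y, gamma=0.5, coef0=0.0):
    """Sigmoid kernel: tanh(gamma * <x, y> + coef0)."""
    dot = sum(a * b for a, b in zip(x, y))
    return math.tanh(gamma * dot + coef0)


# Two made-up feature vectors, for illustration only (not from any data set).
u = [1.0, 2.0]
v = [2.0, 0.0]

# A larger gamma makes the RBF similarity fall off faster with distance.
for gamma in (0.01, 0.1, 1.0):
    print(gamma, rbf_kernel(u, v, gamma), sigmoid_kernel(u, v, gamma))
```

Since the same pair of samples can look nearly identical or nearly unrelated depending on gamma, the kernel parameter has to be tuned together with the choice of kernel.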
Section II A discusses the Support Vector Machine
in detail, Section II B the Random Forest, and Section II C
the data set used. Section III A describes the
experimental setup, Section III B the actual
experiments, and Section III C discusses the results
obtained.
II. CLASSIFICATION TECHNIQUES USED
A. Support Vector Machine
The technique of empirical data modeling is applicable to
many engineering applications. In empirical data modeling an
induction process is used to build up a model of the system,
from which responses of the system that are yet to be
tested or observed can be deduced. The data obtained by observation are
finite and sampled. This sampling is non-uniform and,
owing to the high-dimensional nature of the problem, the data
are sparsely distributed in the input space. As a result the
problem is ill-posed.
Neural network approaches have suffered from problems with
generalization, producing models that overfit the data.
978-1-4673-0892-2/12/$31.00 ©2012 IEEE 16 CyberneticsCom 2012