Cancer Disease Prediction with Support Vector Machine and Random Forest Classification Techniques

Ashfaq Ahmed K and Sultan Aljahdali
College of Computers and Information Technology
Taif University, Taif, Saudi Arabia
ashfaqme@gmail.com, aljahdali@tu.edu.sa

Nisar Hundewale and Ishthaq Ahmed K
Department of Computer Science
nisar@computer.org, ishthaq@gmail.com

Abstract: Classification and learning are well suited to medical applications, especially those that require complex diagnostic measurements, so classification techniques can be used for cancer disease prediction. This approach is of interest because it is part of a growing demand for predictive diagnosis. Available studies show that classification and learning methods can be used effectively to improve the accuracy of predicting a disease and its recurrence. In the present work, two classification techniques, Support Vector Machine (SVM) and Random Forest (RF), are used to learn, classify and compare cancer disease data, with the SVM evaluated under varying kernels and kernel parameters. Results with SVM and RF are compared for different data sets. The results with different kernels are tuned through proper parameter selection and are analyzed with a confusion matrix.

Keywords: Support Vector Machine, Random Forest, Radial Basis Function, Sigmoid.

I. INTRODUCTION

Cancer disease prediction involves more than one physician from different specializations. It requires different biomedical markers and multiple clinical factors such as the age and general health of the patient, the type and location of the cancer, and the grade and size of the tumor. Cell-based, patient-based and population-based information must all be carefully considered by the attending medical practitioner to arrive at a reasonable prediction. This is not easy even for the most skilled technician.
Both physicians and patients face the same challenges when it comes to cancer prevention and cancer prediction. Family history, age, diet, weight, habits such as smoking and heavy drinking, and exposure to ultraviolet radiation, radon and asbestos play a major role in predicting an individual's risk of developing cancer. These conventional clinical, behavioral and environmental parameters may not be sufficient to make better predictions. To predict the disease, specific molecular details about either the tumor or the patient's genetic status are needed. With the rapid development of proteomic, genomic and imaging technologies, this molecular-scale information about patients or tumors can now be readily acquired.

Many approaches to cancer disease prediction exist, such as decision trees, expert systems, neural networks and genetic algorithms. However, little significant work has been done to optimize the parameters of the SVM model and to compare its results with those of another powerful learning technique. In [1], different novel algorithms are presented for cancer disease prediction; the paper establishes that the support vector concept is good for better predictions. In [2], microarray cancer data sets are used to predict cancer with random forest and support vector machine; it establishes that these techniques yield better results with a smaller number of genes. In [3], support vector machines are used to predict the different levels of cancer growth, and an optimum size for training sets is proposed. It is evident from these past studies that support vector machine and random forest are the better learning techniques for cancer disease diagnosis, and that they yield much better results with proper parameter selection and training set size.
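Since the results in this work are analyzed with a confusion matrix, a minimal sketch of how such a matrix, and the accuracy derived from it, might be computed for a two-class task is given here. The class labels, the function names and the toy predictions are illustrative assumptions; they are not data or code from the experiments reported in this paper.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, positive="malignant"):
    """Return (tp, fp, fn, tn) for a two-class problem.

    `positive` names the class treated as the positive outcome;
    every other label is counted as negative.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            counts["tp" if pred == positive else "fn"] += 1
        else:
            counts["fp" if pred == positive else "tn"] += 1
    return counts["tp"], counts["fp"], counts["fn"], counts["tn"]


def accuracy(tp, fp, fn, tn):
    """Fraction of all samples that were classified correctly."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


# Toy ground truth and hypothetical predictions from two classifiers.
y_true = ["malignant", "benign", "malignant", "benign", "benign", "malignant"]
svm_pred = ["malignant", "benign", "benign", "benign", "benign", "malignant"]
rf_pred = ["malignant", "malignant", "malignant", "benign", "benign", "malignant"]

for name, pred in (("SVM", svm_pred), ("RF", rf_pred)):
    tp, fp, fn, tn = confusion_matrix(y_true, pred)
    print(f"{name}: TP={tp} FP={fp} FN={fn} TN={tn} "
          f"accuracy={accuracy(tp, fp, fn, tn):.3f}")
```

In this toy case both classifiers reach the same accuracy while making different kinds of error (one missed malignant case versus one false alarm), which is the kind of distinction a confusion matrix exposes and a single accuracy figure hides.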
Section II-A discusses the Support Vector Machine in detail, Section II-B the Random Forest, and Section II-C the data set used. Section III-A describes the experimental setup, Section III-B the actual experiments, and Section III-C discusses the results obtained.

II. CLASSIFICATION TECHNIQUES USED

A. Support Vector Machine

Empirical data modeling is applicable to many engineering applications. In empirical data modeling, an induction process is used to build a model of the system, from which responses of the system that have yet to be observed can be deduced. The observational data obtained is finite and sampled. This sampling is non-uniform and, owing to the high-dimensional nature of the problem, the data form only a sparse distribution in the input space. As a result, the problem is ill-posed. Neural network approaches have suffered from difficulties with generalization, producing models that overfit the data.

978-1-4673-0892-2/12/$31.00 ©2012 IEEE 16 CyberneticsCom 2012