International Journal of Computer Applications (0975 – 8887)
Volume 76 – No.10, August 2013

Chemotherapy Prediction of Cancer Patient by using Data Mining Techniques

Reeti Yadav
Invertis University, Bareilly

Zubair Khan
Invertis University, Bareilly

Hina Saxena
Invertis University, Bareilly

ABSTRACT
Breast cancer is one of the prominent diseases for women in developed countries including India. It is the second most frequent cause of death in women. The identification of breast cancer patients for whom chemotherapy could prolong survival time is considered here as a data mining problem. We propose a procedure that uses support vector machines (SVMs) and decision trees to classify 100 breast cancer patients into two classes, which are the two types of breast cancer diseases. The performance of the two classification techniques is then compared to find the better one, which is used in the next stage, i.e. clustering. The identification is achieved by clustering the above two classes into three prognostic groups, Good, Intermediate and Poor, with the help of the K-Means clustering technique. The results suggest that the patients in the Good group do not require chemotherapy. Chemotherapy is of limited importance in the Intermediate group, while the Poor group is the most crucial group, where chemotherapy can possibly enhance survival.

General Terms
Chemotherapy, Classification, Cancer.

Keywords
Clustering, SVM, decision tree, k-means, classification, diagnosis, data mining.

1. INTRODUCTION
With the ever increasing growth in science and technology, the quality of human life is improving day by day, and health has become a major concern for everyone. Breast cancer has become the primary cause of death in women in developed countries. Today, about one in eight women in the United States is affected by breast cancer over her lifetime.
About 5-10% of cancers are due to an abnormality inherited from the parents, and about 90% of breast cancers are due to genetic abnormalities that happen as a result of the aging process [1]. The most effective way to reduce breast cancer deaths is to detect the disease early. Early diagnosis requires an accurate and reliable diagnosis procedure that allows physicians to distinguish benign breast tumors from malignant ones without resorting to surgical biopsy [2] [3]. The objective of these predictions is to assign patients to either a “benign” group that is noncancerous or a “malignant” group that is cancerous.

The prognosis problem concerns the long-term outlook for the disease in patients whose cancer has been surgically removed. In this problem a patient is classified as a ‘recur’ if the disease is observed at some time subsequent to tumor excision; the other class comprises patients for whom cancer has not recurred and may never recur. The aim of these predictions is to handle cases for which cancer has not recurred (censored data) as well as cases for which cancer has recurred at a specific time.

With the use of computers equipped with automated tools, large volumes of medical data are being collected and made available to medical research groups. As a result, Knowledge Discovery in Databases (KDD), which includes data mining techniques, has become a popular research tool for medical researchers. It lets them identify and exploit patterns and relationships among a large number of variables, and enables them to predict the outcome of a disease using the historical cases stored within datasets. Thus breast cancer diagnostic and prognostic problems fall mainly within the scope of the widely discussed classification problems.

2. DATA MINING
Data mining consists of various methods.
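
As a rough illustration of the clustering stage named in the abstract (grouping patients into three prognostic groups with K-Means), the following is a minimal pure-Python sketch of the standard K-Means (Lloyd's) iteration. It is not the authors' implementation: the function name `k_means`, the two features (tumour size and number of positive lymph nodes), the patient values, and the deterministic initialisation are all assumptions made for this example.

```python
import math


def k_means(points, k, max_iter=100):
    """Cluster `points` (tuples of numbers) into k groups with Lloyd's algorithm.

    Returns (centroids, labels), where labels[i] is the cluster index of points[i].
    """
    # Assumption for reproducibility: instead of random seeding, pick k
    # initial centroids spread evenly through the sorted data.
    ordered = sorted(points)
    step = len(ordered) // k
    centroids = [ordered[i * step + step // 2] for i in range(k)]

    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, labels


# Hypothetical patients: (tumour size in cm, number of positive lymph nodes).
patients = [
    (1.0, 0), (1.2, 0), (1.5, 1),
    (2.8, 3), (3.0, 4), (3.2, 5),
    (5.0, 10), (5.5, 12), (6.0, 11),
]

centroids, labels = k_means(patients, 3)
for patient, label in zip(patients, labels):
    print(patient, "-> cluster", label)
```

The sketch only returns cluster indices. Deciding which cluster corresponds to the Good, Intermediate or Poor group would require outcome data such as survival times, which this example does not model.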
Different methods serve different purposes, each method offering its own advantages and disadvantages. Classification and clustering are the two most common data mining techniques used in the field of medical science [4]. However, most data mining methods commonly used in this review belong to the classification category, as the applied prediction techniques assign patients to either a “benign” group that is non-cancerous or a “malignant” group that is cancerous and generate rules for the same. Hence, breast cancer diagnostic problems fall basically within the scope of the widely discussed classification problems.

In data mining, classification is one of the most important tasks. It maps the data into predefined targets. It is a form of supervised learning, since the targets are predefined. The aim of classification is to build a classifier based on some cases with some attributes to describe the objects or one attribute to describe the group of the objects. The classifier is then used to predict the group attributes of new cases from the domain based on the values of the other attributes. The commonly used methods for data mining classification tasks can be classified into the following groups.

3. RELATED WORK
In this section, we review the related work on breast cancer diagnosis using data mining techniques. In [5], a neural network approach is adopted to classify the medical data set. The neural network is trained on a breast cancer database using a feed-forward neural network model and the back-propagation learning algorithm with momentum and a variable learning rate. The performance of the network is evaluated. The experimental results show that applying a parallel approach in the neural network model yields efficient results. In