YOLANDA D. AUSTRIA et al: COMPARISON OF MACHINE LEARNING ALGORITHMS IN BREAST CANCER . .
DOI 10.5013/IJSSST.a.20.S2.23
ISSN: 1473-804x online, 1473-8031 print

Comparison of Machine Learning Algorithms in Breast Cancer Prediction using the Coimbra Dataset

Yolanda D. Austria 1, Jay-ar P. Lalata 2, Lorenzo B. Sta. Maria, Jr. 3, Joselito Eduard E. Goh 4, Marie Luvett I. Goh 2, Heintjie N. Vicente 2

1 Adamson University, San Marcelino St. Ermita, Manila, Philippines.
2 FEU Institute of Technology, P. Paredes St. Sampaloc, Manila, Philippines.
3 Asian Institute of Management, Paseo de Roxas, Legazpi Village, Makati, Philippines.
4 De La Salle – College of St. Benilde, Taft Ave., Malate, Manila, Philippines.

yolanda.austria@adamson.edu.ph; jayar_030181@yahoo.com; lorenzo_stamaria@yahoo.com; joedgoh@gmail.com; luvett.goh@gmail.com; hnvicente@feutech.edu.ph

Abstract - Machine learning (ML) techniques are playing a significant and growing role in the medical field because of their high potential to help health practitioners make decisions and diagnoses. This study reviews ML models that may predict breast cancer in women and compares their performance. Several clinical features were measured among the 116 participants in the dataset used in this study: insulin, glucose, resistin, adiponectin, homeostasis model assessment (HOMA), leptin, and monocyte chemoattractant protein-1 (MCP-1), along with age and body mass index (BMI). The researchers implemented 11 classification algorithms and their variations, including Logistic Regression (LR), k-Nearest Neighbor (kNN), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), Gradient Boosting Method (GBM), and Naive Bayes (NB), to detect breast cancer on the publicly available Coimbra Breast Cancer Dataset (CBCD). Each classifier uses its own hyper-parameter setting to perform prediction, and the classifiers' performances in identifying breast cancer are compared.
This study concludes that the Gradient Boosting (GB) machine learning algorithm is the best classifier for predicting breast cancer on the Coimbra Breast Cancer Dataset (CBCD), with an accuracy of 74.14%. The k-Nearest Neighbor (kNN) classifier has the fastest training time at 0.000598 seconds, while the Nonlinear Support Vector Machine (SVM) classifier has the fastest testing time, reported as 0 seconds. The study also finds that body mass index (BMI) is the top predictor, with 50% of the classifiers ranking it first, and that glucose comes second. This suggests that BMI and glucose may be a good pair of variables for predicting breast cancer in women.

Keywords - breast cancer, machine learning algorithm, classifier, Logistic Regression (LR), k-Nearest Neighbor (kNN), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), Gradient Boosting Method (GBM), Naive Bayes (NB)

I. INTRODUCTION

According to the World Health Organization (WHO), cancer was one of the primary causes of death worldwide in 2018, with an estimated 9.6 million deaths. Breast cancer was the second most prevalent cancer after lung cancer, with 2.09 million cases. It was also the fifth most common cause of cancer death, with approximately 627,000 deaths, an estimated 15% of all cancer deaths among women [1]. Breast cancer alone accounts for 30% of all new cancer diagnoses in women [2].

At present, X-ray mammography is the only procedure capable of detecting breast cancer at an early stage, before the cancer becomes evident. It is also the basis of most systematized breast screening programs for detecting breast cancer in an asymptomatic population. To detect breast cancer successfully in its beginning phase, however, mammography must sufficiently differentiate small masses and micro-calcifications, which in principle produce only subtle contrast differences in mammography images [3].
Although mammography is currently the widely used standard screening process for breast cancer, the incidence of incorrectly classified mammograms remains an area for improvement in breast cancer prediction. Thus, the challenge remains to discover effective predictors that can be obtained through cheap and easily accessible methods. Bodily parameters, such as those obtainable from blood samples, may provide alternative ways to better diagnose breast cancer among women [4].

Several recent studies have examined alternative ways of detecting breast cancer, specifically non-invasive ones. Exhaled breath and urine analysis, for instance, were used in a study on the non-invasive early detection of breast cancer using an Artificial Neural Network (ANN) model [5]. Another paper concluded that the combination of age, body mass index (BMI), and metabolic parameters is a potential inexpensive and effective predictor of breast cancer [6].