Diabetes Mellitus Prediction using Different Ensemble Machine Learning Approaches

Md. Tanvir Islam 1, M. Raihan 2, Nasrin Aktar 3, Md. Shahabub Alam 4, Romana Rahman Ema 5 and Tajul Islam 6
Department of Computer Science and Engineering, North Western University, Khulna, Bangladesh 1-3,5,6
Khulna University of Engineering & Technology, Khulna, Bangladesh 1,5,6
Ahsanullah University of Science and Technology, Dhaka, Bangladesh 4
Emails: tanvirislamnwu@gmail.com 1, mraihan@nwu.edu.bd 2, raihanbme@gmail.com 2, nasrinlipinwu@gmail.com 3, nabid.aust37@gmail.com 4, romanacsejstu@gmail.com 5 and tajulkuet09@gmail.com 6

Abstract—Diabetes Mellitus is one of the most rapidly growing diseases and one of the biggest contributors to morbidity and mortality worldwide. It is a group of metabolic disorders defined by a high blood glucose level over a prolonged period. Although the disease is commonly regarded as hereditary, many people suffer from it without any family history. If diabetes is not kept under control, the glucose level rises and may damage small blood vessels in the human body, most often in the nerves, feet and eyes, and even in the heart and kidneys. To avoid these complications, it is crucial to predict diabetes at an early stage. Hence, we have carried out research on diabetes prediction using Machine Learning algorithms. In this study, we have used three popular Machine Learning algorithms: AdaBoost, Bagging and Random Forest. To train and test the algorithms, we have collected real-time information from both diabetic and non-diabetic people. The dataset contains 464 instances with 22 unique risk factors. Among the three algorithms, AdaBoost achieved 97.84% accuracy, Bagging 98.28% and Random Forest 99.35% in predicting diabetes.

Keywords—Diabetes Mellitus, Machine Learning, Classification, Prediction, AdaBoost, Bagging, Random Forest.

I. INTRODUCTION

Diabetes Mellitus (DM), generally known as diabetes, prevents the human body from properly obtaining energy from the food we eat. It is a chronic condition associated with an unusually high level of glucose in the blood. The pancreas produces insulin, which lowers the glucose level. Insufficient production of insulin, or an inability of the body to use insulin properly, causes diabetes [1]. DM is one of the fastest-spreading diseases in the world today. According to statistics, by the end of 2017 approximately 425 million people aged 20 to 79 years had diabetes, and this number is estimated to rise to 629 million by 2045 [2]. By 2015, 30.1 million Americans, or 9.4% of the population, were affected by diabetes, 1.25 million of them children [3]. Every year, 1.5 million newly affected Americans join this list. Among adults in the top five South-East Asian countries, Bangladesh ranked second, with 5.2 million DM patients in 2013. This number is estimated to rise to 8.20 million in 2035 [4]. It is therefore clear that DM has become a universal problem and that it is high time to find the best practical solution.

Machine Learning (ML) is a field of data mining and the study of algorithms in which problems of this type can be solved using algorithms and sample datasets [5]. The aim of our work is to analyze diabetes patients' data in order to recognize diabetes accurately using three ML algorithms: AdaBoost, Bagging and Random Forest (RF).

II. RELATED WORKS

Ayman Mir et al. [6] performed an analysis to predict diabetes using ML techniques on healthcare big data. They used several ML algorithms, namely Naïve Bayes, Support Vector Machine (SVM), Random Forest and Simple CART. The dataset contains 9 attributes with both numerical and nominal values. The obtained accuracy is 77% for Naïve Bayes, 79.13% for SVM, 76.5% for RF and 76.5% for Simple CART.
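The accuracy figures quoted in this section are, under the usual definition, the percentage of evaluated instances whose predicted label matches the true label. A minimal Python sketch of that metric is given here for reference; the function name `accuracy_percent` and the toy labels are illustrative assumptions and are not taken from any of the cited studies.

```python
from typing import Sequence


def accuracy_percent(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Return the percentage of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one instance is required")
    # Count the positions where the predicted label equals the true label.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)


# Toy labels (hypothetical): 1 = diabetic, 0 = non-diabetic.
true_labels = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
print(accuracy_percent(true_labels, predicted))  # prints 75.0
```

Read this way, a reported accuracy of 79.13% means that roughly 79 of every 100 evaluated instances were classified correctly.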
Another study was performed to identify the critical features for predicting diabetes. The algorithms used in this study are Logistic Regression (LR), SVM and RF. The researchers found RF to be the best algorithm for predicting diabetes, with 84% accuracy [7]. Similarly, an analysis based on ML algorithms used SVM, AdaBoost, Bagging, K-NN and RF on a dataset of 506 instances and 30 features. The analysts obtained 75.49% accuracy for AdaBoost, 76.28% for Bagging, 72.33% for K-NN, 75.30% for RF and 72.72% for SVM [8].

Durga Kinge et al. conducted an analysis to determine the performance of several algorithms, namely Decision Tree (J48), Naïve Bayes, RF, AdaBoost, Bagging, Multilayer Perceptron (MLP) and Simple Logistic, for predicting diseases using data mining and ML techniques. A heart disease dataset with 303 instances and 74 raw attributes was taken, of which only 14 significant features were used. The accuracies of J48, Naïve Bayes, RF, AdaBoost, Bagging, MLP and Simple Logistic are 78.15%, 82.59%, 83.15%, 81.59%, 81.59%, 79.41% and 83.1% respectively [9].

Soumayadeep Manna et al. conducted research to predict the important factors that cause diabetes. They used a dataset of 3075 instances, each with 8 factors. They applied LR and RF; RF gave 86.70% accuracy and LR gave 89.17% accuracy [10]. Deepika Verma and Nidhi Mishra conducted a study to identify DM by applying Naïve Bayes, J48, Sequential Minimal Optimization (SMO), MLP and Reduced Error Pruning Tree (REP-tree) to a dataset, and found that SMO gave 76.80% accuracy on the diabetes dataset [11]. Another research team developed a system using

IEEE - 49239, 11th ICCCNT 2020, July 1-3, 2020, IIT Kharagpur, Kharagpur, India