Prediction of Diabetes Disease using Data Mining Classification Techniques

Shahzad Ali, Department of Computer Science, National Textile University, Faisalabad, Pakistan, Shazadali039@gmail.com
Muhammad Usman, Department of Computer Science, National Textile University, Faisalabad, Pakistan, nahaing44@gmail.com
Dawood Saddique, Department of Computer Science, National Textile University, Faisalabad, Pakistan, dawoodsaddique1997@gmail.com
Umair Maqbool, Department of Computer Science, National Textile University, Faisalabad, Pakistan, umairmaqbool.007@gmail.com
Muhammad Usman Aslam, Department of Computer Science, National Textile University, Faisalabad, Pakistan, usmanaslam402@gmail.com
Shoaib Ejaz, Department of Computer Science, National Textile University, Faisalabad, Pakistan, shoaibejazabc@gmail.com

Abstract— Diabetes is a chronic disease in which the blood sugar (blood glucose) level in the body is above a certain amount. Diabetes Disease (DD) is often known as the silent killer because its symptoms are easy to miss. Gestational diabetes is a type of diabetes that occurs in women during pregnancy and can cause health issues for both the mother and the child. Classifying the DD is essential to improving the quality of life of patients suffering from the disease. The primary objective of this research work is to identify the most dominant feature for the DD and to classify the DD for its early diagnosis. Data mining and machine learning (ML) techniques, including Naive Bayes, Artificial Neural Network (ANN), Decision Tree (DT), Logistic Regression, and Support Vector Machine (SVM), are used to predict the DD. The Pima Indian Diabetes (PID) dataset is used in this experimental investigation, and the performance of the developed models is evaluated using various performance evaluation metrics. The results indicate that the proposed methodology classifies the DD successfully compared with techniques used in the past.
The results also show that the proposed methodology could be used in medical and healthcare centers for the classification and early diagnosis of the DD.

Keywords— Diabetes Disease; Classification; Data mining; Logistic Regression

I. INTRODUCTION

Diabetes has, over the years, become one of the most significant chronic health care problems in today's society. In the diabetic condition, the amount of glucose in the body is above a certain amount. In most industrialized nations, there is substantial evidence that diabetes is the fourth leading cause of death [1]. Diabetes typically occurs when a person's body is not able to respond to insulin or cannot produce the insulin required to maintain the glucose level in the body. Diabetes has different stages, and every stage has its own side effects. DD leads to several other conditions, e.g., high blood pressure, heart disease, blindness, kidney failure, and nerve damage [2].

The data attributes studied in this research describe pregnant women who have diabetes. Pregnant women with insulin-dependent diabetes mellitus have a high risk of developing a chronic disease. The study is carried out to identify the factor that most affects women during pregnancy [3]. Gestational diabetes is a disorder of glucose tolerance that is diagnosed in women during pregnancy. There is no role of insulin in this scenario. This disease plays a major role in health issues throughout the world. Gestational diabetes mellitus (GDM) affects between 1% and 25% of all pregnancies globally [4], and its rate is increasing rapidly. While the high blood glucose of GDM usually resolves after delivery, women with GDM have an increased risk of further episodes of GDM [5] and are seven times more likely to develop type 2 diabetes mellitus [6]. This concept is highlighted by the World Health Organization (WHO) [7].
The work done aimed not only to treat the physical symptoms but also to instill a positive mental state [8].

Machine learning approaches are used to find useful patterns within datasets. With these approaches, the primary goal is to discover knowledge that is not only valid and accurate but also comprehensible and usable for the welfare of society. Medical diagnosis is, in essence, a classification problem. Classification is one of the most widely used data mining and machine learning (ML) techniques in medical and healthcare centers. A wide range of algorithms and techniques is available in data mining and machine learning, particularly among supervised ML techniques. Thus, selecting the most suitable algorithm or technique has been a challenge for investigators implementing DD detection and early diagnosis systems [9].

In this investigation, we apply several data mining and ML classification algorithms to train the models. We compare the performance of the different algorithms and propose the one that achieves the highest accuracy for classifying the type of diabetes. Comparing these results helps us deal with the disease more effectively and improve the quality of life of patients suffering from diabetes. To maximize performance, the correlation between attributes is measured, and the attributes interlinked are