International Journal of Computer Applications (0975 – 8887)
Volume 182 – No. 38, January 2019

Model for Predicting the Risk of Kidney Stone using Data Mining Techniques

Oladeji F. A., University of Lagos, Department of Computer Sciences
Idowu P. A., Obafemi Awolowo University, Department of Computer Science and Engineering
Egejuru N., Obafemi Awolowo University, Department of Computer Science and Engineering
Faluyi S. G., Tai Solarin University of Education, Ijagun, Ogun State
Balogun J. A., Obafemi Awolowo University, Dept. of Computer Science and Engineering

ABSTRACT
This paper focused on the development of a predictive model for classifying the risk of kidney stones in Nigeria using data mining techniques, based on historical information elicited about the risk of kidney stones among Nigerians. Following the identification of the risk factors of kidney stones by experienced endocrinologists, structured questionnaires were used to collect information about the risk factors and the associated risk of kidney stones from selected respondents. The predictive model for the risk of kidney stones was formulated using three (3) supervised machine learning algorithms (Decision Tree, Multi-layer Perceptron and Genetic Algorithm) following the identification of relevant features. The predictive model was simulated using the Waikato Environment for Knowledge Analysis (WEKA), and the model was validated using a historical dataset of kidney stone risk via four performance metrics: accuracy, true positive rate, precision and false positive rate. The paper concluded that the multi-layer perceptron had the best overall performance, with an accuracy of 100% using the 33 variables initially identified by the endocrinologists.
The genetic programming and multi-layer perceptron models for the risk of kidney stones formulated using the 6 variables outperformed the model formulated using the 6 variables identified by the C4.5 decision trees. The variables identified by the C4.5 decision trees algorithm were: obese from childhood, eating late at night, BMI class, family history of hypertension, taking coffee and sweating daily. In conclusion, the multi-layer perceptron algorithm is best suited to the development of a predictive model for the risk of kidney stones.

Keywords
Kidney Stone Risk Factors, C4.5, Prediction, Classification, Decision Trees, Genetic Algorithms, Multilayer Perceptron

1. INTRODUCTION
Predictive analytics is a branch of data mining concerned with the analysis of data to identify underlying trends, patterns, or relationships in order to predict future probabilities and trends [1]. It encompasses techniques from statistics, data mining and game theory that analyze current and historical facts to make predictions about future events of interest [2]. In predictive modeling, data is collected, a statistical model is formulated, predictions are made, and the model is validated or revised as additional data becomes available [3]. Clinical data mining is based on strategic research to retrieve, analyze and interpret both qualitative and quantitative information available from medical datasets or records [4]. Predictive data mining automatically creates a classification model from a training dataset and applies that model to automatically predict the classes of unclassified datasets [5]. Predictive data mining deals with learning models to support clinicians in diagnostic, therapeutic, or monitoring tasks [6]. It learns from past experience and applies the knowledge gained to future situations [7], by applying machine learning methods to build multivariate models from clinical data and subsequently make inferences on unknown data [8].
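The train-then-apply workflow described above (learn a classification model from labelled records, then use it to classify unlabelled ones) can be sketched in a few lines. This is only an illustrative one-feature rule learner, not any of the algorithms used in this paper; the function names, the `bmi_class` field name, and the toy records are all assumptions made up for the example and are not data from the study.

```python
from collections import Counter

def train_stump(records, labels, feature):
    """Learn a one-feature rule: map each observed value of `feature`
    to the majority class among training records with that value."""
    counts_by_value = {}
    for record, label in zip(records, labels):
        counts_by_value.setdefault(record[feature], Counter())[label] += 1
    rule = {value: counts.most_common(1)[0][0]
            for value, counts in counts_by_value.items()}
    # Fall back to the overall majority class for values never seen in training.
    default = Counter(labels).most_common(1)[0][0]
    return rule, default

def predict(model, record, feature):
    """Apply the learned rule to one unclassified record."""
    rule, default = model
    return rule.get(record[feature], default)

# Hypothetical toy training set (illustrative values only, not study data).
train_records = [
    {"bmi_class": "obese"},
    {"bmi_class": "obese"},
    {"bmi_class": "normal"},
    {"bmi_class": "normal"},
    {"bmi_class": "overweight"},
]
train_labels = ["high", "high", "low", "low", "high"]

model = train_stump(train_records, train_labels, "bmi_class")
seen_prediction = predict(model, {"bmi_class": "normal"}, "bmi_class")
unseen_prediction = predict(model, {"bmi_class": "underweight"}, "bmi_class")
```

A real predictive model would use many features and a stronger learner; the point of the sketch is only the separation between a training step that sees labels and a prediction step that does not.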
Machine learning models of this kind exploit supervised classification approaches. Prior to applying the learning model, the data is pre-processed to remove noise and to ensure that data mining principles are applied to real data [9]. Predictive data mining is the most common type of data mining, with the widest application in business and real life; it is centered on data pre-processing, data mining and data post-processing, collectively referred to as Knowledge Discovery in Databases [7, 10, 11]. Examples include the prediction of surgery outcome, breast cancer survival and coronary heart disease risk from variables such as age, sex, smoking status, hypertension and various biomarkers [12, 13, 14, 15]. [16] compared rule-based Repeated Incremental Pruning to Produce Error Reduction (RIPPER), Decision Tree (DT), Artificial Neural Networks (ANN) and Support Vector Machine (SVM) on the basis of Sensitivity, Specificity, Accuracy, Error Rate, and False Positive Rate, using 10-fold cross validation to obtain an unbiased estimate of the performance of these prediction models. [17] demonstrated how to implement an evidence-based clinical expert system using a Bayesian model to detect coronary artery disease. The Bayesian model was considered to have a considerable advantage in dealing with several missing variables compared to logistic and linear regression models. In the diagnosis of Asthma with an expert system, [18] performed a comparative analysis of machine learning algorithms such as Auto-associative Memory Neural Networks (AMNN), Bayesian networks, ID3 and C4.5, and found AMNN to perform best in terms of algorithm efficiency and accuracy of disease diagnosis.
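The evaluation measures named above (sensitivity, specificity, accuracy, error rate and false positive rate) all follow from the four cells of a binary confusion matrix, and 10-fold cross validation amounts to partitioning the records so that each one is held out for testing exactly once. The sketch below uses the standard definitions of these measures; the function names, the simple interleaved fold assignment, and the toy label vectors are assumptions for illustration and do not reproduce the procedure or data of any cited study.

```python
def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def evaluate(actual, predicted, positive):
    """Compute the standard binary classification measures."""
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive)

    def ratio(numerator, denominator):
        # Guard against empty classes to avoid division by zero.
        return numerator / denominator if denominator else 0.0

    total = tp + fp + tn + fn
    return {
        "sensitivity": ratio(tp, tp + fn),          # true positive rate
        "specificity": ratio(tn, tn + fp),          # true negative rate
        "accuracy": ratio(tp + tn, total),
        "error_rate": ratio(fp + fn, total),
        "false_positive_rate": ratio(fp, fp + tn),
    }

def k_fold_indices(n, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation.
    Records are assigned to folds in an interleaved fashion; in practice the
    data would usually be shuffled (and often stratified) first."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# Hypothetical labels (illustrative only, not study data).
actual = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]
predicted = ["yes", "yes", "no", "no", "no", "yes", "no", "yes"]
metrics = evaluate(actual, predicted, positive="yes")
splits = list(k_fold_indices(20, k=10))
```

In a cross-validated evaluation, the model would be trained on each `train` index set, the predictions for the matching `test` set collected, and the measures computed over the pooled or averaged results.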
In a study of Phospholipidosis, [19] used structure-activity relationships (SAR) to compare k-NN, DT, SVM and artificial immune system algorithms trained to identify drugs with Phospholipidosis potential; SVM produced the best predictions, followed by a Multilayer Perceptron artificial neural network, logistic regression, and k-NN. In the diagnosis of Chronic Obstructive Pulmonary and