https://doi.org/10.1007/s12553-021-00551-9

ORIGINAL PAPER

On the goodness of fit of parametric and non-parametric data mining techniques: the case of malaria incidence thresholds in Uganda

Francis Fuller Bbosa 1,2 · Josephine Nabukenya 2 · Peter Nabende 2 · Ronald Wesonga 3

Received: 22 December 2020 / Accepted: 14 April 2021
© IUPESM and Springer-Verlag GmbH Germany, part of Springer Nature 2021

Abstract
This study sought to identify which class of data mining technique (parametric or non-parametric) best fits the predictions on an imbalanced malaria incidence dataset. The researchers compared parametric techniques, in the form of naïve Bayes and logistic regression, against non-parametric techniques, in the form of support vector machines and artificial neural networks. Goodness of fit and predictive performance were assessed using 10-fold and 5-fold cross-validation on an independent validation dataset to determine which model best fits the predictions on the imbalanced malaria incidence dataset. The 10-fold cross-validation outperformed the 5-fold cross-validation on all performance metrics. The naïve Bayes classifier attained an accuracy of 69%, a sensitivity of 90.9%, a specificity of 55.6%, a precision of 55.6% and an F-measure of 69.0%. Logistic regression achieved an accuracy of 65.5%, a sensitivity of 83.3%, a specificity of 52.9%, a precision of 55.6% and an F-measure of 66.7%. The support vector machines achieved an accuracy of 82.8%, a sensitivity of 88.2%, a specificity of 75.0%, a precision of 83.3% and an F-measure of 85.7%, whereas the artificial neural networks registered an accuracy of 89.7%, a sensitivity of 94.1%, a specificity of 83.3%, a precision of 88.9% and an F-measure of 91.4%.
Non-parametric data mining techniques, in the form of artificial neural networks and support vector machines, outperformed the parametric technique of naïve Bayes in making predictions from the imbalanced malaria incidence dataset, registering higher F-measure values of 91.4% and 85.7%, respectively.

Keywords Data mining · Prediction · Parametric · Non-parametric · Comparison · Malaria

1 Introduction

In the past decade, machine learning models, particularly data mining, have gained the attention of several scholars [1–4] undertaking predictive studies. According to Hagenauer, Omrani and Helbich [5], data mining encompasses several inductive techniques that identify hidden patterns by repeatedly learning from training data and relating a target output attribute to underlying explanatory attributes. The model learned from the training data can then be used to classify or predict previously unknown instances [6, 7]. Agyapong, Hayfron-Acquah and Asante [8] assert that predictive data mining approaches, also known as classification, learn from a training set in which all attributes are already associated with known class labels, and build a model that is used to estimate the unknown values of new attributes [9, 10].

Furthermore, predictive data mining techniques are split into parametric and non-parametric techniques, depending on the nature of the assumptions made about the form of the relationship between the antecedent and consequent attributes [11, 12]. Parametric techniques, in the context of machine learning, assume a finite set of parameters and make underlying assumptions about the structure of the data, whereas non-parametric techniques are more general, since they make no assumptions about the probability distribution of the data [13].
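To make the parametric versus non-parametric distinction concrete, the following minimal Python sketch contrasts the two ideas on a one-dimensional toy problem. It is an illustration under stated assumptions, not the authors' method: the toy data and all function names are hypothetical, the parametric model is a simple Gaussian class-conditional classifier (in the spirit of naïve Bayes), and k-nearest neighbours is used as a stand-in non-parametric technique, whereas the study itself uses support vector machines and artificial neural networks.

```python
import math
from collections import Counter

# Hypothetical 1-D toy data as (feature value, class label); NOT from the paper.
train = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0),
         (6.0, 1), (6.5, 1), (7.0, 1), (7.5, 1)]


def fit_gaussian(data):
    """Parametric: summarise each class by a fixed set of parameters
    (prior, mean, variance), assuming a Gaussian form for the feature."""
    params = {}
    for label in sorted({y for _, y in data}):
        xs = [x for x, y in data if y == label]
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        params[label] = (len(xs) / len(data), mean, var)
    return params


def predict_gaussian(params, x):
    """Choose the class with the highest log prior plus Gaussian log-likelihood."""
    def score(p):
        prior, mean, var = p
        return (math.log(prior)
                - 0.5 * math.log(2 * math.pi * var)
                - (x - mean) ** 2 / (2 * var))
    return max(params, key=lambda label: score(params[label]))


def predict_knn(data, x, k=3):
    """Non-parametric: no distributional form is assumed; the prediction is a
    majority vote among the k training points closest to x."""
    nearest = sorted(data, key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]


params = fit_gaussian(train)
for query in (2.2, 6.8):
    print(query, predict_gaussian(params, query), predict_knn(train, query))
```

In this sketch the parametric model keeps only three numbers per class however many training points are supplied, while the k-nearest-neighbour predictor must retain the training data itself and imposes no distributional form on it.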
* Francis Fuller Bbosa
fullerbbosa@gmail.com

1 School of Statistics and Planning, Makerere University, Kampala, Uganda
2 School of Computing and Informatics Technology, Makerere University, Kampala, Uganda
3 Department of Statistics, College of Science, Sultan Qaboos University, Muscat, Oman

Published online: 21 April 2021
Health and Technology (2021) 11:929–940

Parametric data mining techniques such as linear or multiple regression and naïve Bayes have gained popularity as predictive and heuristic models [11] due to their capability to capture underlying interactions among attributes in data. Whilst these parametric models have conventionally contributed to understanding underlying relationships and assumptions