https://doi.org/10.1007/s12553-021-00551-9
ORIGINAL PAPER
On the goodness of fit of parametric and non‑parametric data mining
techniques: the case of malaria incidence thresholds in Uganda
Francis Fuller Bbosa¹,² · Josephine Nabukenya² · Peter Nabende² · Ronald Wesonga³
Received: 22 December 2020 / Accepted: 14 April 2021
© IUPESM and Springer-Verlag GmbH Germany, part of Springer Nature 2021
Abstract
This study sought to identify which data mining technique (parametric or non-parametric) best fits the predictions on an
imbalanced malaria incidence dataset. The researchers compared parametric techniques, in the form of naïve Bayes and
logistic regression, against non-parametric techniques, in the form of support vector machines and artificial neural
networks. Goodness of fit and predictive performance were assessed using 10-fold and 5-fold cross-validation on an
independent validation dataset to determine which model best fits the predictions on the imbalanced malaria incidence
dataset. The 10-fold cross-validation outperformed the 5-fold cross-validation on all performance metrics. With 10-fold
cross-validation, the naïve Bayes classifier attained an accuracy of 69% with a sensitivity of 90.9%, a specificity of
55.6%, a precision of 55.6% and an F-measure score of 69.0%; logistic regression achieved an accuracy of 65.5% with a
sensitivity of 83.3%, a specificity of 52.9%, a precision of 55.6% and an F-measure score of 66.7%; the support vector
machines achieved an accuracy of 82.8% with a sensitivity of 88.2%, a specificity of 75.0%, a precision of 83.3% and an
F-measure score of 85.7%; and the artificial neural networks registered an accuracy of 89.7% with a sensitivity of 94.1%,
a specificity of 83.3%, a precision of 88.9% and an F-measure score of 91.4%. Non-parametric data mining techniques, in
the form of artificial neural networks and support vector machines, outperformed the parametric data mining technique in
the form of naïve Bayes in making predictions from the imbalanced malaria incidence dataset, registering higher F-measure
values of 91.4% and 85.7% respectively.
Keywords Data mining · Prediction · Parametric · Non-parametric · Comparison · Malaria
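The F-measure values reported in the abstract can be checked against the reported precision and sensitivity (recall) figures. The sketch below assumes that the F-measure is the balanced F1 score, that is, the harmonic mean of precision and recall; the abstract does not state the formula, and the function and variable names are illustrative only.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F1 score; assumed definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, sensitivity) pairs in percent, copied from the abstract.
reported = {
    "naive Bayes": (55.6, 90.9),
    "logistic regression": (55.6, 83.3),
    "support vector machines": (83.3, 88.2),
    "artificial neural networks": (88.9, 94.1),
}

# Recompute each F-measure and round to one decimal place, as in the abstract.
scores = {name: round(f_measure(p, r), 1) for name, (p, r) in reported.items()}

for name, score in scores.items():
    print(f"{name}: F-measure = {score}%")
```

Under this assumption the recomputed values are 69.0%, 66.7%, 85.7% and 91.4%, which agree with the figures quoted in the abstract.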
1 Introduction
In the past decade, machine learning models, particularly
data mining techniques, have gained the attention of several
scholars [1–4] undertaking predictive studies. According to
Hagenauer, Omrani and Helbich [5], data mining encompasses
several inductive techniques that identify hidden patterns
by repeatedly learning from training data and relating
a target output attribute to underlying explanatory attributes.
The model learned from the training data can then be used
to classify or predict previously unknown instances [6, 7].
Agyapong, Hayfron-Acquah and Asante [8] assert that predictive
data mining approaches, also known as classification,
learn from the training set, where all attributes are already
associated with known class labels, and build a model that
is used to estimate unknown values of new attributes [9, 10].
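The learn-then-predict workflow described above can be illustrated with a minimal sketch. It uses a nearest-centroid rule purely as a simple stand-in; it is not one of the techniques compared in this paper, and the `fit` and `predict` helpers, the feature values and the class labels are all hypothetical and made up for illustration.

```python
from collections import defaultdict
from math import dist


def fit(training_rows, labels):
    """Learn one centroid (mean feature vector) per class label."""
    grouped = defaultdict(list)
    for row, label in zip(training_rows, labels):
        grouped[label].append(row)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def predict(model, row):
    """Assign a new instance to the class whose centroid is closest."""
    return min(model, key=lambda label: dist(model[label], row))


# Made-up toy training set: two numeric attributes per instance, with known labels.
X_train = [(0.9, 0.8), (1.0, 0.7), (0.2, 0.1), (0.1, 0.3)]
y_train = ["high", "high", "low", "low"]

model = fit(X_train, y_train)      # learning step on labelled training data
print(predict(model, (0.8, 0.9)))  # classify a previously unseen instance
```

Any classifier with a training step and a prediction step follows the same two-phase pattern; only the form of the learned model changes.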
Furthermore, predictive data mining techniques are split
into parametric and non-parametric techniques, depending on
the nature of the assumptions made about the form of the
relationship between the antecedent and consequent attributes
[11, 12]. In the context of machine learning, parametric
techniques assume a finite set of parameters and make underlying
assumptions about the data structure, whereas non-parametric
techniques are more general, since they make no assumptions
about the probability distribution of the data [13]. Parametric
data mining techniques such as linear or multiple regression
and naïve Bayes have gained popularity as predictive and
heuristic models [11] due to their capability to capture
underlying interactions among attributes in data. Whilst
these parametric models have conventionally contributed
to understanding underlying relationships and assumptions
* Francis Fuller Bbosa
fullerbbosa@gmail.com
¹ School of Statistics and Planning, Makerere University, Kampala, Uganda
² School of Computing and Informatics Technology, Makerere University, Kampala, Uganda
³ Department of Statistics, College of Science, Sultan Qaboos University, Muscat, Oman
Published online: 21 April 2021
Health and Technology (2021) 11:929–940