DOI: http://dx.doi.org/10.26483/ijarcs.v12i2.6696
Volume 12, No. 2, March-April 2021
International Journal of Advanced Research in Computer Science
RESEARCH PAPER
Available Online at www.ijarcs.info
© 2020-2022, IJARCS All Rights Reserved
ISSN No. 0976-5697
WELL-CALIBRATED PROBABILISTIC MACHINE LEARNING CLASSIFIERS
FOR MULTIVARIATE HEALTHCARE DATA
Akram Pasha
School of Computer Science and Engineering
REVA University
Bengaluru, India
Latha P. H.
Department of Information Science and Engineering
Sambhram Institute of Technology
Bengaluru, India
Abstract: Healthcare applications frequently collect and store patient data (mostly multivariate) to examine the history of treatment
and thereby enhance its effectiveness. Efficient treatment of the patient depends on the performance of the machine learning
models used to analyze patient data. For any disease classification problem, it is useful for the classification model in a healthcare
application to predict the probability of an observation belonging to each possible class rather than predicting a class value directly.
Such predicted probabilities must be calibrated if they are to support confidence in the machine learning classification models
used in healthcare applications. In this paper, the predicted probabilities are studied to diagnose and improve
the calibration of models used for probabilistic classification. The general performance of selected classification models on two recent wart
skin disease treatment data sets is also reported.
Keywords: Data Mining, Machine Learning, Classification, Data Analytics, Calibration of Classifiers, Healthcare Systems.
I. INTRODUCTION
In the current era of big data, technological
advancements are boosting the effectiveness of healthcare
applications. Today, doctors are well equipped with the results
of advanced analytics performed on the history of patient
records to serve patients effectively. The electronic
information about patients provided to doctors must be
increased to enhance the overall effectiveness of the treatment
given to patients. Access to the important
patterns in patients’ data could then become a routine part of the job
for any disease diagnostic expert. Diagnostic experts would
find it useful to understand a patient’s disease risks
through the patterns found in readings,
laboratory test results, race, gender, case history, and
socioeconomic standing. Data
analytics has already contributed in various ways to understanding
and analyzing healthcare data [1-3]. It has proven to
be an effective approach to enhancing medical treatment
for patients [4], giving clinicians a clear advantage
in improving the quality of their expert decisions during
patient diagnosis. Consequently, it has contributed to the speedy
recovery of patients with cost-effective treatment [1-4].
Machine learning is a driving force behind data
analytics and is powerful in analyzing massive
data sets that are beyond normal human capability for
analysis [7-9]. Machine learning can
convert analytical results into information suitable
for physicians to gain clinical insights that aid them in
designing and providing enhanced health care for patients.
The important applications of the proposed study are
threefold: it aids statisticians in exploring the behavior of
probabilistic classification models on multivariate data; it
equips physicians with a tool that assists them in
accurate patient diagnosis based on probabilistic statistics;
and it helps ailing patients obtain economical medical treatment
and rapid recovery.
The major contributions of the proposed work are as
follows:
• Performs Exploratory Data Analysis (EDA) on the
multivariate data.
• Builds multiple probabilistic classifiers.
• Performs a comparative study of the performance of
well-calibrated classifiers based on several evaluation metrics.
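To make the notion of a well-calibrated classifier concrete, the following minimal sketch computes two common calibration diagnostics in plain Python: the Brier score and the per-bin values behind a reliability diagram. The choice of these two diagnostics, the function names, and the sample values are illustrative assumptions; they are not necessarily the metrics or data used in this study.

```python
from typing import List, Tuple


def brier_score(y_true: List[int], y_prob: List[float]) -> float:
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def reliability_bins(
    y_true: List[int], y_prob: List[float], n_bins: int = 5
) -> List[Tuple[float, float, int]]:
    """Group predictions into equal-width probability bins.

    Returns, for each non-empty bin, a tuple of
    (mean predicted probability, observed fraction of positives, count).
    """
    acc = [[0.0, 0, 0] for _ in range(n_bins)]  # [sum of p, sum of y, count]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        acc[idx][0] += p
        acc[idx][1] += y
        acc[idx][2] += 1
    return [(s / n, pos / n, n) for s, pos, n in acc if n > 0]


# Hypothetical outcomes (1 = treatment succeeded) and predicted probabilities.
y_true = [0, 0, 1, 1, 1, 0, 1, 1]
y_prob = [0.10, 0.30, 0.70, 0.90, 0.85, 0.25, 0.65, 0.45]

score = brier_score(y_true, y_prob)
bins = reliability_bins(y_true, y_prob, n_bins=5)

print("Brier score:", round(score, 4))
for mean_pred, observed, count in bins:
    print(f"mean predicted={mean_pred:.2f}  observed={observed:.2f}  n={count}")
```

Under this view, a classifier is well calibrated when the observed fraction of positives in each bin is close to the mean predicted probability of that bin, and a lower Brier score indicates better probabilistic predictions.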
Let ‘T’ be the unseen outcome of a patient undergoing the
two wart treatments, ‘E’ be the set of records showing the
results of the patients who have undergone the two wart
treatments, stored as comma-separated values (CSV), and
‘P’ be the accuracy of classifying ‘T’ based on ‘E’. The
classification problem in the current study can therefore be defined as
developing a machine learning model ‘M’ that is trained on
all the features present in ‘E’ (the Experience) to predict
‘T’ (the Task) while improving the classification accuracy
‘P’ (the Performance).
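The formulation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CSV columns (age_group, wart_type, result), the records themselves, and the frequency-based model with Laplace smoothing are hypothetical stand-ins, not the study's data or classifiers.

```python
import csv
import io
from collections import defaultdict

# Hypothetical records 'E' in CSV form; columns and values are illustrative only.
E_CSV = """age_group,wart_type,result
young,1,1
young,1,1
young,2,1
old,1,0
old,2,0
old,2,1
young,2,0
old,1,0
"""


def load_records(text):
    """Parse CSV text into a list of dictionaries, one per patient record."""
    return list(csv.DictReader(io.StringIO(text)))


def train(records, feature):
    """Model 'M': estimate P(result = 1 | feature value) with Laplace smoothing."""
    counts = defaultdict(lambda: [0, 0])  # value -> [successes, total]
    for r in records:
        counts[r[feature]][0] += int(r["result"])
        counts[r[feature]][1] += 1
    return {v: (s + 1) / (n + 2) for v, (s, n) in counts.items()}


def predict_proba(model, record, feature, default=0.5):
    """Task 'T': predicted probability that the treatment succeeds."""
    return model.get(record[feature], default)


def accuracy(model, records, feature, threshold=0.5):
    """Performance 'P': fraction of records whose thresholded prediction is correct."""
    hits = sum(
        (predict_proba(model, r, feature) >= threshold) == bool(int(r["result"]))
        for r in records
    )
    return hits / len(records)


E = load_records(E_CSV)
train_set, test_set = E[:6], E[6:]      # simple hold-out split
M = train(train_set, "age_group")
P = accuracy(M, test_set, "age_group")

print("P(success | age_group):", M)
print("Hold-out accuracy P:", P)
```

Note that the model returns a probability for each unseen record rather than a class label; the label is obtained only when a threshold is applied to compute ‘P’.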
The classification task (here, predicting class probabilities) is an
important machine learning task that enables predictions
based on the available data sets, referred to here as the history of
treatment. The two data sets of wart treatment therapies chosen
in this study are taken from the UCI Machine Learning
Repository, contributed by the work of [5].
The rest of the paper is structured as follows. Section-II
presents the related work in the field of healthcare data
analytics. Section-III presents the detailed framework used in
this study. Section-IV presents the experimental setup and the
results of exploratory data analysis. Section-V presents detailed
visualization and discussion of results. Section-VI concludes
the proposed work and outlines its further extensions.
II. RELATED WORK
Many research studies have been conducted in the area of
healthcare data analytics using machine learning. In the work
of [7], many basic machine learning algorithms, such as
Logistic Regression, KNN, Naïve Bayes and Decision trees