DOI: http://dx.doi.org/10.26483/ijarcs.v12i2.6696
Volume 12, No. 2, March-April 2021
International Journal of Advanced Research in Computer Science
RESEARCH PAPER
Available Online at www.ijarcs.info
ISSN No. 0976-5697
© 2020-2022, IJARCS All Rights Reserved

WELL-CALIBRATED PROBABILISTIC MACHINE LEARNING CLASSIFIERS FOR MULTIVARIATE HEALTHCARE DATA

Akram Pasha
School of Computer Science and Engineering
REVA University
Bengaluru, India

Latha P. H.
Department of Information Science and Engineering
Sambhram Institute of Technology
Bengaluru, India

Abstract: Healthcare applications frequently collect and store patient data (mostly multivariate) to examine the history of treatment and thereby enhance its effectiveness. Efficient treatment of the patient depends on the performance of the machine learning models used for analytics tasks on patient data. For any disease classification problem, it is convenient for a machine learning classification model in a healthcare application to predict the probability of an observation belonging to each possible class rather than to predict a class value directly. Such predicted probabilities must be calibrated to support confidence in the machine learning classification models used in many healthcare applications. In this paper, the predicted probabilities are studied to diagnose and improve the calibration of models used for probabilistic classification. The general performance of selected classification models on two recent wart skin disease treatment data sets is also reported.

Keywords: Data Mining, Machine Learning, Classification, Data Analytics, Calibration of Classifiers, Healthcare Systems.

I. INTRODUCTION

In the current era of big data, technological advancements are boosting the effectiveness of healthcare applications.
Today, doctors are well equipped with the results of advanced analytics performed on the history of patient records to serve patients effectively. The electronic information about patients provided to doctors must be increased to enhance the overall effectiveness of the treatment given. However, having access to the important patterns in the patients' data could be a routine job for any disease diagnostic expert. Diagnostic experts would certainly find it handy to understand a patient's disease risks through the various patterns found in the readings, laboratory test results, race, gender, case history, and socioeconomic standing. Presently, the domain of data analytics has contributed in various ways to understanding and analyzing healthcare data [1-3]. Data analytics has proven to be an effective approach to enhancing medical treatment for patients [4], giving clinicians a great advantage in improving the quality of their expert choices during patient diagnosis. Subsequently, it has contributed to the speedy recovery of patients with cost-effective treatment [1-4].

Machine learning has always been the driving force for data analytics, and has been very powerful in analyzing massive data sets that are beyond the normal human capability for analysis [7-9]. Machine learning is capable of converting analytical results into information suitable for physicians to gain clinical insights that aid them in designing and providing enhanced health care for patients.

The applications of the proposed study are threefold: it aids statisticians in exploring the behavior of probabilistic classification models on multivariate data; it equips physicians with a tool that assists them in accurate patient diagnosis based on probabilistic statistics; and it helps ailing patients obtain economical medical treatment and rapid recovery.
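
To make the notion of a calibrated probability concrete, the following is a minimal sketch, assuming a binary outcome (treatment success coded as 1) and using only the Python standard library. The helper names `brier_score` and `reliability_bins` and the toy data are hypothetical illustrations; they are not results from this study and are not claimed to be the metrics it uses. A well-calibrated model is one for which, within each bin, the mean predicted probability is close to the observed fraction of positive outcomes.

```python
from typing import List, Tuple


def brier_score(probs: List[float], labels: List[int]) -> float:
    """Mean squared difference between predicted probability and 0/1 outcome.

    Lower is better; 0.0 means every prediction was exactly right.
    """
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def reliability_bins(
    probs: List[float], labels: List[int], n_bins: int = 5
) -> List[Tuple[float, float, int]]:
    """Group predictions into equal-width probability bins.

    Returns one (mean predicted probability, observed positive fraction,
    count) tuple per non-empty bin, in order of increasing probability.
    """
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Clamp so that p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    summary = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            summary.append((mean_p, observed, len(members)))
    return summary


# Hypothetical predicted probabilities of treatment success and true outcomes.
probs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1]

if __name__ == "__main__":
    print("Brier score:", brier_score(probs, labels))
    for mean_p, observed, count in reliability_bins(probs, labels):
        print(f"predicted={mean_p:.2f} observed={observed:.2f} n={count}")
```

Plotting the observed fraction against the mean predicted probability for each bin gives the usual reliability diagram; points far from the diagonal indicate over- or under-confident predictions.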

The following are the major contributions of the proposed work:
• Performs Exploratory Data Analysis (EDA) on the multivariate data.
• Builds multiple probabilistic classifiers.
• Performs a comparative study of the performance of well-calibrated classifiers based on several evaluation metrics.

Let ‘T’ be an unseen outcome of a patient undergoing the two wart treatments, ‘E’ be the set of records showing the results of the patients who have undergone the two wart treatments, stored as Comma-Separated Values, and ‘P’ be the accuracy of classifying ‘T’ based on ‘E’. The classification problem in the current study can therefore be defined as developing a machine learning model ‘M’ that is trained on all the features present in ‘E’ (the Experience) to predict ‘T’ (the Task) while improving the accuracy ‘P’ (the Performance) of classification. Classification, here the prediction of class probabilities, is an important machine learning task that enables predictions based on the available data sets, referred to as the history of treatment. The two data sets of wart treatment therapies chosen in this study are taken from the UCI Machine Learning Repository, contributed by the work of [5].

The rest of the paper is structured as follows. Section-II presents the related work in the field of healthcare data analytics. Section-III presents the detailed framework used in this study. Section-IV presents the experimental setup and the results of exploratory data analysis. Section-V presents detailed visualization and discussion of results. Section-VI concludes the proposed work and outlines further extensions.

II. RELATED WORK

Many research studies have been conducted in the area of healthcare data analytics using machine learning. In the work of [7], many basic machine learning algorithms, such as Logistic Regression, KNN, Naïve Bayes and Decision trees