Original Research Article

Feature projection k-NN classifier model for imbalanced and incomplete medical data

Piotr Porwik, Tomasz Orczyk *, Marcin Lewandowski, Marcin Cholewa

Computer Systems Department, Institute of Computer Science, University of Silesia, ul. Bedzinska 39, 41-200 Sosnowiec, Poland

* Corresponding author at: Computer Systems Department, Institute of Computer Science, University of Silesia, ul. Bedzinska 39, 41-200 Sosnowiec, Poland. E-mail address: tomasz.orczyk@us.edu.pl (T. Orczyk).

Biocybernetics and Biomedical Engineering xxx (2016) xxx–xxx
Journal homepage: www.elsevier.com/locate/bbe

Please cite this article in press as: Porwik P, et al. Feature projection k-NN classifier model for imbalanced and incomplete medical data. Biocybern Biomed Eng (2016), http://dx.doi.org/10.1016/j.bbe.2016.08.002

Article info

Article history: Received 24 May 2016; Accepted 24 August 2016; Available online xxx

Keywords: Liver disease; Fibrosis stages; Computer aided diagnosis; Classifiers; Features selection methods

Abstract

Many datasets, especially historical medical data, are incomplete. Varying data quality can significantly hamper medical diagnosis and is a bottleneck of medical support systems, which are nowadays often used in medical diagnosis. Even a large amount of data can be unsuitable when the data are imbalanced, missing or corrupted. In some cases these problems can be overcome by machine learning algorithms designed for predictive modeling. The proposed approach was tested on real medical data and on some benchmark datasets from the UCI repository. From a medical point of view, liver fibrosis is difficult to treat and has a significant social and economic impact. Stages of liver fibrosis are diagnosed by clinical observation and evaluation, coupled with the so-called METAVIR rating scale. However, these methods may be insufficient, especially for recognizing the phase of the disease. This paper describes a newly developed algorithm for non-invasive fibrosis stage recognition using machine learning methods: a classification model based on a feature projection k-NN classifier. This solution allows data characteristics to be extracted from historical data, which may be incomplete and may contain imbalanced (unequally sized) sets of patients. The proposed novel solution is based on peripheral blood analysis without any specialized biomarkers; it can be successfully included in medical diagnosis support systems and might be a powerful tool for effective estimation of liver fibrosis stages.

© 2016 Nałęcz Institute of Biocybernetics and Biomedical Engineering of the Polish Academy of Sciences. Published by Elsevier Sp. z o.o. All rights reserved.

1. Introduction

Data quality, especially of medical data, has a great impact on the performance of many classifiers. Data noise can occur when data have been corrupted by inappropriate measurement or human error. It can be observed, for example, in electronic healthcare records or in hospital medical databases [1–3]. Classification methods extract knowledge from training datasets, which means that these datasets should be credible and complete. Unfortunately, this is not always possible, especially when the training data come from various sources or refer to the past. Classification methods therefore depend heavily on the quality of the training dataset. In medical information incomplete data (which can also be defined as noise or missing data) often co-exist with imbalanced