Original Research Article

Feature projection k-NN classifier model for imbalanced and incomplete medical data

Piotr Porwik, Tomasz Orczyk *, Marcin Lewandowski, Marcin Cholewa

Computer Systems Department, Institute of Computer Science, University of Silesia, ul. Bedzinska 39, 41-200 Sosnowiec, Poland

Biocybernetics and Biomedical Engineering xxx (2016) xxx-xxx
Journal homepage: www.elsevier.com/locate/bbe
Available online at www.sciencedirect.com

Article info

Article history:
Received 24 May 2016
Accepted 24 August 2016
Available online xxx

Keywords:
Liver disease
Fibrosis stages
Computer aided diagnosis
Classifiers
Features selection methods

Abstract

Many datasets, especially historical medical data, are incomplete. Varying data quality can significantly hamper medical diagnosis and is a bottleneck of medical support systems, which are nowadays often used in medical diagnosis. Even a large amount of data can be unsuitable when the data are imbalanced, missing or corrupted. In some cases these problems can be overcome by machine learning algorithms designed for predictive modeling. The proposed approach was tested on real medical data and on some benchmark datasets from the UCI repository. From a medical point of view, liver fibrosis is difficult to treat and has a significant social and economic impact. Stages of liver fibrosis are diagnosed by clinical observation and evaluation, coupled with the so-called METAVIR rating scale. However, these methods may be insufficient, especially for recognizing the phase of the disease. This paper describes a newly developed algorithm for non-invasive fibrosis stage recognition using machine learning methods: a classification model based on a feature projection k-NN classifier. This solution allows data characteristics to be extracted from historical data that may be incomplete and may contain imbalanced (unequal) sets of patients. The proposed novel solution is based on peripheral blood analysis without using any specialized biomarkers. It can be included in medical diagnosis support systems and might be a powerful tool for effective estimation of liver fibrosis stages.

© 2016 Nałęcz Institute of Biocybernetics and Biomedical Engineering of the Polish Academy of Sciences. Published by Elsevier Sp. z o.o. All rights reserved.

* Corresponding author at: Computer Systems Department, Institute of Computer Science, University of Silesia, ul. Bedzinska 39, 41-200 Sosnowiec, Poland.
E-mail address: tomasz.orczyk@us.edu.pl (T. Orczyk).

Please cite this article in press as: Porwik P, et al. Feature projection k-NN classifier model for imbalanced and incomplete medical data. Biocybern Biomed Eng (2016), http://dx.doi.org/10.1016/j.bbe.2016.08.002
ISSN 0208-5216

1. Introduction

Data quality, especially in the medical domain, has a great impact on the performance of many classifiers. Data noise can occur when the data have been corrupted by inappropriate measurement or human error. It can be observed, for example, in electronic healthcare records or in hospital medical databases [13]. Classification methods extract knowledge from training datasets, which means that these datasets should be credible and complete. Unfortunately, this is not always possible, especially when training data come from various sources or refer to the past. Classification methods therefore depend heavily on the quality of the training dataset. In medical information, incomplete data (which can also be described as noise or missing data) often co-exist with imbalanced