Biocybernetics and Biomedical Engineering xxx (2016) xxx–xxx
journal homepage: www.elsevier.com/locate/bbe

Original Research Article

Feature projection k-NN classifier model for imbalanced and incomplete medical data

Piotr Porwik, Tomasz Orczyk *, Marcin Lewandowski, Marcin Cholewa

Computer Systems Department, Institute of Computer Science, University of Silesia, ul. Bedzinska 39, 41-200 Sosnowiec, Poland

* Corresponding author at: Computer Systems Department, Institute of Computer Science, University of Silesia, ul. Bedzinska 39, 41-200 Sosnowiec, Poland. E-mail address: tomasz.orczyk@us.edu.pl (T. Orczyk).

Article info

Article history: Received 24 May 2016; Accepted 24 August 2016; Available online xxx

Keywords: Liver disease; Fibrosis stages; Computer aided diagnosis; Classifiers; Features selection methods

Abstract

Many datasets, especially historical medical datasets, are incomplete. Varying data quality can significantly hamper medical diagnosis and is a bottleneck for medical support systems, which are now often used in medical diagnosis. Even a large amount of data can be unsuitable when the data are imbalanced, missing or corrupted. In some cases these problems can be overcome by machine learning algorithms designed for predictive modeling. The proposed approach was tested on real medical data and on several benchmark datasets from the UCI repository. From a medical point of view, liver fibrosis is difficult to treat and has a significant social and economic impact. Stages of liver fibrosis are diagnosed by clinical observation and evaluation, coupled with the so-called METAVIR rating scale. However, these methods may be insufficient, especially for recognizing the phase of the disease. This paper describes a newly developed algorithm for non-invasive fibrosis stage recognition using machine learning methods: a classification model based on a feature projection k-NN classifier. This solution allows data characteristics to be extracted from historical data, which may be incomplete and may contain imbalanced (unequally sized) sets of patients. The proposed solution is based on peripheral blood analysis without any specialized biomarkers. It can be included in medical diagnosis support systems and might be a powerful tool for effective estimation of liver fibrosis stages.

0208-5216/© 2016 Nałęcz Institute of Biocybernetics and Biomedical Engineering of the Polish Academy of Sciences. Published by Elsevier Sp. z o.o. All rights reserved.

Please cite this article in press as: Porwik P, et al. Feature projection k-NN classifier model for imbalanced and incomplete medical data. Biocybern Biomed Eng (2016), http://dx.doi.org/10.1016/j.bbe.2016.08.002

1. Introduction

Data quality, especially that of medical data, has a great impact on the performance of many classifiers. Data noise can occur when the data have been corrupted by inappropriate measurement or human error. It can be observed, for example, in electronic healthcare records or in hospital medical databases [1–3]. Classification methods extract knowledge from training datasets, which means that these datasets should be credible and complete. Unfortunately, this is not always possible, especially when training data come from various sources or refer to the past. Classification methods therefore depend heavily on the quality of the training dataset. In medical information, incomplete data (which can also be regarded as noise or missing data) often co-exist with imbalanced