International Journal of Management & Information Systems – Third Quarter 2012, Volume 16, Number 3
© 2012 The Clute Institute http://www.cluteinstitute.com/

Does Removing/Replacing Missing Values Improve The Models' Classification Performances?

Jozef Zurada, University of Louisville, USA

ABSTRACT

The paper explores the effect of removing or replacing missing values on the classification performance of several models. The original data set, which contains a relatively large number of missing values, comes from the credit scoring context. This data set was not used to build the models; instead, it was converted into five other data sets in which missing values were either removed or replaced using different techniques. The models were built and tested on these five data sets. Preliminary computer simulations showed that the models created and tested on the four data sets in which missing values were replaced exhibited significantly better predictive performance than the model built and tested on the data set with missing values removed.

Keywords: Classification Models; Credit Scoring Context; Missing Values Replacement/Removal; Improved Predictive Accuracy

INTRODUCTION

Missing values for one or more attributes are quite common in large data sets. They can be a byproduct of data collection errors or incomplete customer responses, to name two causes. Past research and experiments have shown that data reduction and transformation, such as sampling, feature elimination, and value reduction (for example, binning or smoothing the values that each feature takes), may improve the prediction accuracy of the models and make them simpler to interpret (Berry and Linoff, 2004; Frank and Witten, 2005; Giudici, 2003; Han and Kamber, 2001; Hand et al., 2001; Kantardzic, 2011; Larose, 2005; Olson and Shi, 2007; Pyle, 1999). This paper examines the effect of removing and replacing missing values on the global classification performance of several models.
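The distinction between removing and replacing missing values can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the four toy records and the field names (`loan`, `income`, `default`) are hypothetical and are not taken from the paper's data set, and mean replacement stands in for any one replacement technique rather than for the specific techniques used in this study.

```python
from statistics import mean

# Toy records (hypothetical values and field names); None marks a missing value.
records = [
    {"loan": 1100, "income": 40000, "default": 0},
    {"loan": 1300, "income": None, "default": 1},
    {"loan": None, "income": 52000, "default": 0},
    {"loan": 1800, "income": 61000, "default": 0},
]
FEATURES = ["loan", "income"]


def remove_incomplete(rows):
    """Removal: keep only the records with no missing feature value."""
    return [r for r in rows if all(r[f] is not None for f in FEATURES)]


def replace_with_mean(rows):
    """Replacement: fill each missing value with the mean of the
    observed values of the same feature. Returns new records and
    leaves the input untouched."""
    fill = {f: mean(r[f] for r in rows if r[f] is not None) for f in FEATURES}
    return [
        {**r, **{f: fill[f] if r[f] is None else r[f] for f in FEATURES}}
        for r in rows
    ]


removed = remove_incomplete(records)
replaced = replace_with_mean(records)
print(len(removed), len(replaced))
```

In this toy example, removal keeps only two of the four records, whereas replacement keeps all four, which is the trade-off between the two strategies that the paper investigates.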
The models examined are logistic regression (LR), neural networks (NN), support vector machines (SVM), k-nearest neighbor (kNN), and decision trees (DT). The areas under the ROC charts are used as the criterion of the models' performances: the larger the area, the better the model.

The original data set was drawn from the credit scoring context and contained 5,960 records and 13 variables, as well as a significant number of missing values. The original data set was not used to build and test the models. Instead, five other data sets were created from it. In the first data set, all records that had at least one missing value for any variable were removed. The four other data sets were created by applying different missing value replacement techniques. Initial computer simulation shows that the models built and tested on the data sets with missing values replaced perform significantly better than the model built and tested on the data set from which missing values were removed.

MISSING VALUES IMPUTATION METHODS

Modelers have to make assumptions about the missing data to select the best missing value replacement algorithm. For example, modelers often replace a missing value with the arithmetic average, median, mode, or another measure of the central tendency of the attribute for the given class. These techniques assume that the M