International Journal of Management & Information Systems – Third Quarter 2012 Volume 16, Number 3
© 2012 The Clute Institute http://www.cluteinstitute.com/
Does Removing/Replacing Missing
Values Improve The Models'
Classification Performances?
Jozef Zurada, University of Louisville, USA
ABSTRACT
The paper explores the effect of removing/replacing missing values on the classification
performance of several models. The original data set, which contains a relatively large number of
missing values, comes from the credit scoring context. This data set was not used to build the
models, but it was converted to five other data sets with missing values either removed or replaced
using different techniques. The models were built and tested on the five data sets. Preliminary
computer simulation showed that the models created and tested on the four data sets in which
missing values were replaced exhibited significantly better predictive performance than the model
built and tested on the data set with missing values removed.
Keywords: Classification Models; Credit Scoring Context; Missing Values Replacement/Removal; Improved
Predictive Accuracy
INTRODUCTION
Missing values for one or more attributes are quite common in large data sets. They can be a
byproduct of data collection errors or incomplete customer responses, among other causes. Past research
and multiple experiments have shown that data reduction and transformation, such as
sampling, feature elimination, and value reduction (for example, binning or smoothing the values that each feature takes),
may improve the prediction accuracy of the models and make them simpler to interpret (Berry and Linoff, 2004;
Frank and Witten, 2005; Giudici, 2003; Han and Kamber, 2001; Hand et al., 2001; Kantardzic, 2011; Larose, 2005;
Olson and Shi, 2007; Pyle, 1999).
This paper examines the effect of removing and replacing missing values on the global classification
performance of several models. The models examined are logistic regression (LR), neural networks (NN), support
vector machines (SVM), k-nearest neighbor (kNN), and decision trees (DT). The area under the ROC curve is used as
the criterion of each model's performance: the larger the area, the better the model.
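As an illustration of this criterion, the following minimal sketch computes the area under the ROC curve from class labels and model scores using the pairwise (Mann-Whitney) formulation. The function name roc_auc and the toy labels and scores are assumptions made for this example; they are not taken from the paper's experiments, and the paper does not state how its areas were computed.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 0/1 class labels (1 = positive class).
    scores: iterable of model scores; a higher score means "more likely positive".
    The result equals the fraction of (positive, negative) pairs in which
    the positive case receives the higher score, counting ties as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values): three of the four positive/negative
# pairs are ranked correctly.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)
```

An area of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes, which is why a larger area indicates a better model.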
The original data set was drawn from the credit scoring context and contained 5,960 records and 13
variables, as well as a significant number of missing values. The original data set was not used to build and test the
models. However, from this data set, five other data sets were created. In the first data set, all records that had at
least one missing value for any variable were removed. In addition, four other data sets were created by using
different missing value replacement techniques. Initial computer simulation shows that the models built and tested
on the data sets with missing values replaced perform significantly better than the models built and tested on the
data set from which missing values were removed.
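The two kinds of derived data sets described above can be sketched as follows. This is only an illustrative sketch: the variable names x1 and x2, the toy records, and the choice of mean replacement are assumptions made for the example, and this passage does not name the four replacement techniques actually used.

```python
from statistics import mean


def drop_incomplete(records):
    """Removal: keep only records with no missing value (None) in any variable."""
    return [r for r in records if all(v is not None for v in r.values())]


def replace_with_mean(records, columns):
    """Replacement: fill each missing value with the mean of the observed
    values of the same variable. Raises statistics.StatisticsError if a
    listed variable has no observed values at all."""
    means = {
        c: mean(r[c] for r in records if r[c] is not None)
        for c in columns
    }
    return [
        {k: (means[k] if v is None and k in means else v) for k, v in r.items()}
        for r in records
    ]


# Hypothetical toy records; None marks a missing value.
records = [
    {"x1": 1.0, "x2": None, "target": 0},
    {"x1": 3.0, "x2": 4.0, "target": 1},
    {"x1": None, "x2": 6.0, "target": 0},
]

removed = drop_incomplete(records)                  # only the complete record survives
replaced = replace_with_mean(records, ["x1", "x2"])  # all three records survive
print(len(removed), len(replaced))
```

The sketch makes the trade-off visible: removal discards every record with any gap, whereas replacement keeps all records at the cost of introducing estimated values.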
MISSING VALUES IMPUTATION METHODS
Modelers have to make assumptions about the missing data to select the best missing value replacement
algorithm. For example, modelers often replace a missing value with the arithmetic average, median, mode or
another measure of the central tendency of the attribute for the given class. These techniques assume that the