Tools for Statistical Analysis with Missing Data: Application to a Large Medical Database
Cristian Preda (a), Alain Duhamel (a), Monique Picavet (a), Tahar Kechadi (b)
(a) Faculté de Médecine, France
(b) Department of Computer Science, University College of Dublin
Abstract
Missing data is a common feature of large data sets in general and medical data
sets in particular. Depending on the goal of statistical analysis, various techniques
can be used to tackle this problem. Imputation methods consist in substituting the
missing values with plausible or predicted values so that the completed data can
then be analysed with any chosen data mining procedure. In this work, we study
imputation in the context of multivariate data and evaluate a number of methods
that are available in today's standard statistical software packages. Imputation
using multivariate classification, multiple imputation and imputation by factorial
analysis are compared using simulated data and a large medical database (from the
diabetes field) with numerous missing values. Our main result is a control
chart for assessing data quality after the imputation process. To this end, we
developed an algorithm for which the input is a set of parameters describing the
underlying data (e.g., covariance matrix, distribution) and the output is a chart
which plots the change in the prediction error with respect to the proportion of
missing values. The chart is built by means of an iterative algorithm involving four
steps: (1) a sample of simulated data is drawn by using the input parameters; (2)
missing values are randomly generated; (3) an imputation method is used to fill in
the missing data and (4) the prediction error is computed. Steps 1 to 4 are repeated
in order to estimate the distribution of the prediction error. The control chart was
established for the three imputation methods studied here, assuming a multivariate
normal distribution of data. The use of this tool on a large medical database was
then investigated. We show how the control chart can be used to assess the quality
of the imputation process in the pre-processing step upstream of data mining
procedures.
Keywords: Statistical models; Databases; Data mining; Missing values; Imputation
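The four-step procedure outlined in the abstract can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: values are deleted completely at random, column-mean imputation stands in for the three methods compared in the paper, and the prediction error is taken to be the root mean squared error on the deleted cells. The function names and parameters (`simulate_error_curve`, `impute_column_means`, `missing_rates`, `n_repeats`) are hypothetical.

```python
import numpy as np


def impute_column_means(observed):
    """Stand-in imputation (assumption): fill each NaN with its column mean."""
    counts = (~np.isnan(observed)).sum(axis=0)
    sums = np.nansum(observed, axis=0)
    # A column with no observed value falls back to 0.0 (arbitrary choice).
    col_means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    return np.where(np.isnan(observed), col_means, observed)


def simulate_error_curve(mean, cov, n_samples=200,
                         missing_rates=(0.05, 0.1, 0.2, 0.3),
                         n_repeats=50, seed=0):
    """Estimate the prediction-error distribution for each missing-value rate."""
    rng = np.random.default_rng(seed)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    chart = {}
    for rate in missing_rates:
        errors = []
        for _ in range(n_repeats):
            # Step 1: draw a sample from the assumed multivariate normal model.
            complete = rng.multivariate_normal(mean, cov, size=n_samples)
            # Step 2: delete cells completely at random (assumption).
            mask = rng.random(complete.shape) < rate
            if not mask.any():
                continue
            observed = np.where(mask, np.nan, complete)
            # Step 3: fill in the missing cells.
            imputed = impute_column_means(observed)
            # Step 4: prediction error, here RMSE on the deleted cells (assumption).
            diff = imputed[mask] - complete[mask]
            errors.append(float(np.sqrt(np.mean(diff ** 2))))
        q05, q95 = np.quantile(errors, [0.05, 0.95])
        chart[rate] = {"mean": float(np.mean(errors)),
                       "q05": float(q05), "q95": float(q95)}
    return chart


if __name__ == "__main__":
    # Example input parameters (hypothetical): three correlated variables.
    cov = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]]
    for rate, summary in simulate_error_curve([0.0, 0.0, 0.0], cov).items():
        print(rate, summary)
```

Plotting the mean error and its quantile band against the missing-value rate gives a chart of the kind described above. Swapping `impute_column_means` for another imputation routine would produce the corresponding curve for that method.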
1. Introduction
Dealing with missing data is a major problem in Knowledge Discovery in Databases
(KDD). Missing values must be handled with caution in order to avoid
degrading the performance of data mining procedures. The area has attracted much
research interest in recent years, and mainstream statistical analysis software packages
are starting to offer solutions (Celeux [1], Hox [2]). There are three main strategies for
dealing with missing data in the KDD process. The first consists in eliminating incomplete
Connecting Medical Informatics and Bio-Informatics
R. Engelbrecht et al. (Eds.)
ENMI, 2005
Section 3: Decision Support and Clinical Guidelines