Journal of Statistical Computation & Simulation
Vol. 00, No. 00, Month 2009, 1–26

RESEARCH ARTICLE

A Comparison of Various Software Tools for Dealing with Missing Data via Imputation

José Cortiñas Abrahantes a, Cristina Sotto a,b, Geert Molenberghs a,c, Geert Vromman d and Bart Bierinckx d

a Interuniversity Institute for Biostatistics and statistical Bioinformatics, Universiteit Hasselt, Agoralaan 1, B-3590 Diepenbeek, Belgium
b School of Statistics, University of the Philippines, Diliman, Quezon City, Philippines
c Interuniversity Institute for Biostatistics and statistical Bioinformatics, Katholieke Universiteit Leuven, Kapucijnenvoer 35, B-3000 Leuven, Belgium
d IM Associates BVBA, Sales and Marketing Effectiveness, Brusselsesteenweg 52, B-3000 Leuven, Belgium

(released November 2009)

In real-life situations, we often encounter data sets containing missing observations. Statistical methods that address missingness have been extensively studied in recent years. One of the more popular approaches involves imputation of the missing values prior to the analysis, thereby rendering the data complete. Imputation broadly encompasses a whole range of techniques that have been developed to make inferences about incomplete data, from very simple strategies (e.g., mean imputation) to more advanced approaches that require estimation, for instance, of posterior distributions using MCMC methods. Additional complexity arises when the number of missingness patterns increases and/or when both categorical and continuous random variables are involved. Implementations, in the form of routines, procedures, or packages capable of generating imputations for incomplete data, are now widely available. We review some of these in the context of a motivating example, as well as in a simulation study, under two missingness mechanisms (missing at random; missing not at random).
Thus far, evaluations of existing implementations have frequently centered on the resulting parameter estimates of the prescribed model of interest after imputing the missing data. In some situations, however, interest may very well lie in the quality of the imputed values at the level of the individual, an issue that has received relatively little attention. In this paper, we focus on the latter to provide further insight into the performance of the different routines, procedures, and packages in this respect.

Keywords: multiple imputation; missing data; missing at random; missing not at random; random forest.

* Corresponding author. Email: jose.cortinas@uhasselt.be
ISSN: 0094-9655 print/ISSN 1563-5163 online
© 2009 Taylor & Francis
DOI: 10.1080/0094965YYxxxxxxxx
http://www.informaworld.com

1. Introduction

Missing data are frequently encountered in real-world studies. Many different circumstances can give rise to incomplete data. Whereas in some cases the design of the experiment or survey itself can induce missingness, in other situations missing values are simply due to chance. It is also quite possible that some variables are not collected from all subjects, that some subjects drop out of the study, or even that some information is left out, for example, for reasons of confidentiality. While the use of complete-case methods, i.e., methods that exclude subjects with