© 2011 Royal Statistical Society 0035–9254/11/60591
Appl. Statist. (2011) 60, Part 4, pp. 591–605

Subsample ignorable likelihood for regression analysis with missing data

Roderick J. Little and Nanhua Zhang
University of Michigan, Ann Arbor, USA

Address for correspondence: Roderick J. Little, Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, MI 48109-2029, USA. E-mail: rlittle@umich.edu

[Received June 2010. Revised December 2010]

Summary. Two common approaches to regression with missing covariates are complete-case analysis and ignorable likelihood methods. We review these approaches and propose a hybrid class, called subsample ignorable likelihood methods, which applies an ignorable likelihood method to the subsample of observations that are complete on one set of variables, but possibly incomplete on others. Conditions on the missing data mechanism are presented under which subsample ignorable likelihood gives consistent estimates, but both complete-case analysis and ignorable likelihood methods are inconsistent. We motivate and apply the method proposed to data from the National Health and Nutrition Examination Survey, and we illustrate properties of the methods by simulation. Extensions to non-likelihood analyses are also mentioned.

Keywords: Maximum likelihood; Missing data; Multiple imputation; Multivariate regression; Non-ignorable data mechanism

1. Introduction

Missing data are an important practical problem in many applications of statistics. We consider multivariate regression with missing data. Reviews of previous research on the topic include Little (1992), Ibrahim et al. (1999, 2002, 2005) and Chen et al. (2008). Three approaches are

(a) complete-case (CC) analysis, which discards the incomplete cases,
(b) ignorable likelihood (IL) methods, which base inferences on the observed likelihood given a model that does not include a distribution for the missing data mechanism (examples of IL methods include ignorable maximum likelihood (IML), Bayesian inferences, or multiple imputation based on the predictive distribution from a Bayesian model (Rubin, 1987), as in SAS PROC MI (SAS Institute, 2010) or IVEware (Raghunathan et al., 2001)) and
(c) non-ignorable modelling, which derives inference from the likelihood function based on a joint distribution of the variables and the missing data indicators (this approach is less common in practice, because of the difficulty in specifying the model for the missing data mechanism, sensitivity to misspecification of this distribution, problems with identifying the parameters (Little and Rubin (2002), chapter 15) and lack of widely available software).

IL methods have the advantage of retaining all the data, but they assume that the missing data are missing at random (MAR), in the sense that missingness of variables that contain missing values does not depend on the missing values, after conditioning on available data (Rubin, 1976;