BIOMETRICS 56, 384-388
June 2000

Conditional and Unconditional Categorical Regression Models with Missing Covariates

Glen A. Satten
Centers for Disease Control and Prevention, Atlanta, Georgia 30333, U.S.A.
email: gaso@cdc.gov

and

Raymond J. Carroll
Department of Statistics and Department of Biostatistics and Epidemiology, Texas A&M University, College Station, Texas 77843-3143, U.S.A., and Department of Biostatistics and Epidemiology, University of Pennsylvania, Philadelphia, Pennsylvania 19104, U.S.A.

SUMMARY. We consider methods for analyzing categorical regression models when some covariates (Z) are completely observed but other covariates (X) are missing for some subjects. When data on X are missing at random (i.e., when the probability that X is observed does not depend on the value of X itself), we present a likelihood approach for the observed data that allows the same nuisance parameters to be eliminated in a conditional analysis as when data are complete. An example of a matched case-control study is used to demonstrate our approach.

KEY WORDS: Case-control study; Endometrial cancer; Likelihood; Matching; Missing at random; Missing data; Two-stage sample.

1. Introduction

A common problem in categorical data analysis is to determine the effect of explanatory variables V on a binary outcome D of interest. In addition, the study may call for a design in which a conditional analysis is used to eliminate nuisance parameters, such as in a matched analysis. However, some components of V may not be measured for all study subjects. For example, in a case-control study of endometrial cancer conducted among residents of the Leisure World retirement community, a binary variable denoting obesity was missing in approximately 16% of respondents. Hence, a missing-data approach is required for the analysis of this variable.
Because the study design matched each case with four controls, the missing-data approach taken must remain valid for highly stratified data.

Data on a subset of covariates may be missing for a variety of reasons. For example, in epidemiologic studies using convenience samples, the medical records used to determine covariate values may not be complete for all participants. In some studies, data on a subset of covariates may be missing by design. This is the case in studies that use two-stage sampling, gathering simple-to-measure covariates on all study participants and then only gathering information on complex or expensive-to-measure covariates on a subset of study participants. In either case, a methodology that allows for different rates of missingness as a function of the observed covariates is highly desirable, especially as complete-case analyses (analyses in which only data from participants with complete information are used) are known to yield biased results when data are not missing completely at random (Little and Rubin, 1987). In addition, a methodology that allows for elimination of nuisance parameters through conditioning is also essential for analyses of highly stratified data such as matched case-control studies.

Satten and Kupper (1993a,b) developed an approach to categorical regression analyses in which covariate information was missing for some people and in which surrogate variables were to be used in place of the effect of missing covariate information. This methodology required the nondifferential errors approximation, but nuisance parameters could be removed in conditional analyses such as for matched sets. The purpose of this paper is to show that, if surrogate variables are not used, the Satten and Kupper approach is exact and provides a likelihood-based approach to categorical missing-data problems in which some covariates are missing for some study participants.
As described in Section 5, our work can be described as a generalization of that of Paik and Sacco (2000).

In Section 2, we state the results of Satten and Kupper (1993) adapted for the missing-data case. In Section 3, we consider maximum likelihood estimation for unconditional and