Efficient Monte Carlo Methods for Conditional Logistic Regression

Cyrus R. MEHTA, Nitin R. PATEL, and Pralay SENCHAUDHURI

Exact inference for the logistic regression model is based on generating the permutation distribution of the sufficient statistics for the regression parameters of interest, conditional on the sufficient statistics for the remaining (nuisance) parameters. Despite the availability of fast numerical algorithms for the exact computations, there are numerous instances in which a dataset is too large to be analyzed by the exact methods, yet too sparse or unbalanced for the maximum likelihood approach to be reliable. What is needed is a Monte Carlo alternative to the exact conditional approach that can bridge the gap between the exact and asymptotic methods of inference. The problem is technically hard because conventional Monte Carlo methods lead to massive rejection of samples that do not satisfy the linear integer constraints of the conditional distribution. We propose a network sampling approach to the Monte Carlo problem that eliminates rejection entirely. Its advantages over alternative saddlepoint and Markov chain Monte Carlo approaches are also discussed.

KEY WORDS: Exact Logistic Regression; MCMC; Network Algorithms; Single Saddlepoint; Smart Monte Carlo.

1. INTRODUCTION

Logistic regression is a popular mathematical model for the analysis of binary data, with widespread applicability in the physical, biomedical, and behavioral sciences. Parameter inference for this model is usually based on maximizing the unconditional likelihood function. For large, well-balanced datasets, or for datasets with only a few parameters, unconditional maximum likelihood inference is a satisfactory approach.
However, unconditional maximum likelihood inference can produce inconsistent point estimates, inaccurate p values, and inaccurate confidence intervals for small or unbalanced datasets, and for datasets with a large number of parameters relative to the number of observations. Sometimes the method fails entirely, because no estimates can be found that maximize the unconditional likelihood function. A methodologically sound alternative that has none of the aforementioned drawbacks is the exact conditional approach. Here one estimates the parameters of interest by computing the exact permutation distributions of their sufficient statistics, conditional on the observed values of the sufficient statistics for the remaining "nuisance" parameters. The major stumbling block to exact permutational inference has always been the heavy computational burden that it imposes. Important contributions to this computational problem have been made by Baglivo, Pagano, and Spino (1996), Hirji (1992), Hirji, Mehta, and Patel (1987, 1988), and Tritchler (1984). Despite the availability of fast numerical algorithms for the exact computations, there are numerous instances in which a dataset is too large to be analyzed by the exact methods, yet too sparse or unbalanced for the maximum likelihood approach to be reliable. What is needed is a Monte Carlo alternative to the exact conditional approach that can bridge the gap between the exact and asymptotic methods of inference. The problem is technically difficult, because conventional Monte Carlo methods lead to massive rejection of samples that do not satisfy the constraints of the conditional distribution. We develop a network-based direct Monte Carlo sampling approach that eliminates rejection entirely. We then show how, as a byproduct of the network algorithm, one may compute single saddlepoint approximations to the exact permutation distributions. Finally, we discuss the advantages and limitations of direct Monte Carlo sampling relative to Monte Carlo sampling on Markov chains.

Cyrus R. Mehta is President, Nitin R. Patel is Vice President, and Pralay Senchaudhuri is Director of Research and Development, Cytel Software Corporation, Cambridge, MA 02139 (E-mail: mehta@cytel.com). Patel is also Visiting Professor, Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA 02139.

2. FORMULATION OF THE DIRECT MONTE CARLO SAMPLING PROBLEM

Let Y = (Y_1, Y_2, ..., Y_g) be g independent binomial random variables, where Y_j represents the number of responses in n_j Bernoulli trials, each having response probability \pi_j, and let y^* = (y_1^*, y_2^*, ..., y_g^*) be the value of Y actually observed. Suppose that the response probabilities are specified by the logistic regression model

\log \frac{\pi_j}{1 - \pi_j} = \lambda a_j + \theta w_j,   (1)

where \lambda = (\lambda_1, \lambda_2, ..., \lambda_c) and \theta = (\theta_1, \theta_2, ..., \theta_d) are unknown model parameters, and a_j = (a_{j1}, a_{j2}, ..., a_{jc})' and w_j = (w_{j1}, w_{j2}, ..., w_{jd})' are the corresponding covariate vectors. Then the likelihood function, or probability of the observed y^* given (\lambda, \theta), is

\Pr\{Y = y^* \mid \lambda, \theta\} = \frac{\prod_{j=1}^{g} \binom{n_j}{y_j^*} \exp(\lambda a_j y_j^* + \theta w_j y_j^*)}{\prod_{j=1}^{g} [1 + \exp(\lambda a_j + \theta w_j)]^{n_j}}.   (2)

If we are interested in making inferences about \theta and regard \lambda as a nuisance parameter, then we may eliminate \lambda from the likelihood function by conditioning on its suffi-

© 2000 American Statistical Association
Journal of the American Statistical Association
March 2000, Vol. 95, No. 449, Theory and Methods
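To make the model concrete, the following Python sketch evaluates the log of the unconditional likelihood in Equation (2) for a small dataset. All data values, group counts, and parameter settings below are invented purely for illustration; only the formula itself comes from the text.

```python
import math

def log_likelihood(y, n, A, W, lam, theta):
    """Log of Pr{Y = y* | lambda, theta} as in Equation (2).

    y[j]  : observed response count y_j* for binomial group j
    n[j]  : number of Bernoulli trials n_j in group j
    A[j]  : covariate vector a_j (length c); W[j] : w_j (length d)
    lam   : parameter vector lambda (length c); theta : (length d)
    """
    ll = 0.0
    for j in range(len(y)):
        # linear predictor: lambda * a_j + theta * w_j
        eta = sum(l * a for l, a in zip(lam, A[j])) + \
              sum(t * w for t, w in zip(theta, W[j]))
        # log of C(n_j, y_j) exp(eta y_j) / [1 + exp(eta)]^{n_j}
        ll += math.log(math.comb(n[j], y[j])) + eta * y[j] \
              - n[j] * math.log1p(math.exp(eta))
    return ll

# Invented example: g = 3 groups, one nuisance covariate (an
# intercept, a_j = 1) and one covariate of interest w_j.
y = [2, 1, 3]                  # observed responses y*
n = [4, 3, 5]                  # trials per group
A = [[1.0], [1.0], [1.0]]      # a_j (intercept column)
W = [[0.0], [1.0], [2.0]]      # w_j (covariate of interest)
print(log_likelihood(y, n, A, W, lam=[-0.5], theta=[0.4]))
```

Exact conditional inference on \theta fixes the nuisance sufficient statistic \sum_j a_j y_j at its observed value; a naive Monte Carlo scheme that samples each Y_j independently and discards draws violating this constraint rejects nearly all of its samples, which is the inefficiency the network sampling approach of this article is designed to eliminate.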