Computational Statistics and Data Analysis 82 (2015) 126–136

Cook’s distance for generalized linear mixed models

Luis Gustavo B. Pinho a, Juvêncio S. Nobre a,∗, Julio M. Singer b

a Universidade Federal do Ceará, DEMA, Campus do Pici. Fortaleza, CE, 60440-900, Brazil
b Universidade de São Paulo, IME, Rua do Matão, 1010. São Paulo, SP, 05508-090, Brazil

Article history: Received 22 November 2013. Received in revised form 12 August 2014. Accepted 15 August 2014. Available online 1 September 2014.

Keywords: Diagnostics; GLMM; Influence; Leverage

Abstract

We consider an extension of Cook’s distance for generalized linear mixed models with the objective of identifying observations with high influence on the predicted conditional means of the response variable. The proposed distance can be decomposed into factors that help to distinguish between influence on the estimation of fixed effects and on the prediction of random effects. Joint and conditional influence are also considered. A first-order approximation is proposed for more efficient computation and a Monte Carlo simulation is considered to evaluate the efficacy of the proposal. An application to a dataset obtained from the literature is presented to show how such tools can be used in practice.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

By including random effects in the linear predictor, Generalized Linear Mixed Models (GLMMs) constitute a flexible tool for analyzing data with distributions in the exponential family. In repeated measures studies, for example, where each unit may contribute more than one observation, the random effects allow the modeling of individual unit behavior (Zeger et al., 1988). This class of models is also useful for analyzing overdispersed data (Breslow, 1984).
This flexibility, however, comes at the expense of a more complicated maximum likelihood estimation process, since it may be necessary to integrate over several dimensions. For details on GLMMs, the reader is referred to Breslow and Clayton (1993), among others.

Whenever statistical models are considered, care should be taken to verify their assumptions and their adequacy to the data. Diagnostic tools developed for such purposes may be classified into two broad categories. The first, termed residual analysis, is useful to verify assumptions about the distributions of the random elements and to identify observations (or units) with atypical values. The second, called sensitivity analysis, is employed to evaluate the behavior of the components of the model and of the predicted values when observations (or units) are perturbed or deleted.

In the context of traditional linear models (normal, homoskedastic and independent observations), diagnostic methods have been addressed by many authors, among which we mention Cook (1977), Hoaglin and Welsch (1978), Belsley et al. (1980) and Cook and Weisberg (1982). Extensions and generalizations to linear mixed models are considered in Beckman et al. (1987), Hilden-Minton (1995), Lesaffre and Verbeke (1998), Tan et al. (2001), Demidenko (2004), Demidenko and Stukel (2005), Nobre and Singer (2007), Gumedze et al. (2010) and Nobre and Singer (2011), among others. Diagnostics for GLMMs are still not fully explored; some attempts have been made by Xiang et al. (2002), Zhu and Lee (2003), Tchetgen and Coull (2006) and Abad et al. (2010).

Using an approach similar to the one in Tan et al. (2001), we extend the ideas of Xiang et al. (2002) to allow the influence of observations on the estimation of fixed effects and on the prediction of random effects to be evaluated separately. This is an important step when GLMMs are used for prediction purposes.

∗ Corresponding author. Tel.: +55 85 33669155.
E-mail addresses: juvencio@ufc.br, juvenciosantos@gmail.com (J.S. Nobre), jmsinger@ime.usp.br (J.M. Singer).

http://dx.doi.org/10.1016/j.csda.2014.08.008
0167-9473/© 2014 Elsevier B.V. All rights reserved.
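As a point of reference for the case-deletion idea behind the sensitivity analysis mentioned in the Introduction, the following is a minimal sketch of the classical Cook's distance for the normal linear model (Cook, 1977), not of the GLMM extension proposed in this paper. It computes the distance in two ways: from the well-known closed form based on residuals and leverages, and by refitting the model with each observation deleted. The function names, the use of NumPy and the simulated data are assumptions made for illustration only.

```python
import numpy as np


def cooks_distance(X, y):
    """Classical Cook's distance for an ordinary least-squares fit.

    Closed form: D_i = e_i^2 * h_ii / (p * s^2 * (1 - h_ii)^2),
    where e_i is the residual and h_ii the leverage of observation i.
    Assumes X has full column rank.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # Leverages are the squared row norms of Q from a reduced QR of X.
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)  # usual unbiased variance estimate
    return resid**2 * h / (p * s2 * (1.0 - h) ** 2)


def cooks_distance_by_deletion(X, y):
    """Same quantity, obtained by refitting without each observation."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)
    d = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        shift = X @ (beta - beta_i)  # change in all n fitted values
        d[i] = shift @ shift / (p * s2)
    return d


# Illustrative simulated data (hypothetical): a straight line plus noise,
# with the last point moved to a high-leverage, off-line position.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.size)
x[-1], y[-1] = 5.0, 0.0
X = np.column_stack([np.ones_like(x), x])

d_closed = cooks_distance(X, y)
d_deleted = cooks_distance_by_deletion(X, y)
```

In this sketch the two computations agree, and the perturbed last observation has the largest distance. For mixed models the deletion of an observation also affects the predicted random effects, which is the aspect the paper's proposal is designed to separate.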