Statistics and Its Interface Volume 7 (2014) 87–99

Estimation of rank-tracking probabilities using nonparametric mixed-effects models for longitudinal data

Xin Tian* and Colin O. Wu

An important scientific objective of longitudinal studies involves tracking the probability of a subject having certain health status over the course of the study. Proper definitions and estimates of disease risk tracking have important implications in the design and analysis of long-term biomedical studies and in developing guidelines for disease prevention and intervention. We study in this paper a class of “rank-tracking probabilities” (RTP) to describe a subject’s conditional probabilities of having certain health outcomes at two different time points. Structural nonparametric estimation and inferences for the RTPs and their functions are developed based on nonparametric mixed-effects models and B-spline smoothing methods. Statistical properties of our procedures are investigated through a simulation study. We apply our methods to an epidemiological study of childhood cardiovascular risk factors, and demonstrate that the RTPs and their nonparametric estimators provide useful tools to quantitatively evaluate whether the cardiovascular risks, such as obesity and hypertension, can be tracked from early childhood to adolescence.

AMS 2000 subject classifications: Primary 62H10, 62G08; secondary 62P10, 65D10.

Keywords and phrases: Basis approximation, Conditional distribution, Longitudinal study, Mixed model, Time-varying coefficient model, Rank-tracking probability.

1. INTRODUCTION

Because the subjects are repeatedly measured over time, longitudinal studies are commonly used in biomedical research for the evaluation of population-means or subject-specific temporal trends of the outcome variables.
Most statistical methods in longitudinal analysis, such as the mixed-effects models or nonparametric regression models, are focused on evaluating the effects of time and covariates on the conditional-means of the outcome variables with the potential serial correlations taken into account. Recent summaries of longitudinal methods can be found, for example, in Verbeke and Molenberghs (2000), Diggle et al. (2002) and Fitzmaurice et al. (2009), among others. In addition to the conditional-mean based regression approaches, conditional-distribution or quantile based regression models have also been shown to be an effective tool for the analysis of repeated measurements data (e.g., Hall et al., 1999; Wei et al., 2006; Wu et al., 2010). These methods focus on evaluating the covariate effects on the distributions of the outcome variables over time, and may lead to better interpretations when the underlying scientific objectives are specified by the distribution functions.

* Corresponding author.

In addition to the above regression analysis, many biomedical studies require the evaluation of subjects at multiple time points. An important scientific objective of longitudinal studies is to track the likelihood of a subject having certain health status at a later time point given the subject’s health status at an earlier time point. Kavey et al. (2003) discussed the importance of tracking the cardiovascular risk factors over the years beginning in childhood with regard to primary prevention of the subsequent cardiovascular disease in adulthood. The existing statistical methods for longitudinal analysis mentioned above, although useful in various settings, do not provide a direct measure for this type of “tracking ability” of disease risk factors. Another class of statistical methods that is somewhat relevant to the concept of tracking ability is the estimation of serial correlations across different time points.
Intuitively, if a subject’s health conditions at different time points are positively correlated, then subjects with undesirable health status at an earlier time are expected to be more likely to have undesirable health status at a later time. Statistical evidence for the strength of correlation is then presented by the estimates of the covariance matrices. Some recent covariance estimation methods are discussed, for example, in Wu and Pourahmadi (2003) and Fan and Wu (2008). Serial correlations may give some evidence of tracking ability, but they are insufficient as a quantitative measure of the likelihood of risk factor tracking over time.

The National Growth and Health Study (NGHS) is a good example that illustrates the importance of developing a novel statistical quantity to directly measure the tracking probability in this context. This is a large epidemiological study of childhood growth and cardiovascular risks of 2,379 girls, who were 9 or 10 years old at enrollment,