Biometrics 61, 715–720
September 2005
DOI: 10.1111/j.1541-0420.2005.00337.x

Simultaneous Group Sequential Analysis of Rank-Based and Weighted Kaplan–Meier Tests for Paired Censored Survival Data

Adin-Cristian Andrei∗ and Susan Murray
Department of Biostatistics, University of Michigan, 1420 Washington Heights, Ann Arbor, Michigan 48109, U.S.A.
∗email: andreia@umich.edu

Summary. This research sequentially monitors paired survival differences using a new class of nonparametric tests based on functionals of standardized paired weighted log-rank (PWLR) and standardized paired weighted Kaplan–Meier (PWKM) tests. During a trial, these tests may alternately assume the role of the more extreme statistic. By monitoring PEMAX, the maximum between the absolute values of the standardized PWLR and PWKM, one combines advantages of rank-based (RB) and non-RB paired testing paradigms. Simulations show that monitoring treatment differences using PEMAX maintains type I error and is nearly as powerful as using the more advantageous of the two tests in proportional hazards (PH) as well as non-PH situations. Hence, PEMAX preserves power more robustly than individually monitored PWLR and PWKM, while maintaining a reasonably simple approach to design and analysis of results. An example from the Early Treatment Diabetic Retinopathy Study (ETDRS) is given.

Key words: Clinical trials; Group sequential monitoring; Nonparametric; Paired weighted Kaplan–Meier; Paired weighted log-rank.

1. Introduction

At the design stages of clinical trials comparing survival outcomes in independent groups, a common plan is to base the design upon a log-rank (LR) statistic of some form (see, for example, Gehan, 1965; Gill, 1980). Another approach for stochastically ordered alternatives is to compare areas under survival curves (see, for example, Pepe and Fleming, 1989).
Versatile tests combining rank-based (RB) and non-RB statistics for independent groups are studied by Chi and Tsai (2001), while Kosorok and Lin (1999) develop sophisticated methods for combining various RB tests. Fundamental independent group sequential methods for families of weighted LR (WLR) tests have been developed and studied by Tsiatis (1981, 1982), Sellke and Siegmund (1983), Slud (1984), and Gu and Lai (1991), among others, and sequential methods for comparing areas under survival curves were developed by Murray and Tsiatis (1999).

For paired censored survival data, where optimality properties for the paired WLR (PWLR) have not been studied, competing methodologies exist to a lesser extent. Some RB and frailty methods are presented by O’Brien and Fleming (1987), Dabrowska (1986, 1990), Murray (2000), and Oakes and Jeong (1998), among others, and paired Pepe–Fleming tests are developed by Murray (2001, 2002). Paired survival data arise in various situations including time to death, disease occurrence or other morbidity in twins, time to vision loss in paired eyes, or failure of matched allografts. For example, 3711 patients with diabetic retinopathy in both eyes were enrolled in the Early Treatment Diabetic Retinopathy Study (ETDRS, 1991a,b) from April 1980 to July 1985, with one eye per patient randomly assigned to early photocoagulation and the other to deferral of photocoagulation until detection of high-risk proliferative retinopathy. In paired settings such as ETDRS, little research involving multiple test statistics is available. Further complicating the design choice in the group sequential setting, the preferred test may change from one interim analysis to the next.

This research is motivated by a desire to formalize inference in the following scenario. Assume that in a paired censored survival analysis with group sequential monitoring, an investigator first uses a PWLR and fails to reject the null hypothesis by a narrow margin.
Then, a paired weighted Kaplan–Meier (PWKM) test is recalled as an attractive alternative and it leads to statistical significance. Or perhaps at different analysis times, statistical advantages are attributed alternately to PWLR or PWKM. In this setting, we provide a middle ground that allows monitoring of both tests, while adjusting for their joint use over time. The proposed test, PEMAX, which is the maximum of the absolute values of the standardized PWLR and PWKM, will be seen to preserve type I error and to have power comparable to the better of these competing tests.

The rest of this article is organized as follows. In Section 2, the sequential joint limiting distribution of PWLR and PWKM is outlined, from which the joint distribution of PEMAX over time is estimated. Although this section is