Special Issue Paper

Received 14 November 2013, Accepted 31 May 2014, Published online 1 July 2014 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/sim.6245

Analysis of incidence and prognosis from ‘extreme’ case-control designs

Agus Salim (a), Xiangmei Ma (b), Katja Fall (c,d), Ove Andrén (c,d) and Marie Reilly (e,*)

The significant investment in measuring biomarkers has prompted investigators to improve cost-efficiency by sub-sampling in non-standard study designs. For example, investigators studying prognosis may assume that any differences in biomarkers are likely to be most apparent in an extreme sample of the earliest deaths and the longest-surviving controls. Simple logistic regression analysis of such data does not exploit the information available in the survival time, and statistical methods that model the sampling scheme may be more efficient. We derive likelihood equations that reflect the complex sampling scheme in unmatched and matched ‘extreme’ case-control designs. We investigated the performance and power of the method in simulation experiments, with a range of underlying hazard ratios and study sizes. Our proposed method resulted in hazard ratio estimates close to those obtained from the full cohort. The standard error estimates also performed well when compared with the empirical variance. In an application to a study investigating markers for lethal prostate cancer, an extreme case-control sample of lethal cases and the longest-surviving controls provided estimates of the effect of Gleason score in close agreement with analysis of all the data. By using the information in the sampling design, our method enables efficient and valid estimation of the underlying hazard ratio from a study design that is intuitive and easily implemented. Copyright © 2014 John Wiley & Sons, Ltd.

Keywords: weighted likelihood; matched design; Cox proportional hazards model; baseline hazard; Kaplan-Meier; logistic regression

1. Introduction

The growth of large and well-defined cohorts, including population registers such as the Swedish Health Registers [1] and biobank initiatives such as the U.K. Biobank [2] and the Swedish Life Gene cohort [3], has resulted in large repositories of data and biological samples. The significant investment in costly new-technology measurements (such as molecular and genetic information) has led to investigators using various non-standard designs [4–6] that are perceived to make optimal use of the information available in the data. These designs continue to evolve in response to various needs of research, such as savings in time and cost, targeting of informative individuals, or allowing flexibility in the sampling and optimal/ethical use of biological material contributed by volunteers. However, some of the designs have not been validated, and their adoption is motivated by pragmatic concerns, without clear recognition of the dual role of sampling design and appropriate statistical analysis in maximizing validity and precision. For example, data collected in case-control (CC) or nested CC (NCC) studies are sometimes reused to address new research questions in the same population, but the statistical analysis is regularly performed with naive methods [7,8] ignoring the long-recognized importance of incorporating the process of control selection in the analysis [9]. Naive statistical methods were also used in a recent study of prostate cancer where the authors proposed an ‘extreme CC design’ (ECC) for sampling patients [10].
(a) La Trobe University, Melbourne, Victoria 3086, Australia
(b) Saw Swee Hock School of Public Health, National University of Singapore, Singapore
(c) Clinical Epidemiology and Biostatistics, Örebro University Hospital, Örebro, Sweden
(d) School of Health and Medical Sciences, Örebro University, Örebro, Sweden
(e) Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
* Correspondence to: Marie Reilly, Karolinska Institutet, Stockholm, Sweden. E-mail: marie.reilly@ki.se
Statist. Med. 2014, 33 5388–5398

Cases were patients who died from prostate