Genomic screening and replication using the same data set in family-based association testing

Kristel Van Steen 1, Matthew B McQueen 2, Alan Herbert 3, Benjamin Raby 4, Helen Lyon 4,5, Dawn L DeMeo 4, Amy Murphy 1, Jessica Su 2, Soma Datta 4, Carsten Rosenow 6, Michael Christman 3, Edwin K Silverman 4, Nan M Laird 1, Scott T Weiss 4 & Christoph Lange 1,4

The Human Genome Project and its spin-offs are making it increasingly feasible to determine the genetic basis of complex traits using genome-wide association studies. The statistical challenge of analyzing such studies stems from the severe multiple-comparison problem resulting from the analysis of thousands of SNPs. Our methodology for genome-wide family-based association studies, using single SNPs or haplotypes, can identify associations that achieve genome-wide significance. In relation to developing guidelines for our screening tools, we determined lower bounds for the estimated power to detect the gene underlying the disease-susceptibility locus, which hold regardless of the linkage disequilibrium structure present in the data. We also assessed the power of our approach in the presence of multiple disease-susceptibility loci. Our screening tools accommodate genomic control and use the concept of haplotype-tagging SNPs. Our methods use the entire sample and do not require separate screening and validation samples to establish genome-wide significance, as population-based designs do.

In humans, SNPs are the most common type of genetic variation; eight million SNPs have already been documented and deposited in the dbSNP database 1,2. Their dense distribution across the genome and their low mutation rate make them ideal markers for large-scale genome-wide association studies for common complex diseases 3.
The success of genome-wide association studies will depend on whether the increase in numbers of SNPs can be translated into increased statistical power or whether the positive effects of the increased number of SNPs will be diluted by the multiple-comparison problem. When thousands of SNPs are tested for association, the P values need to be adjusted for the number of tests computed to control type I error rates, which include the family-wise error rate and the false discovery rate (FDR). Multiple-testing procedures such as those proposed by Bonferroni 4 and Hochberg 5 adjust P values to control the family-wise error rate. They often generate unrealistically small significance levels for the individual tests, in part because the dependence between test statistics is ignored. Alternative multiple-testing approaches control the FDR 6,7. Most procedures become more conservative as more tests are done.

Ideally, SNP-reduction techniques are applied first, so that the number of association tests is diminished and the correction for multiple testing is less severe. To avoid biasing test results, the data used in the reduction process should differ from the data used for testing. For family data, it is possible to create two sources of information 8,9 using one sample. The basic idea is to estimate the genetic effect using a regression model that is statistically independent of the family-based analysis, using data from all families. The genetic effect estimate for each SNP is used to screen and select SNPs for association testing. The association testing on a much smaller set of SNPs uses family-based tests (FBATs), which are robust to population admixture.

Here we report new strategies for genomic screening. We derived lower bounds for the estimated power of the screening method to detect a gene carrying a disease-susceptibility locus (DSL), regardless of the linkage disequilibrium (LD) structure.
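The multiple-testing corrections named in this introduction (Bonferroni, Hochberg and FDR control) can be illustrated with a minimal sketch. The function names and the five example P values below are hypothetical and chosen only for illustration; the FDR procedure shown is the Benjamini-Hochberg step-up rule, which is assumed here as a representative FDR method. None of this is part of PBAT.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted P values: multiply each by the number of tests."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def hochberg(pvals):
    """Hochberg step-up adjusted P values (family-wise error rate)."""
    m = len(pvals)
    # Walk from the largest P value down to the smallest.
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for k, i in enumerate(order, start=1):
        # The k-th largest P value is multiplied by k.
        running_min = min(running_min, pvals[i] * k)
        adjusted[i] = running_min
    return adjusted


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up adjusted P values (false discovery rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for k, i in enumerate(order):
        rank = m - k  # ascending rank of this P value (1 = smallest)
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical P values for five SNPs, for illustration only.
    example = [0.001, 0.008, 0.039, 0.041, 0.2]
    for name, fn in [("Bonferroni", bonferroni),
                     ("Hochberg", hochberg),
                     ("Benjamini-Hochberg", benjamini_hochberg)]:
        print(name, [round(p, 5) for p in fn(example)])
```

The sketch also shows why reducing the number of tests helps: with a nominal level of 0.05 (an assumed value), a Bonferroni correction over 291 SNPs requires an unadjusted P value below roughly 1.7 × 10^-4, whereas a correction over a hypothetical screened subset of 10 SNPs requires only P < 0.005.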
We show that population stratification has a minimal effect on power and illustrate the potential of the approach for genome-wide association studies using the software package PBAT 10,11.

Published online 5 June 2005; doi:10.1038/ng1582. Nature Genetics, volume 37, number 7, July 2005, p. 683. © 2005 Nature Publishing Group, http://www.nature.com/naturegenetics

Departments of 1 Biostatistics and 2 Epidemiology, Harvard School of Public Health, Boston, Massachusetts 02115, USA. 3 Department of Genetics and Genomics, Boston University School of Medicine, Boston, Massachusetts 02115, USA. 4 Channing Laboratory, Harvard Medical School, Boston, Massachusetts 02115, USA. 5 Division of Genetics, Children’s Hospital, Boston, Massachusetts 02115, USA. 6 Genomics Collaboration Genotyping, Affymetrix, Inc., Santa Clara, California 95051, USA. Correspondence should be addressed to K.V.S. (kvanstee@hsph.harvard.edu).

RESULTS

Simulation studies: power

We assessed the power of the testing strategies using simulations. We used 291 SNPs in candidate genes from 651 trios in the Childhood Asthma Management Program (CAMP) Genetics Ancillary Study 12, who were affected with mild to moderate asthma. We chose the interleukin gene IL10 on chromosome 1 as the DSL 13. We selected each of the six typed SNPs in IL10 individually as the DSL and, for each offspring, simulated a trait value $Y_{ij}$ from the normal distribution with unit variance, $Y_{ij} \sim N(aX_{ij}, 1)$, where $a$ denotes the genetic effect size and $X_{ij}$ denotes the observed marker score of the selected SNP in