Genomic screening and replication using the same data set in family-based association testing
Kristel Van Steen¹, Matthew B McQueen², Alan Herbert³, Benjamin Raby⁴, Helen Lyon⁴,⁵, Dawn L DeMeo⁴,
Amy Murphy¹, Jessica Su², Soma Datta⁴, Carsten Rosenow⁶, Michael Christman³, Edwin K Silverman⁴,
Nan M Laird¹, Scott T Weiss⁴ & Christoph Lange¹,⁴
The Human Genome Project and its spin-offs are making it increasingly feasible to determine the genetic basis of complex traits
using genome-wide association studies. The statistical challenge of analyzing such studies stems from the severe multiple-
comparison problem resulting from the analysis of thousands of SNPs. Our methodology for genome-wide family-based
association studies, using single SNPs or haplotypes, can identify associations that achieve genome-wide significance. To
develop guidelines for our screening tools, we determined lower bounds for the estimated power to detect the gene
carrying the disease-susceptibility locus, which hold regardless of the linkage disequilibrium structure present in the data. We
also assessed the power of our approach in the presence of multiple disease-susceptibility loci. Our screening tools accommodate
genomic control and use the concept of haplotype-tagging SNPs. Unlike population-based designs, our methods use the entire
sample and do not require separate screening and validation samples to establish genome-wide significance.
In humans, SNPs are the most common type of genetic variation;
eight million SNPs have already been documented and deposited in
the dbSNP database¹,². Their dense distribution across the genome
and their low mutation rate make them ideal markers for large-scale
genome-wide association studies for common complex diseases³.
The success of genome-wide association studies will depend on
whether the increase in numbers of SNPs can be translated into
increased statistical power or whether the positive effects of the
increased number of SNPs will be diluted by the multiple-comparison
problem. When thousands of SNPs are tested for association, the P
values need to be adjusted for the number of tests computed to
control type I error rates, which include the family-wise error rate and
the false discovery rate (FDR). Multiple-testing procedures such as
those proposed by Bonferroni⁴ and Hochberg⁵ adjust P values to
control the family-wise error rate. They often generate unrealistically
small significance levels for the individual tests, in part because the
dependence between test statistics is ignored. Alternative multiple-testing
approaches control the FDR⁶,⁷. Most procedures become more
conservative as more tests are done.
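The adjustments discussed above can be illustrated with a minimal Python sketch. It computes Bonferroni-adjusted and Hochberg step-up adjusted P values (both aimed at the family-wise error rate) and, as one common FDR-controlling choice, Benjamini-Hochberg adjusted P values. The choice of Benjamini-Hochberg is an assumption made for illustration; the text does not state which FDR procedure is meant, and the function names are hypothetical.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted P values: multiply by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def hochberg(pvalues):
    """Hochberg step-up adjusted P values (family-wise error rate)."""
    m = len(pvalues)
    # Walk from the largest P value down, keeping a running minimum.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank, i in enumerate(order):
        # The largest P value gets multiplier 1, the next 2, and so on.
        running_min = min(running_min, (rank + 1) * pvalues[i])
        adjusted[i] = running_min
    return adjusted


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values (assumed FDR procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank, i in enumerate(order):
        k = m - rank  # ascending rank of this P value (1 = smallest)
        running_min = min(running_min, pvalues[i] * m / k)
        adjusted[i] = running_min
    return adjusted


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under Bonferroni."""
    return alpha / n_tests
```

With a fixed overall level, `bonferroni_threshold(0.05, 100)` is ten times larger than `bonferroni_threshold(0.05, 1000)`, which is the sense in which the correction grows more conservative as tests are added.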
Ideally, SNP-reduction techniques are applied first, so that the
number of association tests is diminished and the correction for
multiple testing is less severe. To avoid biasing test results, the data
used in the reduction process should differ from the data used for
testing. For family data, it is possible to create two sources of
information⁸,⁹ using one sample. The basic idea is to estimate
the genetic effect using a regression model that is statistically independent
of the family-based analysis, using data from all families.
The genetic effect estimate for each SNP is used to screen and select
SNPs for association testing. The association testing on a much
smaller set of SNPs uses family-based tests (FBATs), which are robust
to population admixture.
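The screen-then-test idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the screening scores and family-based test P values are taken as given inputs (how they are computed is not shown), a higher score is assumed to mean a more promising SNP, and a Bonferroni correction over the selected SNPs is assumed for the testing step. The function and parameter names are hypothetical and are not part of PBAT.

```python
def screen_then_test(screening_scores, test_pvalues, n_select, alpha=0.05):
    """Select the top-ranked SNPs by screening score, then test only those.

    screening_scores: dict mapping SNP id -> screening score (assumed input,
        higher = more promising).
    test_pvalues: dict mapping SNP id -> P value of the family-based test
        (assumed input, statistically independent of the screening score).
    n_select: number of SNPs carried forward to the testing step.
    alpha: overall significance level for the testing step.
    """
    if n_select < 1:
        raise ValueError("n_select must be at least 1")
    # Rank SNPs from most to least promising and keep the top n_select.
    ranked = sorted(screening_scores, key=screening_scores.get, reverse=True)
    selected = ranked[:n_select]
    if not selected:
        return []
    # Correct only for the SNPs actually tested (assumed Bonferroni).
    threshold = alpha / len(selected)
    return [snp for snp in selected if test_pvalues[snp] <= threshold]


# Toy inputs, invented purely for illustration.
scores = {"rs1": 0.9, "rs2": 0.1, "rs3": 0.7, "rs4": 0.2}
pvals = {"rs1": 0.004, "rs2": 0.0001, "rs3": 0.03, "rs4": 0.5}
significant = screen_then_test(scores, pvals, n_select=2)
```

Because only two SNPs are tested, each is compared against 0.05/2 rather than 0.05/4. The sketch is valid only if the screening score carries no information about the test statistic under the null hypothesis, which is the independence property described above.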
Here we report new strategies for genomic screening. We derived
lower bounds for the estimated power of the screening method to
detect a gene carrying a disease-susceptibility locus (DSL), regardless
of the linkage disequilibrium (LD) structure. We show that population
stratification has a minimal effect on power and illustrate the potential
of the approach for genome-wide association studies using the software
package PBAT¹⁰,¹¹.
RESULTS
Simulation studies: power
We assessed the power of the testing strategies using simulations. We
used 291 SNPs in candidate genes from 651 trios in the Childhood
Asthma Management Program (CAMP) Genetics Ancillary Study¹²,
who were affected with mild to moderate asthma. We chose the
interleukin gene IL10 on chromosome 1 as the DSL¹³. We selected
each of the six typed SNPs in IL10 individually as the DSL and, for
each offspring, simulated a trait value $Y_{ij}$ from the normal distribution
with unit variance, $Y_{ij} \sim N(aX_{ij}, 1)$, where $a$ denotes the genetic effect
size and $X_{ij}$ denotes the observed marker score of the selected SNP in
Published online 5 June 2005; doi:10.1038/ng1582
Departments of ¹Biostatistics and ²Epidemiology, Harvard School of Public Health, Boston, Massachusetts 02115, USA.
³Department of Genetics and Genomics, Boston University School of Medicine, Boston, Massachusetts 02115, USA.
⁴Channing Laboratory, Harvard Medical School, Boston, Massachusetts 02115, USA.
⁵Division of Genetics, Children’s Hospital, Boston, Massachusetts 02115, USA.
⁶Genomics Collaboration Genotyping, Affymetrix, Inc., Santa Clara, California 95051, USA.
Correspondence should be addressed to K.V.S. (kvanstee@hsph.harvard.edu).
NATURE GENETICS VOLUME 37 | NUMBER 7 | JULY 2005 683
ARTICLES
© 2005 Nature Publishing Group http://www.nature.com/naturegenetics