470 VOLUME 45 | NUMBER 5 | MAY 2013 | NATURE GENETICS
CORRESPONDENCE
The reason for our method’s relative benefits in addressing this specific problem of confounding due to spatial structure and rare variants, despite its general-purpose motivation, is quite simple. Our approach⁵⁻⁷ is inspired by the fact that the SNPs used to estimate similarity for the LMM can equivalently be viewed as random covariates in linear regression⁸. Given this equivalence, it becomes clear that one should use particular SNPs to estimate genetic similarity precisely when they provide information about the phenotype (either because of confounding or because they are causal)⁶, rather than using all available SNPs, as has been the traditional LMM practice⁴,⁹. Following this equivalence to its logical conclusion yielded our new approach⁶,⁷, which can be succinctly
To the Editor:
A recent report by Mathieson and McVean¹ showed that confounding in genome-wide association studies (GWAS) resulting from spatially structured populations in conjunction with rare variants could not be corrected by currently available statistical genetics methods. In particular, when the non-genetic cause of disease was simulated as arising from a sharply defined spatial region, genomic control², principal-component analysis (PCA)³ and linear mixed models (LMMs)⁴,⁵ all failed to correct for stratification, resulting in systematically inflated test statistics¹. Although several research avenues were proposed as possible solutions to the problem¹, none has so far been shown to work. Additionally, it was speculated that any method that could correct for such confounding would require fine-grained geographic information¹.
As it turns out, our recently published LMM algorithm, FaST-LMM-Select⁵⁻⁷, which was not specifically designed to address the particular problem of confounding due to sharply defined spatial structure and rare variants but rather to tackle general types of confounding, does address this problem. Furthermore, it does so without the need for any geographic information. In fact, our approach⁵⁻⁷ yields non-inflated test statistics and maintains maximal power to detect (spatially unstructured) causal SNPs using only SNP and phenotype data.
Specifically, to examine inflation, we used the simulated data from Figure 3b of Mathieson and McVean¹, comprising 10 synthetic data sets, each with 1,000 SNPs for 800 individuals. Population structure was simulated using a lattice grid, and non-genetic risk was sampled from sharply localized geographic risk on the lattice. When compared to genomic control, linear regression, a traditional LMM⁴, PCA and rare-variant versions of it, FaST-LMM-Select⁵⁻⁷ was the only method that did not lead to inflated test statistics on these data (Fig. 1a and ref. 1).
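Inflation of this kind is usually judged from a quantile-quantile plot of –log₁₀(P values) against their expectation under the null, often summarized by a genomic inflation factor. The following is a minimal sketch of both quantities, not the code used for the analyses reported here. The function names, the `(rank - 0.5) / n` plotting positions and the median-based inflation factor are illustrative assumptions; P values are assumed to come from 1-degree-of-freedom tests and to lie in (0, 1].

```python
import math
import random
from statistics import NormalDist, median


def qq_points(p_values):
    """Pair expected and observed -log10(P), smallest P first (hypothetical helper)."""
    n = len(p_values)
    observed = sorted((-math.log10(p) for p in p_values), reverse=True)
    # Assumed plotting positions: the rank-th smallest of n uniform P values
    # is expected near (rank - 0.5) / n.
    expected = [-math.log10((rank - 0.5) / n) for rank in range(1, n + 1)]
    return list(zip(expected, observed))


def inflation_factor(p_values):
    """Median observed 1-d.f. chi-square statistic over its null median (hypothetical helper)."""
    norm = NormalDist()
    # A two-sided P value p corresponds to z with |z| = -inv_cdf(p / 2); chi2 = z^2.
    chi2 = [norm.inv_cdf(p / 2.0) ** 2 for p in p_values]
    null_median = norm.inv_cdf(0.75) ** 2  # median of chi-square(1), about 0.455
    return median(chi2) / null_median


rng = random.Random(1)
null_p = [1.0 - rng.random() for _ in range(10000)]  # uniform on (0, 1]
inflated_p = [p * p for p in null_p]                 # artificially small P values

lambda_null = inflation_factor(null_p)
lambda_inflated = inflation_factor(inflated_p)
points = qq_points(null_p)
```

A well-calibrated method should give an inflation factor close to 1 and points that track the diagonal; values well above 1 correspond to the upward departure seen for inflated methods in a quantile-quantile plot.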
To examine which method had the greatest power, we augmented this simulated data set with a further 1,000 simulated rare causal SNPs, generated independently from binomial distributions with minor allele frequencies (MAFs) drawn uniformly between 0.1% and 4% (MAF of 4% was the cutoff used in practice by Mathieson and McVean). Next, a realized relationship matrix (RRM)⁸ was constructed from the 1,000 causal SNPs. Finally, the genetic signal was sampled from a zero-mean Gaussian distribution with covariance set to the RRM. On these data, our approach showed higher power to detect causal variants than linear regression or a traditional LMM⁴ (Fig. 1b).
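The three simulation steps above can be sketched as follows. This is a small illustration under stated assumptions, not the code used to produce Fig. 1b: the function names are hypothetical, genotypes are assumed to be diploid minor-allele counts, the RRM is assumed to be Z Zᵀ / m for standardized genotypes Z, and SNPs that turn out monomorphic in the sample are dropped. The genetic signal is drawn as a sum of SNP effects with independent standard normal weights, which has covariance equal to the RRM.

```python
import math
import random


def simulate_rare_snps(n_individuals, n_snps, rng, maf_low=0.001, maf_high=0.04):
    """Return SNP columns of minor-allele counts (0, 1 or 2) per individual."""
    columns = []
    for _ in range(n_snps):
        maf = rng.uniform(maf_low, maf_high)
        # Binomial(2, maf): two independent allele draws per diploid individual.
        columns.append([int(rng.random() < maf) + int(rng.random() < maf)
                        for _ in range(n_individuals)])
    return columns


def standardize(columns):
    """Center each SNP and scale to unit variance; drop monomorphic SNPs (assumption)."""
    standardized = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        if var > 0:
            sd = math.sqrt(var)
            standardized.append([(x - mean) / sd for x in col])
    return standardized


def realized_relationship_matrix(z_columns):
    """Assumed RRM definition: Z Z^T / m for n individuals and m standardized SNPs."""
    m = len(z_columns)
    n = len(z_columns[0])
    return [[sum(col[i] * col[j] for col in z_columns) / m for j in range(n)]
            for i in range(n)]


def sample_genetic_signal(z_columns, rng):
    """Draw g = Z b / sqrt(m) with b ~ N(0, I), so that Cov(g) = Z Z^T / m."""
    m = len(z_columns)
    n = len(z_columns[0])
    effects = [rng.gauss(0.0, 1.0) for _ in range(m)]
    return [sum(b * col[i] for b, col in zip(effects, z_columns)) / math.sqrt(m)
            for i in range(n)]


rng = random.Random(0)
snps = simulate_rare_snps(n_individuals=50, n_snps=200, rng=rng)  # toy sizes
z = standardize(snps)
rrm = realized_relationship_matrix(z)
signal = sample_genetic_signal(z, rng)
```

The toy sizes keep the pure-Python loops fast; the simulation described above used 800 individuals and 1,000 causal SNPs. Drawing the signal through per-SNP random effects rather than factorizing the RRM also makes concrete the view of SNPs as random covariates in linear regression.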
FaST-LMM-Select for addressing confounding from spatial structure and rare variants
Figure 1 Comparison of three methods for genome-wide association analyses in the presence of confounding due to spatial structure and rare variants. The methods compared were linear regression, the traditional LMM⁴ and FaST-LMM-Select⁶. (a) Quantile-quantile plots of the –log₁₀(P values) of 10,000 SNPs for one phenotype, generated by Mathieson and McVean¹, created using the sharply defined, spatial non-genetic cause from their paper¹. By construction, all SNP hypotheses are drawn from the null distribution (no association). The solid green line denotes the theoretical null distribution with a 99% confidence interval (dashed green lines). (b) Receiver operating characteristic curve for the data in a augmented with data from 1,000 causal SNPs. Plots in a and b both show the average over ten independent experiments. In all LMM results, the RRM was used as the measure of genetic similarity, and parameters were fitted using FaST-LMM⁵,⁶. Note that, in b, one RRM was used to generate a synthetic phenotype, whereas RRMs used to analyze the data were constructed using our feature selection approach⁶ on the 11,000 SNPs.
[Figure 1 plots. Panel a: observed –log₁₀(P values) versus expected –log₁₀(P values). Panel b: true positive rate versus false positive rate. Legend: linear regression, traditional LMM, FaST-LMM-Select.]
© 2013 Nature America, Inc. All rights reserved.