470 VOLUME 45 | NUMBER 5 | MAY 2013 | NATURE GENETICS CORRESPONDENCE

The reason for our method’s relative benefits in addressing this specific problem of confounding due to spatial structure and rare variants, despite its general-purpose motivation, is quite simple. Our approach 5–7 is inspired by the fact that the SNPs used to estimate similarity for the LMM can equivalently be viewed as random covariates in linear regression 8. Given this equivalence, it becomes clear that one should use particular SNPs to estimate genetic similarity precisely when they provide information about the phenotype (either because of confounding or because they are causal) 6, rather than using all available SNPs, as has been the traditional LMM practice 4,9. Following this equivalence to its logical conclusion yielded our new approach 6,7, which can be succinctly

To the Editor: A recent report by Mathieson and McVean 1 showed that confounding in genome-wide association studies (GWAS) resulting from spatially structured populations in conjunction with rare variants could not be corrected by currently available statistical genetics methods. In particular, in simulations where the non-genetic cause of disease arose from a sharply defined spatial region, genomic control 2, principal-component analysis (PCA) 3 and linear mixed models (LMMs) 4,5 all failed to correct for stratification, resulting in systematically inflated test statistics 1. Although several research avenues were proposed as possible solutions to the problem 1, none has so far been shown to work. Additionally, it was speculated that any method that could correct for such confounding would require fine-grained geographic information 1.
As it turns out, our recently published LMM algorithm, FaST-LMM-Select 5–7, does address this problem, even though it was designed to tackle general types of confounding rather than the particular problem of confounding due to sharply defined spatial structure and rare variants. Furthermore, it does so without the need for any geographic information. In fact, our approach 5–7 yields non-inflated test statistics and maintains maximal power to detect (spatially unstructured) causal SNPs using only SNP and phenotype data.

Specifically, to examine inflation, we used the simulated data from Figure 3b of Mathieson and McVean 1, comprising 10 synthetic data sets, each with 10,000 SNPs for 800 individuals. Population structure was simulated using a lattice grid, and non-genetic risk was sampled from sharply localized geographic risk on the lattice. When compared to genomic control, linear regression, a traditional LMM 4, PCA and rare-variant versions of it, FaST-LMM-Select 5–7 was the only method that did not lead to inflated test statistics on these data (Fig. 1a and ref. 1).

To examine which method had the greatest power, we augmented this simulated data set with 1,000 additional rare causal SNPs, generated independently from binomial distributions with minor allele frequencies (MAFs) drawn uniformly between 0.1% and 4% (a MAF of 4% was the cutoff used in practice by Mathieson and McVean). Next, a realized relationship matrix (RRM) 8 was constructed from the 1,000 causal SNPs. Finally, the genetic signal was sampled from a zero-mean Gaussian distribution with covariance set to the RRM. On these data, our approach showed higher power to detect causal variants than linear regression or a traditional LMM 4 (Fig. 1b).
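As a rough illustration of the augmentation step described above, the following Python sketch draws rare causal SNPs from independent binomials, builds an RRM from them, and samples a genetic signal from a zero-mean Gaussian with that covariance. It is not the authors' code. The function names, the random seed, the column standardization used for the RRM, and the dropping of SNPs that happen to be monomorphic in the sample are all assumptions made for illustration; the letter does not specify these details.

```python
import numpy as np


def simulate_rare_snps(n_individuals, n_snps, rng, maf_low=0.001, maf_high=0.04):
    """Draw minor-allele counts (0, 1 or 2) from independent binomials.

    Each SNP gets its own MAF, drawn uniformly between maf_low and maf_high.
    """
    mafs = rng.uniform(maf_low, maf_high, size=n_snps)
    genotypes = rng.binomial(2, mafs, size=(n_individuals, n_snps))
    return genotypes.astype(float), mafs


def realized_relationship_matrix(genotypes):
    """Return Z Z^T / m, where Z holds column-standardized genotypes.

    Assumption: SNPs are centered and scaled by their sample mean and
    standard deviation, and SNPs with zero sample variance are dropped.
    """
    std = genotypes.std(axis=0)
    keep = std > 0
    z = (genotypes[:, keep] - genotypes[:, keep].mean(axis=0)) / std[keep]
    return z @ z.T / z.shape[1]


def sample_genetic_signal(rrm, rng):
    """Sample one draw from N(0, rrm).

    The RRM is positive semidefinite but typically rank deficient, so an
    eigendecomposition is used instead of a Cholesky factorization.
    """
    eigvals, eigvecs = np.linalg.eigh(rrm)
    eigvals = np.clip(eigvals, 0.0, None)  # remove tiny negative round-off
    return eigvecs @ (np.sqrt(eigvals) * rng.standard_normal(eigvals.shape[0]))


# Sizes follow the text: 800 individuals and 1,000 rare causal SNPs.
rng = np.random.default_rng(0)  # the seed is arbitrary
causal_snps, causal_mafs = simulate_rare_snps(800, 1000, rng)
rrm = realized_relationship_matrix(causal_snps)
genetic_signal = sample_genetic_signal(rrm, rng)
print(rrm.shape, genetic_signal.shape)
```

With MAFs as low as 0.1% and only 800 individuals, some simulated SNPs carry no minor alleles in the sample, which is why the sketch filters zero-variance columns before standardizing.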
FaST-LMM-Select for addressing confounding from spatial structure and rare variants

Figure 1 Comparison of three methods for genome-wide association analyses in the presence of confounding due to spatial structure and rare variants. The methods compared were linear regression, the traditional LMM 4 and FaST-LMM-Select 6. (a) Quantile-quantile plots of the –log10(P values) of 10,000 SNPs for one phenotype, generated by Mathieson and McVean 1, created using the sharply defined, spatial non-genetic cause from their paper 1. By construction, all SNP hypotheses are drawn from the null distribution (no association). The solid green line denotes the theoretical null distribution with a 99% confidence interval (dashed green lines). (b) Receiver operating characteristic curve for the data in a augmented with data from 1,000 causal SNPs. Plots in a and b both show the average over ten independent experiments. In all LMM results, the RRM was used as the measure of genetic similarity, and parameters were fitted using FaST-LMM 5,6. Note that, in b, one RRM was used to generate a synthetic phenotype, whereas RRMs used to analyze the data were constructed using our feature selection approach 6 on the 11,000 SNPs.

[Figure 1 plots. Panel a: x-axis, Expected –log10(P values); y-axis, Observed –log10(P values). Panel b: x-axis, False positive rate; y-axis, True positive rate. Legend: Linear regression, Traditional LMM, FaST-LMM-Select.]

© 2013 Nature America, Inc. All rights reserved.