ARTICLE
Population Analysis of Large Copy Number Variants and Hotspots of Human Genetic Disease
Andy Itsara,1,7 Gregory M. Cooper,1,7 Carl Baker,1 Santhosh Girirajan,1 Jun Li,2 Devin Absher,3 Ronald M. Krauss,4 Richard M. Myers,3 Paul M. Ridker,5 Daniel I. Chasman,5 Heather Mefford,1 Phyllis Ying,1 Deborah A. Nickerson,1 and Evan E. Eichler1,6,*
Copy number variants (CNVs) contribute to human genetic and phenotypic diversity. However, the distribution of larger CNVs in the general population remains largely unexplored. We identify large variants in ~2500 individuals by using Illumina SNP data, with an emphasis on "hotspots" prone to recurrent mutations. We find variants larger than 500 kb in 5%–10% of individuals and variants larger than 1 Mb in 1%–2%. In contrast to previous studies, we find limited evidence for stratification of CNVs in geographically distinct human populations. Importantly, our sample size permits a robust distinction between truly rare and polymorphic but low-frequency copy number variation. We find that a significant fraction of individual CNVs larger than 100 kb are rare and that both gene density and size are strongly anticorrelated with allele frequency. Thus, although large CNVs commonly exist in normal individuals, which suggests that size alone cannot be used as a predictor of pathogenicity, such variation is generally deleterious. Considering these observations, we combine our data with published CNVs from more than 12,000 individuals contrasting control and neurological disease collections. This analysis identifies known disease loci and highlights additional CNVs (e.g., 3q29, 16p12, and 15q25.2) for further investigation. This study provides one of the first analyses of large, rare (0.1%–1%) CNVs in the general population, with insights relevant to future analyses of genetic disease.
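The abstract's point that sample size determines whether a variant can be called truly rare can be illustrated with a short calculation. The sketch below is not the authors' method. It assumes, purely for illustration, that each individual independently carries a given CNV with some carrier frequency, and it computes an exact one-sided binomial upper bound on that frequency from the number of carriers observed. The function names `binom_cdf` and `upper_bound`, the 95% confidence level, and the cohort size of 250 are hypothetical choices; 2500 echoes the approximate cohort size stated above.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p). Intended for small k."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def upper_bound(k: int, n: int, alpha: float = 0.05) -> float:
    """One-sided exact upper confidence bound on carrier frequency.

    Finds, by bisection, the frequency p at which observing k or fewer
    carriers among n individuals has probability alpha.
    """
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if binom_cdf(k, n, mid) > alpha:
            lo = mid  # k or fewer carriers is still plausible: bound lies higher
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    # Illustrative cohort sizes: a few hundred versus a few thousand individuals.
    for n in (250, 2500):
        print(f"n={n}: singleton CNV, 95% upper bound on frequency = {upper_bound(1, n):.4f}")
```

Under these assumptions, a CNV seen once among 250 individuals has an upper bound above 1%, so it cannot be distinguished from a low-frequency polymorphism, whereas a CNV seen once among 2500 individuals has an upper bound below 1%.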
Introduction
Copy number variants (CNVs) are insertions, deletions, and duplications of genomic sequence ranging from a kilobase to multiple megabase pairs in length and are major contributors to human genetic diversity.1–5 CNVs are known to influence both normal and disease variation,6 and there are at least two distinct, but nonexclusive, models of CNV-phenotype associations. One model involves common copy number polymorphisms (CNPs), often with multiple allelic states defined by variation in copy number and/or genomic structure. CNP genes are enriched for biological functions associated with drug response, immunity, and sensory perception, among others.7–9 Under this model, common variants that change the dosage of genes or other functional elements influence phenotypes such as HIV-1/AIDS susceptibility (MIM 609423),10 Crohn’s disease (MIM 266600),11 and glomerulonephritis in systemic lupus erythematosus (MIM 152700).12
A second model involves rare CNVs that delete or duplicate typically larger genomic segments and exist in fewer allelic states (i.e., hemizygous or trisomic). These CNVs are highly penetrant and short-lived in the population, either occurring de novo or persisting for only a few generations within a pedigree. A large fraction of these variants arise by nonallelic homologous recombination (NAHR) between segmental duplications or low-copy repeats. The resulting conditions were originally defined as genomic disorders,13 and there are now dozens of clinically recognized syndromes, associated with cognitive deficits, diabetes, epilepsy, and other traits, that result from recurrent NAHR-mediated events. In some cases, variants that overlap but are distinct lead to a similar syndrome,13–17 whereas in other cases the phenotype is more variable.18–21 Additionally, recent studies of autism (MIM 209850) and schizophrenia (MIM 181500) found an overall excess of rare CNVs in affected individuals relative to unaffected individuals, suggesting that some of the rare variants present in affected individuals are pathogenic.22–25 Thus, although only a limited number of rare variants have been definitively associated with disease, it is likely that a large fraction of CNV-trait associations conform to a "common disease-rare variant" hypothesis, in contrast to the "common disease-common variant" hypothesis that underpins most genome-wide association studies.
Understanding the extent to which rare CNVs influence phenotypes requires deep analyses in both disease and normal populations. Previous studies of copy number variation in human populations have largely been restricted to hundreds of individuals and have therefore been unable to distinguish variants that are truly rare (<1%) from those that are polymorphic but present at low frequency.2,26 Recent studies have begun to expand to substantially larger sample collections but have focused on analyses of specific diseases rather than the broader genomic effects of large, rare CNVs.1–4,23,25 Here, we analyze copy number variation
1 Department of Genome Sciences, School of Medicine, University of Washington, Seattle, WA 98195, USA
2 Department of Human Genetics, University of Michigan, Ann Arbor, MI 48109, USA
3 HudsonAlpha Institute for Biotechnology, Huntsville, AL 35806, USA
4 Children’s Hospital Oakland Research Institute, Oakland, CA 94609, USA
5 Center for Cardiovascular Disease Prevention, Donald W. Reynolds Center for Cardiovascular Research, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, USA
6 Howard Hughes Medical Institute
7 These authors contributed equally to this work
*Correspondence: eee@gs.washington.edu
DOI 10.1016/j.ajhg.2008.12.014. ©2009 by The American Society of Human Genetics. All rights reserved.
148 The American Journal of Human Genetics 84, 148–161, February 13, 2009