Regions of Low Single-Nucleotide Polymorphism Incidence in Human and Orangutan Xq: Deserts and Recent Coalescences

Raymond D. Miller,1 Patricia Taillon-Miller, and Pui-Yan Kwok

Division of Dermatology, Washington University School of Medicine, St. Louis, Missouri 63110

Received August 2, 2000; accepted October 12, 2000

While scanning for single-nucleotide polymorphisms (SNPs) in the human Xq25–q28 region of CEPH families, we found six long “deserts” of low SNP incidence representing 28% of the investigated genome. One was 1.66 Mb in length. To determine whether these SNP deserts were due to reduced input of mutations or to recent coalescent events such as bottlenecks or selective sweeps, comparative sequence was determined from a female orangutan. The mean divergence was 2.9% and was not reduced in deserts compared with nondesert regions. Thus, the best explanation for the SNP deserts is recent coalescent events in humans. These events are the cause of substantial variation in human noncoding SNP incidence. In addition, the mutational spectrum in humans and orangutans was estimated as 63% AG (and CT), 17% AC (and GT), 8% CG, 4% AT, and 8% insertion/deletions. The average lifetime of a SNP destined to become fixed for a new allele between these species was estimated as 284,000 years. © 2001 Academic Press

INTRODUCTION

Single-nucleotide polymorphisms (SNPs), the predominant genetic variation within the human species, are likely to be responsible for many phenotypic differences between individuals. Existing human SNPs, created by mutation, undoubtedly represent a small survivor fraction determined and geographically apportioned by migration, by chance including demographic events, and potentially by selection (Chakravarti, 1999).
Within protein coding regions, measures of SNP variation (including SNP incidence and nucleotide diversity, the probability that a homologous nucleotide in two sequences is not the same) are reduced at sites causing coding changes compared with silent sites. This pattern has been interpreted as reflecting selection against deleterious alleles (Cargill et al., 1999; Chakravarti, 1999; Halushka et al., 1999).

In noncoding regions it is conceptually attractive as a working hypothesis to consider that human nucleotide diversity is a constant, and it has been estimated as 0.063%, an order of magnitude less than in Drosophila melanogaster (Nachman et al., 1998). However, these authors further observed that noncoding nucleotide diversity is not a constant in their study, ranging among locations from no differences to 0.184%, and they found a weak positive correlation between nucleotide diversity and the local recombination rate (Nachman et al., 1998). Other studies in humans have also detected a range of values for noncoding nucleotide diversity (Chakravarti, 1999; Nachman and Crowell, 2000; Taillon-Miller et al., 1998). In strains of the laboratory mouse, more STSs contain no SNPs and more STSs contain multiple SNPs than would be expected based on a Poisson distribution (Lindblad-Toh et al., 2000). The question arises: in human noncoding regions, why does diversity of diversity exist?

In consonance with ideas of evolutionary genetics, the incidence of noncoding SNPs in a region is the result of three factors: first, the input rate of mutations forming SNPs; second, the removal rate of individual SNPs; and third, the time since the region had a single ancestral sequence. Since factor 2, the removal rate (fixation) of individual SNPs, is not a regional phenomenon, the explanation of differences in regional incidence of SNPs must depend upon factor 1, the input rate of mutations, or factor 3, the time since the region had a single ancestral sequence.
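The definition of nucleotide diversity used above (the probability that a homologous nucleotide in two sequences is not the same) can be illustrated with a minimal Python sketch. The function name `nucleotide_diversity` and the toy alignment are hypothetical and are not data from this study; the sketch assumes pre-aligned, gap-free sequences of equal length and simply averages the per-site mismatch proportion over all pairs of sequences.

```python
from itertools import combinations


def nucleotide_diversity(sequences):
    """Mean proportion of differing sites over all pairs of aligned sequences.

    Assumption: sequences are already aligned, have equal length, and
    contain no gaps or ambiguity codes.
    """
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    length = len(sequences[0])
    if length == 0 or any(len(s) != length for s in sequences):
        raise ValueError("sequences must be non-empty and of equal length")

    total = 0.0
    pairs = 0
    for a, b in combinations(sequences, 2):
        # Count sites at which this pair of sequences differs.
        diffs = sum(1 for x, y in zip(a, b) if x != y)
        total += diffs / length
        pairs += 1
    return total / pairs


# Hypothetical toy alignment: four chromosomes, ten sites, two variable sites.
toy_alignment = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAC"]
pi = nucleotide_diversity(toy_alignment)
print(f"nucleotide diversity = {pi:.3f}")
```

For this toy alignment the value is 0.1 (10%), far above the 0.063% cited for human noncoding regions; the short sequences are used only to keep the arithmetic easy to follow.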
One reason that factor 1, the input rate of mutations, might differ between two regions of the genome is that the underlying local mutation rate could be different due to base composition and/or the effect of unknown higher order structures on the access of DNA repair molecules. For example, by the classical neutral model, n = 4Nu + 1, where n is the effective number of alleles (a measure of variation), N is the effective population size, and u is the mutation rate (Kimura, 1983).

Sequence data from this article have been deposited with the EMBL/GenBank Data Libraries under Accession Nos. AF280892–AF280938 (orangutan sequence) and G65815–G65814 (additional STSs). Also see accession numbers for STSs and SNPs in Taillon-Miller and Kwok (2000, Genomics 65, 195–202).

1 To whom correspondence should be addressed at Division of Dermatology, Box 8123, Washington University School of Medicine, 660 S. Euclid Avenue, St. Louis, MO 63110. Telephone: (314) 362-8199. Fax: (314) 362-8159. E-mail: rmiller@psts.wustl.edu.

Genomics 71, 78–88 (2001). doi:10.1006/geno.2000.6417, available online at http://www.idealibrary.com. 0888-7543/01 $35.00. Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved.

If u, the mutation