Evolutionary Discrimination of Mammalian Conserved Non-Genic Sequences (CNGs)

Emmanouil T. Dermitzakis,1*† Alexandre Reymond,1† Nathalie Scamuffa,1 Catherine Ucla,1 Ewen Kirkness,2 Colette Rossier,1 Stylianos E. Antonarakis1*

Analysis of the human and mouse genomes identified an abundance of conserved non-genic sequences (CNGs). The significance and evolutionary depth of their conservation remain unanswered. We have quantified levels and patterns of conservation of 191 CNGs of human chromosome 21 in 14 mammalian species. We found that CNGs are significantly more conserved than protein-coding genes and noncoding RNAs (ncRNAs) within the mammalian class from primates to monotremes to marsupials. The pattern of substitutions in CNGs differed from that seen in protein-coding and ncRNA genes and resembled that of protein-binding regions. About 0.3% to 1% of the human genome corresponds to a previously unknown class of extremely constrained CNGs shared among mammals.

Until recently, the extent of nucleotide conservation between human and other mammalian species has been unclear. Small-scale analyses between human and mouse genomes suggested conservation outside of gene regions (1–6). Comparison with the draft of the mouse genome indicated that at least 5% of the human genome was under selective constraint; surprisingly, the majority of these highly conserved sequences did not correspond to known genic sequences, and experimental attempts to test the hypothesis that they are previously unidentified genes showed that this is unlikely (7–10). In addition, a method was recently described for the identification of primate-specific functional elements (11). Computational and mathematical efforts have attempted to distinguish the conserved regulatory portion of the genome from neutrally evolving sites (12). However, no highly accurate methodology that can discriminate between different functional classes of highly conserved sequences has been developed.
In this report, we analyze 220 sequences of the 2262 CNGs that were initially identified as highly conserved between human chromosome 21 and mouse syntenic regions and that presented no evidence for transcription potential (7). We subsequently compared their evolutionary properties with protein-coding sequences (CODs) from past studies (13–15) and noncoding RNA gene sequences (ncRNAs) obtained here. To perform polymerase chain reaction (PCR) from genomic DNA of green monkey, ring-tailed lemur, brush-tailed porcupine, rabbit, pig, cat, greater mouse-eared bat, white-toothed shrew, nine-banded armadillo, African elephant, tammar wallaby, and platypus, we designed oligonucleotides on CNG and ncRNA human sequences in highly conserved regions between human and mouse. The selection of ncRNAs has its basis in criteria of orthology and sufficient conservation to design primers. Only a small subset of known ncRNAs could be used because of characteristics such as antisense to genes, small size, and unknown function.

After PCR, we obtained at least one sequence from the other 12 species alignable to human and mouse for 191 out of 220 CNGs (87%) and 14 out of 16 ncRNAs (88%). The 19 nuclear protein-coding genes had been analyzed previously (15); we aligned 12 of the 44 original species (human, strepsirrhine, mouse, hystricid, rabbit, pig, cat, free-tailed bat, shrew, armadillo, elephant, and opossum). In that study, CODs were chosen to have 80 to 95% nucleotide identity between human and mouse (14), and they were selected from a larger set because a PCR product could be obtained from all 44 species (15). Therefore, these sequences are biased for high success of amplification in other species and high conservation, issues that become relevant below. Our analyses were performed in multiple alignments of 55,519 base pairs (bp) of CNGs, 17,028 bp of CODs, and 5599 bp of ncRNAs.
To minimize biases from missing data resulting from PCR failure, we considered two CNG data sets, one of all 191 CNGs (CNG-all, fig. S1A) and another of 63 CNGs for which the sequences of at least 10 species were available, including at least one of armadillo, elephant, wallaby, or platypus (16). This second data set (CNG-high, for high species coverage; fig. S1B) is directly comparable to the CODs, which contain all 12 species' sequences. With the use of the same criteria, we considered the complete data set for all 14 ncRNAs (ncRNA-all, fig. S1C) and a subset of 5 ncRNAs with high alignment coverage (ncRNA-high, fig. S1D). Both data sets of CNGs and ncRNAs (all and high) are used below to illustrate that the missing data do not influence the observed patterns.

A large fraction of the 191 successfully amplified CNGs were highly conserved in multiple mammalian species (fig. S1, A, B, and E). Specifically, we could retrieve more than 43% of the orthologous sequences from wallaby and/or platypus. High sequence conservation was evident even in the presence of species-specific substitution biases [e.g., A-T to G-C bias in mouse, porcupine, rabbit, and elephant (17)] that increase the substitution rate, providing additional support for the significant role of CNGs. The divergence values of CNGs were much lower than those of CODs and ncRNAs for each species pair (Fig. 1 and table S1), illustrating strong selective constraint.

To quantify the levels of conservation, we estimated the amount of sequence divergence per unit of evolutionary time. For each of the 191 CNGs, 14 ncRNAs, and 57 CODs (18), we calculated sequence change per million years (D/my) assuming the phylogenetic tree described in (15, 19). Ancestral states were derived with maximum likelihood with the use of PAML3 (20), and inferred substitutions were placed on the branches of the phylogenetic tree to account for all detectable substitution events.
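As a rough illustration of the D/my measure, the sketch below computes sequence change per million years for individual tree branches. It is only a minimal sketch under stated assumptions: the `Branch` record, the normalization of substitutions by the number of aligned sites, the simple mean across branches, and all numbers are hypothetical choices made for this example, not the authors' pipeline or data.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Branch:
    """One branch of the assumed phylogenetic tree (hypothetical record)."""
    name: str            # illustrative branch label
    substitutions: int   # inferred substitutions placed on this branch
    aligned_sites: int   # aligned sites available for this branch
    duration_my: float   # time covered by the branch, in million years


def d_per_my(branch: Branch) -> float:
    """Sequence change per million years for a single branch.

    Assumption: sequence change is expressed as substitutions per aligned
    site before dividing by the branch duration.
    """
    if branch.aligned_sites <= 0 or branch.duration_my <= 0:
        raise ValueError("aligned_sites and duration_my must be positive")
    divergence = branch.substitutions / branch.aligned_sites
    return divergence / branch.duration_my


def mean_d_per_my(branches: Iterable[Branch]) -> float:
    """Unweighted mean of per-branch D/my values (an illustrative summary)."""
    rates = [d_per_my(b) for b in branches]
    if not rates:
        raise ValueError("at least one branch is required")
    return sum(rates) / len(rates)


# Made-up example values, for illustration only.
example_branches = [
    Branch("branch_a", substitutions=5, aligned_sites=1000, duration_my=10.0),
    Branch("branch_b", substitutions=12, aligned_sites=600, duration_my=80.0),
]
example_mean = mean_d_per_my(example_branches)
```

Working per branch rather than per species pair means that each inferred substitution is counted once, on the branch where it is placed, which is what makes rates comparable across branches of very different ages.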
Divergence times were derived from (19). We calculated the sequence change for each tree branch and divided by the number of millions of years each branch covered. Figure 2A shows that CNG-all and CNG-high are significantly more constrained than CODs, ncRNA-all, and ncRNA-high. These observations are not a result of amplification bias, because multiple species sequences for CODs were obtained with stricter criteria (see above) than CNGs and ncRNAs. The low D/my values of CNGs show that they are under a stronger selective pressure than other functional genomic elements. To confirm that the higher substitution rate of CODs is not an artifact of the selection of the CNGs, we performed a

1 Division of Medical Genetics and National Center of Competence in Research (NCCR) Frontiers in Genetics, University of Geneva Medical School and University Hospitals, 1211 Geneva, Switzerland.
2 Institute for Genomic Research (TIGR), Rockville, MD 20850, USA.
*To whom correspondence should be addressed. E-mail: Stylianos.Antonarakis@medecine.unige.ch (S.E.A.); Emmanouil.Dermitzakis@medecine.unige.ch (E.T.D.)
†These authors contributed equally to this work.

REPORTS. www.sciencemag.org SCIENCE VOL 302 7 NOVEMBER 2003 1033