The NIEHS Xenopus maternal EST project: interim analysis of the first 13,879 ESTs from unfertilized eggs

Perry J. Blackshear a,*, Wi S. Lai a, Judith M. Thorn a, Elizabeth A. Kennington a, Nickolas G. Staffa Jr. b, D. Troy Moore c, Gerard G. Bouffard d, Stephen M. Beckstrom-Sternberg d, Jeffrey W. Touchman d, Maria de Fatima Bonaldo e, M. Bento Soares e

a Office of Clinical Research and Laboratory of Signal Transduction, National Institute of Environmental Health Sciences, 111 Alexander Drive, Research Triangle Park, NC 27709, USA
b Information Technology Support Services, National Institute of Environmental Health Sciences, 111 Alexander Drive, Research Triangle Park, NC 27709, USA
c Research Genetics, Inc., 2130 Memorial Parkway, Huntsville, AL 35801, USA
d NIH Intramural Sequencing Center and National Human Genome Research Institute, 8717 Grovemont Circle, Gaithersburg, MD 20877, USA
e Departments of Pediatrics and Physiology and Biophysics, The University of Iowa, Iowa City, IA 52242, USA

Received 21 October 2000; received in revised form 23 January 2001; accepted 9 February 2001
Received by J.A. Engler

Abstract

The sequencing of expressed sequence tags (ESTs) from Xenopus laevis has lagged behind efforts on many other comm[...] organisms and man, partly because of the pseudotetraploid nature of the Xenopus genome. Nonetheless, large collections would be useful in gene discovery, oligonucleotide-based knockout studies, gene chip analyses of normal and perturbed development, mapping studies in the related diploid frog X. tropicalis, and for other reasons. We have created a normalized library of cDNAs from unfertilized Xenopus eggs. These cells contain all of the information necessary for the first several cell divisions in the early embryo, as well as much of the information needed for embryonic pattern formation and cell fate determination. To date, we have successfully sequenced 13,879 ESTs out of 16,607 attempts (83.6% success rate), with an average sequence read length of 508 bp.
Using a fragment assembly program, these ESTs were assembled into 8,985 'contigs' comprising up to 11 ESTs each. When these contigs were used [...] available databases, 46.2% bore no relationship to protein or DNA sequences in the database at the significance level of [...]. [...] of a sample of 100 of the assembled contigs revealed that most (~87%) comprised two apparent allelic variants. Ex[...] of 16 of the most prominent contigs showed that 12 exhibited some degree of zygotic expression. These findings have implications for sequence-specific applications for Xenopus ESTs, particularly the use of allele-specific oligonucleotides for knockout studies, [...] hybridization techniques such as gene chip analysis, and the establishment of accurate nomenclature and databases for t[...]. Published by Elsevier Science B.V. All rights reserved.

Keywords: Genomics; Allelic variants; Gene duplication; Sequence tags

1. Introduction

Xenopus laevis has been considered a genetically awkward experimental animal because it '... displays all the features of an ancient tetraploid species that is now completely diploidized' (Graf and Kobel, 1991). In practice, this means that many genes exist as allelic variants, expressing two mRNAs that generally differ by less than 10% at the nucleotide level within the protein coding region (Graf and Kobel, 1991). This is thought to occur in at least half of expressed genes, although the exact proportion is unknown. Nonetheless, Xenopus laevis is a widely used model organism for experimental studies of early development, cell cycle control, protein expression and aquatic toxicology. Xenopus was chosen recently by the NIH as one of five important non-mammalian models of human development and disease. A workshop sponsored by the NIH was held on March 2, 2000, on the topic of 'Identifying the genetic and genomic needs for Xenopus research' (see www.nih.gov/science/models/xenopus/reports/xenopus_report.pdf).
Gene 267 (2001) 71–87
www.elsevier.com/locate/gene
0378-1119/01/$ - see front matter © 2001 Published by Elsevier Science B.V. All rights reserved.
PII: S0378-1119(01)00383-3

Abbreviations: EST, expressed sequence tags; b, base; bp, base pair; TAF, template activating factor; nt, nucleotide; UTR, untranslated region; kb, kilobase pairs; HBGF, heparin binding growth factor; G protein, guanine nucleotide regulatory protein; GCG, Genetics Computer Group

* Corresponding author. Tel.: +1-919-541-4899; fax: +1-919-541-4571.
E-mail address: black009@niehs.nih.gov (P.J. Blackshear).

The