GENOMICS Vol. 77, Numbers 1–2, September 2001. Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved. 0888-7543/01 $35.00
Article doi:10.1006/geno.2001.6620, available online at http://www.idealibrary.com on IDEAL

Assessment of the Total Number of Human Transcription Units

Manjula Das,1,* Christopher B. Burge,3,* Eunhee Park,1 Juliette Colinas,1 and Jerry Pelletier1,2,†

1 Department of Biochemistry and 2 McGill Cancer Center, McGill University, Rm 810, 3655 Drummond St., Montreal, Quebec, Canada, H3G 1Y6
3 Department of Biology, Massachusetts Institute of Technology, 77 Massachusetts Ave., 68-222, Cambridge, Massachusetts 02139, USA

*These authors contributed equally to this work.
† To whom correspondence and reprint requests should be addressed. Fax: (514) 398-2965. E-mail: Jerry@med.mcgill.ca.

Variation in the estimates of the number of genes encoded by the human genome (28,000–120,000) attests to the difficulty of systematically identifying human genes. Sequencing of human chromosome 22 (Chr22) provided the first comprehensive, unbiased view of an entire human chromosome, and intensive analysis of this sequence identified 545 genes and 134 pseudogenes that had similarity or identity to known proteins and/or ESTs and which were listed in the gene annotation (http://www.sanger.ac.uk/HGP/Chr22). This analysis yielded an estimate of approximately 36,000 functional expressed genes in the human genome (and 9000 pseudogenes). However, a key uncertainty in this estimate was that hundreds of additional genes beyond those annotated in the Chr22 sequence are predicted by the gene prediction program Genscan, an unknown number of which might represent additional expressed genes. To determine what fraction of these “predicted novel genes” (PNGs) represents expressed human genes, we used a sensitive RT-PCR assay to detect predicted transcripts in 17 tissues and one cell line. Our results indicate that at least 5000–9000 additional human genes which lack similarity to known genes or proteins exist in the human genome, increasing baseline gene estimates to ~41,000–45,000.

Key words: human chromosome 22, gene prediction, expressed genes

INTRODUCTION

Although a rough draft of the human genome sequence has been completed [1,2], there is still considerable uncertainty about how many protein coding genes this sequence contains, with three recent estimates putting the number at 35,000 [3], 28,000–34,000 [4], and 120,000 [5]. Two classes of computational methods for aiding in gene identification are in widespread use: similarity-based methods such as BLAST [6], which are used to compare segments of genomic DNA with known genes, proteins, or expressed sequence tags (ESTs); and ab initio gene finding programs such as Genscan [7] and Fgenes [8], which predict gene structure on the basis of statistical models of exon–intron and splice signal composition (but without using sequence similarity information) [reviewed in 9]. In addition, a number of experimental methods have been developed, including EST sequencing, hybrid selection of cDNAs from various tissues to immobilized genomic DNA [10], serial analysis of gene expression [11,12], differential display [13], and exon trapping [14]. Of these, EST sequencing has made by far the largest contribution to the identification of novel human genes, providing an invaluable resource for human genetics.

It is widely recognized, however, that EST databases contain a significant fraction of artifact sequences such as intronic or intergenic DNA. This fraction has been estimated to be on the order of 5–10% [15,16] and likely arises from DNA contamination and/or the presence of a significant amount (~7%) of heterogeneous nuclear RNA (hnRNA) in RNA preparations used to generate EST libraries [17].
The rate of novel EST discovery has also fallen in recent years, from an estimated 10.6% in 1996 to 2.7% in 1998 [18], and even though over two million human ESTs are present in the public databases, recent estimates are that only about 80% of human genes are represented in this set [3]. The remaining 20% presumably represent tissue-limited or low-abundance transcripts that are unlikely to be sampled by EST sequencing, a method strongly biased toward highly expressed genes. Thus methods less biased by expression level are needed to complete our knowledge of the human gene set.

The sequencing of human chromosome 22 (Chr22) [19] provided the first comprehensive, unbiased view of an entire