Expressed sequence tags - ESTablishing bridges between genomes

MARCO A. MARRA, LADEANA HILLIER AND ROBERT H. WATERSTON

mmarra@alu.wustl.edu • lhillier@alu.wustl.edu • bwaterst@alu.wustl.edu

WASHINGTON UNIVERSITY GENOME SEQUENCING CENTER, 4444 FOREST PARK BOULEVARD, ST LOUIS, MO 63108, USA.
On 1 August 1997, US Vice President Gore officially announced the creation of a new World Wide Web database¹ which aims to provide powerful new resources to researchers investigating the molecular basis of cancer. The publicly accessible website is serving to disseminate data generated by the National Cancer Institute’s Cancer Genome Anatomy Project (CGAP), which is intended to drive the detailed molecular characterization of precancerous and malignant cells. A first objective in the CGAP is the compilation of a catalogue, or ‘index’, of genes that are expressed in these cells. But what sort of gene-identification data are being collected to form the backbone of this gene-expression index? The answer to this question is expressed sequence tags (ESTs), which are DNA sequences read from the ends of cDNA molecules. Sequences are to be generated at a rate of 10 000 per week from a large number of cDNA libraries (for a list of libraries see Ref. 1).

EST projects past and present

The CGAP has joined a growing list of large-scale efforts that have employed EST approaches for gene identification purposes. EST projects have their roots in the early 1980s, when it was recognized that short stretches of DNA sequence from cDNAs could be used to identify genes². Earlier this decade, scientists at The Institute for Genomic Research (TIGR)³ were among the first to generate EST data on a massive scale⁴,⁵. Although access to these data was initially subject to restrictions, TIGR has announced⁶ the public release of more than 100 000 ESTs to the NCBI-maintained database ‘dbEST’⁷. In addition, 63 000 ‘tentative human consensus sequences’ (THCs), each created by assembling available ESTs into longer stretches of sequence, will be available at the TIGR website. Another large project, The Genexpress Index⁸, has contributed more than 25 000 predominantly brain- and muscle-derived sequences to dbEST.
Among the largest projects conducted entirely in the public domain are an effort funded by Merck and Company, which has deposited more than 528 000 human ESTs into dbEST⁹,¹⁰, and one funded by the Howard Hughes Medical Institute, which has produced 216 000 mouse ESTs (Ref. 11) on the way to a target of 300 000 sequences by the end of 1998. A hallmark of these endeavours, carried out by a collaboration between Washington University Genome Sequencing Center and members of the IMAGE (Integrated Molecular Analysis of Gene Expression) consortium¹²,¹³, has been the rapid deposition of the sequences into the public domain and the concomitant availability of the sequence-tagged cDNA clones from several distributors (Fig. 1). These features have ensured that the data are broadly accessible and, therefore, immediately useful in a wide variety of research contexts.

EST projects are being conducted on a diverse collection of organisms. dbEST contains over 1.2 million sequences, generated from the projects described above and other efforts focused on Caenorhabditis elegans¹⁴–¹⁶ and other nematodes¹⁷,¹⁸, the plants Arabidopsis thaliana¹⁹,²⁰ and rice²¹, Drosophila melanogaster²², and the protozoan parasite Toxoplasma gondii¹⁸. These projects have demonstrated that, with the necessary laboratory infrastructure, inexpensive ESTs can be generated in large numbers.

TIG JANUARY 1998 VOL. 14 No. 1. Copyright © 1998 Elsevier Science Ltd. All rights reserved. 0168-9525/98/$19.00 PII: S0168-9525(97)01355-3