Electrophoresis 1998, 19, 469-471

Michael Fonstein, Tatiana Nikolskaya, Yakov Kogan, Robert Haselkorn

Department of Molecular Genetics and Cell Biology, The University of Chicago, Chicago, IL, USA

Genome encyclopedias and their use for comparative analysis of Rhodobacter capsulatus strains

This paper has two components: the use of gene encyclopedias in genomic studies and the Rhodobacter capsulatus genome project. A survey of vectors used for encyclopedia construction includes a brief discussion of their relative advantages and limitations. Projects employing various methods of encyclopedia assembly, including the comparison of restriction patterns, restriction maps, linking by hybridization, oligonucleotide fingerprinting, sequence tagged site (STS) fingerprinting and encyclopedias derived from genetic maps, are listed and briefly described. The R. capsulatus SB 1003 genome project started with the construction of its cosmid encyclopedia, which comprises 192 cosmids covering the chromosome and the 134 kbp plasmid in strain SB 1003, with the exact map coordinates of each cosmid. In a pilot sequencing study, several cosmids were individually subcloned using the vector M13mp18 and merged into one 189 kbp contig. About 160 open reading frames (ORFs) identified by the CodonUse program were subjected to similarity searches. The biological functions of eighty ORFs could be assigned reliably using the WIT (what is there) genome investigation environment. Eighty percent of these recognizable ORFs were organized in functional clusters, which simplified assignment decisions and increased the strength of the predictions. A set of 26 genes for cobalamin biosynthesis, genes for polyhydroxyalkanoic acid metabolism, DNA replication and recombination, and DNA gyrase were among those identified.
Recently, another 1.2 Mbp fragment of the Rhodobacter genome was sequenced using a slightly modified approach. These results, together with some genome investigation tools, have been placed at our web site (http://capsulapedia.uchicago.edu). The sequence of R. capsulatus is expected to be completed by summer 1998. A project to construct a systematic set of deletion strains of R. capsulatus in order to assign functions to unknown ORFs has been started. Preliminary data demonstrate the extreme convenience of the unique gene transfer agent (GTA) system for such work.

Correspondence: Dr. M. Fonstein, Department of Molecular Genetics & Cell Biology, The University of Chicago, 920 East 58 Street, Chicago, IL 60637, USA (Tel: +312-702-1088; Fax: +312-702-3172; E-mail: fons@midway.uchicago.edu)

Abbreviations: MDO, minimal detectable overlap; ORF, open reading frame; STS, sequence tagged site; WIT, what is there

Keywords: Genome encyclopedias / Rhodobacter capsulatus / Sequencing

© WILEY-VCH Verlag GmbH, 69451 Weinheim, 1998. 0173-0835/98/0404-0469 $17.50+.50/0

1 Introduction

The publication of the first physical maps of bacterial genomes in 1987 [1, 2] inaugurated a new subdivision of molecular biology. This subdivision, now called genomics, is clearly defined by its specific subject (the study of integral genome structures, integral genome properties and the evolution of genomes) and by a specific set of tools, such as genome encyclopedias (ordered sets of overlapping clones containing the entire genome), PFGE-related methods, and genome sequencing. The current trend in genomics is characterized by an explosion in the number of bacterial genome sequencing projects. More than forty such projects, listed at http://www.mcs.anl.gov/home/gaasterl/magpie.html, are in progress. Taking recent improvements in automated fluorescent sequencing for granted, one can find most of the newer developments in the area of computation, where different search algorithms have been merged in automated genome investigation environments such as WIT (what is there), developed by Overbeek (http://www.mcs.anl.gov/home/compbio/WIT/wit.html), Magpie [3], or the software used for annotation of the Haemophilus genome [4].

The falling cost of sequencing and the development of computer tools have changed the mentality of gene hunting. The identification of a few genes coding for enzymes with new desirable properties, or expressing essential targets for a new generation of therapeutics, may pay for sequencing the entire genome. These changes have also altered our views about genome encyclopedias. If one limits genomics just to genome sequencing, then a genome encyclopedia, once considered the most complete description of a chromosome, may be seen as an intermediate stage of genome characterization, unavoidable when studying large genomes and somewhat excessive for smaller ones. However, genome encyclopedias remain a vital source of material for functional analysis of genomes and can be an economical yet sufficient solution for structural studies.

2 Genome encyclopedias in bacterial genome studies

Bacterial genomes range in size from 600 kbp for Mycoplasma genitalium [5] to 9.5 Mbp for Myxococcus xanthus [6]. Chromosomes of such sizes can be conveniently explored by a number of physical approaches. There are at least 90 such studies (summarized in [7, 8]), of which fifteen involved genome encyclopedia construction. Most of the terms associated with genome ency-