Perspective

A Common Language for Physical Mapping of the Human Genome

MAYNARD OLSON, LEROY HOOD, CHARLES CANTOR, DAVID BOTSTEIN

In a report issued in January 1988, the National Research Council (NRC) Committee on the Mapping and Sequencing of the Human Genome, on which the present authors served, recommended a staged mapping and sequencing project with early emphases on physical mapping of human DNA, mapping and sequencing of the genomes of model organisms, and the development of sequencing technology (1). As the Committee’s recommendations on physical mapping are beginning to be implemented on a substantial scale, it is timely to review these recommendations in the light of recent technical advances. In particular, the polymerase chain reaction (PCR) (2), a method that has only come into widespread use during the past 2 years, seems to us to offer a path toward a physical map that largely circumvents two problems that were prominent in the NRC Committee’s discussions. One of these was the difficulty of merging mapping data gathered by diverse methods in different laboratories into a consensus physical map. The second was the logistics and expense of managing the huge collections of cloned segments on which the mapping data would depend almost absolutely.

By allowing short DNA sequences to be detected easily with high specificity and sensitivity, PCR makes practical the use of DNA sequence itself to define the basic landmarks on the physical map. We advocate the use of short tracts of single-copy DNA sequence (that is, sequences that occur only once in the genome) that can be easily recovered at any time by PCR as the landmarks that define position on the physical map. Construction of a physical map would then be seen as the determination of the order and spacing of DNA segments, each of which is identified uniquely by such a sequence.
This will solve the problem of merging data from many sources, eliminate the need for large clone archives, and define a physical map that can evolve smoothly and naturally toward the ultimate goal of a complete DNA sequence of the human genome.

Physical mapping: A hybrid technology. The physical map of the human genome envisioned by the NRC report as the precursor of sequencing was a hybrid of a “restriction map” and a “contig map.” Following the paradigm introduced by Nathans in the early 1970s for the case of SV40, restriction maps show the order and distances between cleavage sites of site-specific restriction endonucleases (3). This type of mapping has been extended to much larger genomes, such as that of Escherichia coli, by exploiting the ability to separate very large restriction fragments with pulsed-field gel electrophoresis (4). Contig maps represent the structure of contiguous regions of the genome by specifying the overlap relationships among a set of clones (5). Contig maps are dependent on the continuing existence of a particular underlying clone collection; the generation and most uses of these maps depend on detailed analysis of individual clones.

Hybrid maps draw on the complementary strengths of restriction maps and contig maps. Pure restriction maps are difficult to construct, primarily because the sites for the most suitable enzymes are distributed nonrandomly and are sometimes blocked by the action of methylation systems that covalently modify DNA in vivo. Furthermore, restriction maps fail to address the need of most map users for ready access to the cloned DNA. Pure contig maps are also difficult to construct because these maps lose continuity at any point where clones are unavailable or overlap relationships are unclear. Indeed, extrapolation from past experience suggests that a contig map of a human chromosome of average size would be likely to contain between 200 and 1000 gaps. In a hybrid map, restriction maps based on the direct analysis of uncloned DNA (as well as data from other low-resolution mapping sources such as linkage mapping, cytogenetics, and somatic cell genetics) are used to orient and align a series of contigs. In favorable cases, the resultant maps have good long-range continuity and are supported by clone collections that cover a high fraction of the mapped region.

Sequence-tagged sites (STSs) will enhance the hybrid mapping strategy. The present proposal is not an alternative to the strategy described for mapping the human genome: the STS proposal redefines the end product, and is not itself a new mapping method. The idea would be to “translate” all types of mapping landmarks into the common language of STSs. Virtually any useful mapping method uses cloned DNA segments as landmarks, regardless of whether they are members of contigs, segments that contain an unusual restriction site, probes that detect genetically mapped DNA polymorphisms, or sequences that hybridize in situ to particular cytogenetic bands. In practice, the translation of any of these examples to produce an STS would simply require sequencing a short tract of DNA from the clone that defines the landmark. In most instances, 200 to 500 bp of sequence define an STS that is operationally unique in the human genome (that is, can be specifically detected via PCR in the presence of all other genomic sequences). A PCR assay for an STS could be implemented simply by synthesizing two short (~20 nucleotides) oligodeoxynucleotides, chosen to be complementary to opposite strands and opposite ends of the sequence tract. A DNA sample would be tested for the presence of the sequence by testing its capacity to serve as a template for the in vitro synthesis of the tract in the presence of these two oligodeoxynucleotide “primers.” The procedure involves many automated cycles of DNA synthesis in a standard laboratory thermocycler; consequently, when the assay is positive, such large amounts of product are made that it can be detected without radioactive labeling.

The overwhelming advantage of STSs over mapping landmarks defined in other ways is that the means of testing for the presence of a particular STS can be completely described as information in a database. No access to the biological materials that led to the definition or mapping of an STS is required by a scientist wishing to assay a DNA sample for its presence. An entry in the STS database would not only include raw sequence data on which a PCR-based STS assay could be based, but also would include detailed instructions for implementing a well-tested PCR assay. From such information alone, the assay could be implemented by any laboratory within 24 hours.

M. Olson is a professor of genetics, Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110. L. Hood is director, NSF Science and Technology Center for Biotechnology, Division of Biology, California Institute of Technology, Pasadena, CA 91125. C. Cantor is the director, Human Genome Center, ...
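The claim that an STS assay can be described entirely as information may be easier to see in a small sketch. The Python below is an illustration under stated assumptions, not a description of any actual STS database or primer-design procedure: the `STSRecord` class, its field names, and the `sts_product` function are hypothetical, the primers are taken naively as the first and last 20 nucleotides of the tract, and "detection" is modeled as exact string matching rather than real PCR chemistry.

```python
from dataclasses import dataclass
from typing import Optional

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass(frozen=True)
class STSRecord:
    """Hypothetical database entry for one sequence-tagged site."""
    name: str             # illustrative label, not a real accession number
    tract: str            # sequenced single-copy tract, 5'->3', top strand
    primer_len: int = 20  # assumed primer length (~20 nucleotides)

    @property
    def forward_primer(self) -> str:
        # Same sequence as the start of the top strand,
        # so it anneals to the bottom strand.
        return self.tract[: self.primer_len]

    @property
    def reverse_primer(self) -> str:
        # Reverse complement of the end of the top strand,
        # so it anneals to the top strand at the opposite end.
        return reverse_complement(self.tract[-self.primer_len:])


def sts_product(sample: str, sts: STSRecord) -> Optional[str]:
    """Predict the amplified tract if `sample` contains the STS, else None.

    Both strands of the sample are searched. Exact matches only:
    mismatches, melting temperature, and repeats are ignored.
    """
    sample = sample.upper()
    # On the strand that carries the forward primer sequence, the reverse
    # primer's binding site reads as the primer's reverse complement.
    site = reverse_complement(sts.reverse_primer)
    for strand in (sample, reverse_complement(sample)):
        start = strand.find(sts.forward_primer)
        if start == -1:
            continue
        end = strand.find(site, start + sts.primer_len)
        if end != -1:
            return strand[start: end + sts.primer_len]
    return None


if __name__ == "__main__":
    # Made-up 60-nucleotide tract, for demonstration only.
    demo = STSRecord(
        name="demo-STS",
        tract="ACGTACGGTCAGCTTAGCCA" "TTGACCGTAAGGCTCTAGAT" "GGCATCCGATTAGCAGTCAA",
    )
    print(demo.forward_primer, demo.reverse_primer)
    print(sts_product("TTTT" + demo.tract + "GGGG", demo) is not None)
```

Everything a second laboratory would need in this toy model is carried by the record itself (the tract and the primer length), which mirrors the argument that no clone or other biological material has to change hands. A realistic entry would also carry the tested reaction conditions that the text mentions.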