Several lines of research are now converging towards an integrated understanding of mutational mechanisms and their evolutionary implications. Experimentally, crystal structures reveal the effect of sequence context on polymerase fidelity; large-scale sequencing projects generate vast amounts of sequence polymorphism data; and locus-specific databases are being constructed. Computationally, software and analytical tools have been developed to analyze mutational data, to identify mutational hot spots, and to compare the signatures of mutagenic agents.

Addresses
*Laboratory of Computational Genomics, The Rockefeller University, 1230 York Avenue, New York, New York 10021, USA; e-mail: mihaela@genomes.rockefeller.edu
†The Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, New Mexico 87501, USA; e-mail: kepler@santafe.edu
Correspondence: Mihaela Zavolan

Current Opinion in Genetics & Development 2001, 11:612–615

0959-437X/01/$ - see front matter
© 2001 Elsevier Science Ltd. All rights reserved.

Abbreviations
SC stochastic complexity
SNP single nucleotide polymorphism

Introduction
The evolutionary paradigm postulates that the observable phenotypes of organisms result from selection providing directionality to an otherwise unbiased process of genetic variation. Data have been accumulating that suggest that some directionality may be intrinsic in the mutational mechanisms themselves [1], be they extrinsic (environmental mutagens) [2,3] or intrinsic (DNA polymerases) [4]. Adaptive evolution can exploit sequence-dependent mutational biases, for example, during phase variation in bacteria [5].

Mutations feature prominently in human pathology. Germline mutations can cause genetic disease and confer susceptibility to cancers, somatic mutations may initiate malignant transformation, and drug-resistance mutations impede the treatment of infectious disease. In all of these cases, mutational hot spots have been described [6,7•,8].
In this review, we discuss the following: first, recent developments in the analysis of mutational spectra; second, recent studies in which a sequence-dependent effect on mutation rate has been observed; and third, recent studies that provide a link between the biochemistry of DNA replication and observed regularities in mutational spectra.

Analysis of mutational patterns
Mutational patterns are studied at different levels of granularity. Molecular evolution, for example, treats mutations as stochastic events (with some regularities such as transition/transversion bias) that can be used to reconstruct the evolutionary history of biological systems [9]. On a finer-grain level, locus-specific and core mutation databases are constructed [10,11,12•,13] to assist the diagnosis and prognosis of human disease. Finally, molecular studies reveal the effects of specific sequence contexts on polymerase fidelity [4] and on DNA repair [14••]. Ideally, we would like to bridge these levels and understand the molecular basis of the observed distribution of mutations in gene and protein sequences, that is, the mutational spectrum [15,16].

Data for studying mutational spectra: a variety of resources

Highly specialized, project-specific databases
We have constructed, for example, a database of mutations that we inferred to have occurred in human processed pseudogenes, as well as a database of mutations that accumulated in non-selected regions of immunoglobulin genes during somatic hypermutation [17••]. Similar databases have been developed and used by Rogozin et al. [18] for comparing the mutational spectrum of somatic hypermutation with that of polymerase η.

Locus-specific and core databases
Locus-specific [12•,14••] and core databases [13] are being developed and maintained for web access.
Genetic polymorphism data
Notable resources became available recently in the first draft of the human genome [19,20] and the single-nucleotide polymorphism (SNP) data that have been generated as a by-product of large-scale sequencing projects [21,22]. SNP data are already being mined to uncover associations between specific loci in the human genome and disease traits, but they could also be used to study mutational patterns in the human genome. At present, almost three million SNPs are deposited in the dbSNP database of the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/SNP/).

Statistical methods
The problem that we address is that of estimating the relative mutation rate, or ‘mutability’, for a given site in a DNA molecule and testing hypotheses about relationships among the rates at different sites. The factors that influence the intrinsic mutability of a site are the identity (A, G, C or T) of the base at that site; the identities of the bases in the local neighborhood, which we shall refer to as the ‘microsequence context’; the potential for secondary structures resulting from more distant bases; the position of the site relative to relevant markers in the molecule (distance from the centromere, from the telomere); and so on. Here we focus solely on the effect of the microsequence context on mutability. Other studies have addressed the role of genomic heterogeneity on some aspects of mutability [14••].

We start by creating a classification scheme such that every site represented in the data is assigned to exactly one class;

Statistical inference of sequence-dependent mutation rates
Mihaela Zavolan* and Thomas B Kepler†
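
As an illustration only, the idea of assigning every site to exactly one class according to its microsequence context can be sketched in Python. The sketch assumes that a class is the base at the site together with one flanking base on each side; this choice, the function names and the `flank` parameter are hypothetical and are not taken from the review. The estimate it returns is a naive one: the mutation frequency within a class divided by the overall mutation frequency.

```python
from collections import Counter


def context_class(seq, i, flank=1):
    """Return the microsequence context centred on site i.

    Assumption for illustration: the class is the site plus `flank`
    bases on each side. Sites too close to either end have no full
    context and return None.
    """
    if i < flank or i + flank >= len(seq):
        return None
    return seq[i - flank : i + flank + 1]


def relative_mutability(seq, mutated_sites, flank=1):
    """Naive relative mutability per context class.

    For each class: (mutated sites in class / sites in class)
    divided by (all mutated sites / all classified sites).
    Mutated sites without a full context are ignored.
    """
    mutated = set(mutated_sites)
    sites = Counter()  # number of sites falling in each class
    hits = Counter()   # number of mutated sites in each class
    for i in range(len(seq)):
        cls = context_class(seq, i, flank)
        if cls is None:
            continue
        sites[cls] += 1
        if i in mutated:
            hits[cls] += 1

    total_sites = sum(sites.values())
    total_hits = sum(hits.values())
    if total_hits == 0:
        # No classified mutations: every class gets mutability 0.
        return {cls: 0.0 for cls in sites}

    overall = total_hits / total_sites
    return {cls: (hits[cls] / n) / overall for cls, n in sites.items()}
```

Calling `relative_mutability("AGCTAGCTAGCT", [1, 5])` returns one value per trinucleotide class observed in the sequence; values above 1 mark classes that are mutated more often than the average site. With real data, a statistical test would be needed before calling any class a hot spot, because classes with few sites give very noisy ratios.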