Several lines of research are now converging towards an
integrated understanding of mutational mechanisms and their
evolutionary implications. Experimentally, crystal structures reveal
the effect of sequence context on polymerase fidelity; large-scale
sequencing projects generate vast amounts of sequence
polymorphism data; and locus-specific databases are being
constructed. Computationally, software and analytical tools have
been developed to analyze mutational data, to identify mutational
hot spots, and to compare the signatures of mutagenic agents.
Addresses
*Laboratory of Computational Genomics, The Rockefeller University,
1230 York Avenue, New York, New York 10021, USA;
e-mail: mihaela@genomes.rockefeller.edu
†The Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, New Mexico
87501, USA; e-mail: kepler@santafe.edu
Correspondence: Mihaela Zavolan
Current Opinion in Genetics & Development 2001, 11:612–615
0959-437X/01/$ — see front matter
© 2001 Elsevier Science Ltd. All rights reserved.
Abbreviations
SC stochastic complexity
SNP single nucleotide polymorphism
Introduction
The evolutionary paradigm postulates that the observable
phenotypes of organisms result from selection providing
directionality to an otherwise unbiased process of genetic
variation. Accumulating data suggest that
some directionality may be inherent in the mutational
mechanisms themselves [1], be they extrinsic (environmental
mutagens) [2,3] or intrinsic (DNA polymerases) [4]. Adaptive
evolution can exploit sequence-dependent mutational
biases, for example, during phase variation in bacteria [5].
Mutations feature prominently in human pathology.
Germline mutations can cause genetic disease and confer
susceptibility to cancers, somatic mutations may initiate
malignant transformation, and drug-resistance mutations
impede the treatment of infectious disease. In all of these
cases, mutational hot spots have been described [6,7•,8]. In
this review, we discuss the following: first, recent developments
in the analysis of mutational spectra; second, recent
studies in which a sequence-dependent effect on mutation
rate has been observed; and third, recent studies that
provide a link between the biochemistry of DNA
replication and observed regularities in mutational spectra.
Analysis of mutational patterns
Mutational patterns are studied at different levels of
granularity. Molecular evolution, for example, treats mutations
as stochastic events (with some regularities, such as
transition/transversion bias) that can be used to reconstruct
the evolutionary history of biological systems [9]. At
a finer-grained level, locus-specific and core mutation databases
are constructed [10,11,12•,13] to assist the diagnosis and
prognosis of human disease. Finally, molecular studies
reveal the effects of specific sequence contexts on polymerase
fidelity [4] and on DNA repair [14••]. Ideally, we
would like to bridge these levels and understand the molecular
basis of the observed distribution of mutations in gene
and protein sequences, that is, the mutational spectrum [15,16].
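As a small illustration of the transition/transversion regularity mentioned above, the following sketch classifies single-base substitutions and computes their ratio. The function names and the input format (a list of reference/alternate base pairs) are assumptions made for this example and are not taken from any of the cited tools.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def substitution_type(ref, alt):
    """Classify a single-base substitution as a transition or a transversion.

    A transition exchanges purine for purine or pyrimidine for pyrimidine;
    a transversion exchanges a purine for a pyrimidine or vice versa.
    """
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError(f"not a valid substitution: {ref!r} -> {alt!r}")
    same_chemical_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_chemical_class else "transversion"


def ts_tv_ratio(substitutions):
    """Return the transition/transversion ratio for (ref, alt) pairs."""
    counts = Counter(substitution_type(r, a) for r, a in substitutions)
    if counts["transversion"] == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return counts["transition"] / counts["transversion"]


# Hypothetical toy data: three transitions and one transversion.
example = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")]
ratio = ts_tv_ratio(example)
```

With unbiased substitution, transversions would outnumber transitions two to one, because each base has one possible transition and two possible transversions. A ratio well above 0.5 therefore indicates a transition bias.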
Data for studying mutational spectra: a variety of resources
Highly specialized, project-specific databases
We have constructed, for example, a database of mutations
that we inferred to have occurred in human processed
pseudogenes, as well as a database of mutations that accumulated
in non-selected regions of immunoglobulin genes
during somatic hypermutation [17••]. Similar databases
have been developed and used by Rogozin et al. [18] for
comparing the mutational spectrum of somatic hypermutation
with that of polymerase η.
Locus-specific and core databases
Locus-specific [12•,14••] and core databases [13] are being
developed and maintained for web access.
Genetic polymorphism data
Notable resources have recently become available with the first draft of
the human genome [19,20] and the single-nucleotide polymorphism
(SNP) data that have been generated as a by-product of
large-scale sequencing projects [21,22]. SNP data are already
being mined to uncover associations between specific loci in
the human genome and disease traits, but they could also be
used to study mutational patterns in the human genome. At
present, almost three million SNPs are deposited in the dbSNP
database of the National Center for Biotechnology Information
(http://www.ncbi.nlm.nih.gov/SNP/).
Statistical methods
The problem that we address is that of estimating the
relative mutation rate, or ‘mutability’, for a given site in a
DNA molecule and testing hypotheses about relationships
among the rates at different sites. The factors that influence
the intrinsic mutability of a site include the identity (A, G,
C or T) of the base at that site; the identities of the bases
in the local neighborhood, which we shall refer to as the
‘microsequence context’; the potential for secondary structures
resulting from more distant bases; and the position of the
site relative to relevant markers in the molecule (for example,
its distance from the centromere or the telomere). Here
we focus solely on the effect of the microsequence context
on mutability. Other studies have addressed the role of genomic
heterogeneity in some aspects of mutability [14••].
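To make the notion of context-dependent mutability concrete, the following minimal sketch groups sites by their microsequence context and computes a naive relative mutability for each context: the mutations per site in that context, divided by the overall mutations per site. This is an illustrative simplification, not the statistical procedure developed in this review. The function names, the symmetric `flank` parameter, and the input format (a sequence plus a list of 0-based mutated positions) are all assumptions made for the example.

```python
from collections import Counter


def context_classes(seq, flank=1):
    """Yield (position, context) for every site that has complete flanks.

    The context is the site's base plus `flank` bases on each side, so each
    eligible site is assigned to exactly one context class.
    """
    for i in range(flank, len(seq) - flank):
        yield i, seq[i - flank:i + flank + 1]


def relative_mutability(seq, mutated_positions, flank=1):
    """Estimate the relative mutability of each microsequence context.

    `mutated_positions` lists the 0-based positions of observed mutations
    (a position may appear more than once). Mutations at sites without
    complete flanks are ignored.
    """
    mutated = Counter(mutated_positions)
    occurrences = Counter()
    mutations = Counter()
    for i, ctx in context_classes(seq, flank):
        occurrences[ctx] += 1
        mutations[ctx] += mutated[i]  # a Counter returns 0 for unseen keys

    total_mutations = sum(mutations.values())
    if total_mutations == 0:
        raise ValueError("no mutations at sites with complete flanks")
    overall_rate = total_mutations / sum(occurrences.values())

    return {
        ctx: (mutations[ctx] / occurrences[ctx]) / overall_rate
        for ctx in occurrences
    }


# Hypothetical toy data: a short sequence and four observed mutations.
rates = relative_mutability("ACGTACGTAC", [1, 5, 5, 2])
```

A value above 1 marks a context that mutates more often than the average site, and a value below 1 marks a context that mutates less often. With real data the counts per class are often small, so raw ratios of this kind are noisy and call for a formal statistical treatment.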
We start by creating a classification scheme such that every
site represented in the data is assigned to exactly one class;
Statistical inference of sequence-dependent mutation rates
Mihaela Zavolan* and Thomas B Kepler†