JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 11, Number 4, 2004
© Mary Ann Liebert, Inc.
Pp. 727–733
Markov-Modulated Markov Chains and the Covarion Process of Molecular Evolution
N. GALTIER∗ and A. JEAN-MARIE
ABSTRACT
The covarion (or site-specific rate variation, SSRV) process of biological sequence evolution
is a process by which the evolutionary rate of a nucleotide, amino acid, or codon position
can change over time. In this paper, we introduce time-continuous, space-discrete,
Markov-modulated Markov chains as a model for representing SSRV processes, generalizing
existing theory to any model of rate change. We propose a fast algorithm for diagonalizing the
generator matrix of the relevant Markov-modulated Markov processes. This algorithm makes
phylogenetic likelihood calculation tractable even for a large number of rate classes and a large
number of states, so that SSRV models become applicable to amino acid or codon sequence
datasets. Using this algorithm, we investigate the accuracy of the discrete approximation
to the Gamma distribution of evolutionary rates, widely used in molecular phylogeny. We
show that a relatively large number of classes is required to achieve an accurate approximation
of the exact likelihood when the number of analyzed sequences exceeds 20, under both the
SSRV and among-site rate variation (ASRV) models.
Key words: Markov-modulated Markov models, sequence evolution, covarion, site-specific rate
variation, molecular phylogeny.
INTRODUCTION
Finite-state Markov processes are widely used for modeling the evolution of genomic sequences.
An ancestral sequence is assumed to evolve down the branches of a phylogenetic tree according to
some time-continuous Markov process applying independently at every position of the sequence (positions
are henceforth called sites). Each site takes its value in a discrete set of states, typically of size m = 4
(nucleotides A, C, G, and T), 20 (amino acids), or 61 (nonstop codons). A change of state at a site is called
a substitution. Markov models are useful for simulation purposes and for inferences about the history of a
set of homologous sequences. Parameters to be estimated typically include (i) the tree topology, (ii) the form
of the substitution matrix, (iii) substitution rates (or branch lengths), and (iv) ancestral sequences at
internal nodes of the tree. Inferences are usually made in the maximum likelihood framework.
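As an illustration of the kind of site-wise process described above, the following minimal sketch computes the transition probabilities P(t) = exp(Qt) of a time-continuous Markov chain on the m = 4 nucleotide states. The choice of the Jukes-Cantor model (all substitution rates equal to a single value alpha) is an assumption made here for simplicity and is not taken from this paper; the function names are hypothetical. The matrix exponential is evaluated by a truncated Taylor series and can be compared with the known closed form for this particular model.

```python
import math

STATES = "ACGT"  # m = 4 nucleotide states


def jc69_generator(alpha=1.0):
    """Generator matrix Q: every off-diagonal rate is alpha (assumed model)."""
    m = len(STATES)
    return [[-(m - 1) * alpha if i == j else alpha for j in range(m)]
            for i in range(m)]


def mat_mul(a, b):
    """Plain square-matrix product on nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q, t, terms=60):
    """P(t) = exp(Q t), approximated by the Taylor series sum_k (Q t)^k / k!."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # current series term, starts at identity
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, qt)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


def jc69_closed_form(t, alpha=1.0):
    """Closed-form transition probabilities for the equal-rates model."""
    e = math.exp(-4.0 * alpha * t)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


if __name__ == "__main__":
    p = transition_matrix(jc69_generator(alpha=1.0), t=0.3)
    for state, row in zip(STATES, p):
        print(state, [round(x, 4) for x in row])
```

Each row of P(t) gives the probabilities of observing A, C, G, or T at a site after time t, given the starting state; in a likelihood calculation such a matrix would be evaluated once per branch of the tree. The Taylor series is used only to keep the sketch dependency-free; it is adequate for small values of t but is not a recommended general-purpose method.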
An important feature of Markov models of sequence evolution is the way they represent the distribution
of substitution rates across sites. The earliest models assumed equal rates among sites. This is not realistic:
1 CNRS UMR 5171-GPIA, Université Montpellier 2-CC 63, Place E. Bataillon, 34095 Montpellier, France.
2 CNRS UMR 5506-LIRMM, Université Montpellier 2, 161 rue Ada, 34392 Montpellier, France.