JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 11, Number 4, 2004
© Mary Ann Liebert, Inc.
Pp. 727–733

Markov-Modulated Markov Chains and the Covarion Process of Molecular Evolution

N. GALTIER ∗ and A. JEAN-MARIE

ABSTRACT

The covarion (or site-specific rate variation, SSRV) process of biological sequence evolution is a process by which the evolutionary rate of a nucleotide/amino acid/codon position can change in time. In this paper, we introduce time-continuous, space-discrete, Markov-modulated Markov chains as a model for representing SSRV processes, generalizing existing theory to any model of rate change. We propose a fast algorithm for diagonalizing the generator matrix of relevant Markov-modulated Markov processes. This algorithm makes phylogeny likelihood calculation tractable even for a large number of rate classes and a large number of states, so that SSRV models become applicable to amino acid or codon sequence datasets. Using this algorithm, we investigate the accuracy of the discrete approximation to the Gamma distribution of evolutionary rates, widely used in molecular phylogeny. We show that a relatively large number of classes is required to achieve accurate approximation of the exact likelihood when the number of analyzed sequences exceeds 20, both under the SSRV and among-site rate variation (ASRV) models.

Key words: Markov-modulated Markov models, sequence evolution, covarion, site-specific rate variation, molecular phylogeny.

INTRODUCTION

Finite state Markov processes are widely used for modeling the evolution of genomic sequences. An ancestral sequence is assumed to evolve down the branches of a phylogenetic tree according to some time-continuous Markov process applying independently at every position of the sequence (henceforth called sites). A site evolves in a discrete set of states of typical size m = 4 (nucleotides A, C, G, and T), 20 (amino acids), or 61 (nonstop codons). A change in the state space is called a substitution.
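As an illustration of the single-site process just described, the following Python sketch builds the generator matrix of a time-continuous Markov chain on the m = 4 nucleotide states and computes the substitution probabilities along a branch of length t as P(t) = exp(Qt). The equal-rates (Jukes-Cantor-style) generator, the function names, and the series-based matrix exponential are assumptions chosen for brevity. They are not the models or the diagonalization algorithm of this paper.

```python
import math

STATES = "ACGT"  # m = 4 nucleotide states


def jc_generator(rate=1.0):
    """Generator Q of an equal-rates process (an illustrative assumption).

    Every off-diagonal entry is rate / (m - 1), so `rate` is the total
    substitution rate out of any state and each row of Q sums to zero.
    """
    m = len(STATES)
    off = rate / (m - 1)
    return [[-rate if i == j else off for j in range(m)] for i in range(m)]


def mat_mul(a, b):
    """Plain product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q, t, squarings=10, terms=12):
    """P(t) = exp(Q t), by a truncated Taylor series and repeated squaring."""
    n = len(q)
    scale = t / (2 ** squarings)
    step = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # holds step^k / k!
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, step)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    # exp(Q t) = (exp(Q t / 2^s))^(2^s)
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


if __name__ == "__main__":
    p = transition_matrix(jc_generator(), t=0.5)
    for state, row in zip(STATES, p):
        print(state, " ".join(f"{x:.4f}" for x in row))
```

Each row of P(t) is a probability distribution over the state occupied at the end of the branch, given the state at its start. For this equal-rates generator the result can be checked against the closed form P_ii(t) = 1/4 + (3/4) exp(-4t/3) for a total rate of 1.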
1 CNRS UMR 5171 - GPIA, Université Montpellier 2-CC 63, Place E. Bataillon, 34095 Montpellier, France.
2 CNRS UMR 5506 - LIRMM, Université Montpellier 2, 161 rue Ada, 34392 Montpellier, France.

Markov models are useful for simulation purposes and for inferences about the history of a set of homologous sequences. Parameters to be estimated typically include (i) the tree topology, (ii) the form of the substitution matrix, (iii) substitution rates (or branch lengths), and (iv) ancestral sequences at internal nodes of the tree. Inferences are usually done in the maximum likelihood framework. An important feature of Markov models of sequence evolution is the way they represent the distribution of substitution rate across sites. The earliest models assumed equal rates among sites. This is not realistic: