Modeling amino acid substitution patterns in orthologous and paralogous genes

Gavin C. Conant a,*, Günter P. Wagner b, Peter F. Stadler c

a Smurfit Institute of Genetics, Trinity College, University of Dublin, Dublin 2, Ireland
b Department of Ecology and Evolutionary Biology, Yale University, POB 208106, New Haven CT, 06520, USA
c Lehrstuhl für Bioinformatik, Institut für Informatik, Universität Leipzig, Haertelstrasse 16-18, D-04107 Leipzig, Germany

* Corresponding author. Fax: +353 1 679 8558. E-mail address: conantg@tcd.ie (G.C. Conant).

Received 1 February 2006; revised 12 June 2006; accepted 6 July 2006
Available online 26 July 2006

Molecular Phylogenetics and Evolution 42 (2007) 298–307
www.elsevier.com/locate/ympev
1055-7903/$ - see front matter © 2006 Elsevier Inc. All rights reserved.
doi:10.1016/j.ympev.2006.07.006

Abstract

We study to what degree patterns of amino acid substitution vary between genes using two models of protein-coding gene evolution. The first divides the amino acids into groups, with one substitution rate for pairs of residues in the same group and a second for those in differing groups. Unlike previous applications of this model, the groups themselves are estimated from data by simulated annealing. The second model makes substitution rates a function of the physical and chemical similarity between two residues. Because we model the evolution of coding DNA sequences as opposed to protein sequences, artifacts arising from the differing numbers of nucleotide substitutions required to bring about various amino acid substitutions are avoided. Using 10 alignments of related sequences (five of orthologous genes and five gene families), we do find differences in substitution patterns. We also find that, although patterns of amino acid substitution vary temporally within the history of a gene, variation is not greater in paralogous than in orthologous genes. Improved understanding of such gene-specific variation in substitution patterns may have implications for applications such as sequence alignment and phylogenetic inference.

Keywords: Amino acid substitution; Evolutionary models; Protein evolution

Abbreviations used: RRAAS, relative rate of amino acid substitution; SG, similarity groups model; LCAP, linear combination of amino acid properties model.

1. Introduction

One aspect of the evolution of protein-coding genes that is still imperfectly understood is the frequency with which the various amino acid residues exchange with each other over time. It is clear from the differing chemistries of the amino acids and from functional studies (for example, the observation that the proteolytic enzyme trypsin can be converted to the substrate specificity of chymotrypsin by a single amino acid change; Hedstrom et al., 1994) that the different residues are not generally evolutionarily equivalent (Thorne et al., 1996). It may thus be fruitful to pursue models of evolution incorporating varying rates of amino acid substitution. Such models could benefit at least two areas in the study of molecular evolution. First, correctly modeling substitution rates may be helpful for problems such as sequence alignment (where estimates of relative rates of amino acid substitution (RRAAS) provide alignment scores) and phylogenetic inference. Second, understanding these patterns should help us understand the factors that shape and constrain protein evolution.

The most straightforward approach to creating these models is an empirical one: determining substitution frequencies using a representative sample of protein sequences. This approach has been very successful when applied to the problem of sequence alignment. The two estimates in general use are PAM (Dayhoff et al., 1972, 1978) and