The Impact of Multiple Protein Sequence Alignment on Phylogenetic Estimation

Li-San Wang, Jim Leebens-Mack, P. Kerr Wall, Kevin Beckmann, Claude W. dePamphilis, and Tandy Warnow

Abstract—Multiple sequence alignment is typically the first step in estimating phylogenetic trees, with the assumption being that as alignments improve, so will phylogenetic reconstructions. Over the last decade or so, new multiple sequence alignment methods have been developed to improve comparative analyses of protein structure, but these new methods have not been typically used in phylogenetic analyses. In this paper, we report on a simulation study that we performed to evaluate the consequences of using these new multiple sequence alignment methods in terms of the resultant phylogenetic reconstruction. We find that while alignment accuracy is positively correlated with phylogenetic accuracy, the amount of improvement in phylogenetic estimation that results from an improved alignment can range from quite small to substantial. We observe that phylogenetic accuracy is most highly correlated with alignment accuracy when sequences are most difficult to align, and that variation in alignment accuracy can have little impact on phylogenetic accuracy when alignment error rates are generally low. We discuss these observations and implications for future work.

Index Terms—Simulation, biology and genetics, multiple protein sequence alignment, phylogeny reconstruction.

1 INTRODUCTION

Multiple sequence alignment (MSA) is an important computational problem that is fundamental to all sequence-based comparative analyses. In particular, applications of alignment estimation to problems in protein sequence analysis (structure, function, and subfamily identification) have led to many new protein alignment methods.
Structural alignment databases, such as STAMP [1], SCOP [2], [3], HOMSTRAND [4], BaliBASE [5], [6], [7], and PREFAB [8], have been compiled to serve as benchmarks for developing and refining MSA methods. Reference alignments are also commonly derived through simulations of sequence evolution [9], [10], [11], [12]. Comparative studies based upon these benchmarks and simulations have concluded that many of the newer MSA methods, especially those designed for protein alignment (MAFFT [13], ProbCons [14], T-Coffee [15], Di-Align [16], Opal [17], Prank [18], AMAP [19], ProbAlign [20], and others), are substantially better than earlier ones including Clustal [21], in that they achieve higher alignment accuracy scores on these benchmarks. Among these newer methods, MAFFT and ProbCons are consistently among the best performing [22], [23], [24], [25].

These studies have implications for phylogenetic analyses, as the common wisdom is that alignment strategy can impact phylogeny estimation. For example, Morrison and Ellis [26] found that the choice of multiple sequence alignment method had a greater impact on the resultant phylogeny than the choice of phylogeny estimation method, and Wong et al. [27] showed that with high sequence divergence, different alignment strategies can produce alignments that yield conflicting gene trees.

Studies of the impact of alignment estimation on phylogeny estimation have also been performed using simulations of sequence evolution that include insertions and deletions (jointly referred to as “indels”) as well as substitutions [20], [28], [29], [30]. These studies have generally (but not always) shown that alignment estimation can have an impact on phylogeny estimation, and have suggested that under some circumstances, alignment accuracy correlates with phylogenetic accuracy. Unfortunately, each of these studies evaluated a limited set of MSA methods, phylogeny reconstruction methods, and models of sequence evolution.
For example, Roshan and Livesay [20], [31] aligned simulated DNA sequences using six MSA methods (ClustalW [21], MUSCLE [8], and four methods developed by the authors that use MUSCLE in different ways) and performed maximum parsimony analyses. Their main objective was to see if their new MSA methods provided an improvement in phylogenetic estimation when followed by a maximum parsimony analysis. Their simulation study (performed on large model trees containing from 100 to 400 taxa) showed that their new methods provided a very modest improvement (about one percent) in phylogenetic accuracy, as measured using the Robinson-Foulds score [32], over the best of the other alignment methods. They also observed that

1108 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. 4, JULY/AUGUST 2011

• L.-S. Wang is with the Department of Pathology and Laboratory Medicine and Penn Center for Bioinformatics, 1424 Blockley Hall, 423 Guardian Drive, University of Pennsylvania, Philadelphia, PA 19104. E-mail: lswang@mail.med.upenn.edu.
• J. Leebens-Mack is with the Department of Plant Biology, University of Georgia, Athens, GA 30602. E-mail: jleebensmack@plantbio.uga.edu.
• P.K. Wall is with BASF Plant Science, L.L.C., 26 Davis Drive, Research Triangle Park, NC 27709. E-mail: kerr.wall@basf.com.
• K. Beckmann and C.W. dePamphilis are with the Department of Biology, the Huck Institutes of the Life Sciences, and the Institute of Molecular Evolutionary Genetics, the Pennsylvania State University. E-mail: phage9@gmail.com, cwd3@psu.edu.
• T. Warnow is with the Department of Computer Science, One University Station, STOP C0500, the University of Texas at Austin, Austin, TX 78712. E-mail: tandy@cs.utexas.edu.

Manuscript received 29 Oct. 2008; revised 3 Mar. 2009; accepted 30 July 2009; published online 1 Sept. 2009.
For information on obtaining reprints of this article, please send e-mail to: tcbb@computer.org, and reference IEEECS Log Number TCBB-2008-10-0186.
Digital Object Identifier no. 10.1109/TCBB.2009.68.
1545-5963/11/$26.00 © 2011 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM