Detecting possibly saturated positions in 18S and 28S sequences and their influence on phylogenetic reconstruction of Annelida (Lophotrochozoa)

Torsten H. Struck a,b,*, Maximilian P. Nesnidal b, Günter Purschke b, Kenneth M. Halanych a

a Auburn University, 101 Rouse Building, Auburn, AL 36849, USA
b Universität Osnabrück, FB05 Biologie/Chemie, AG Zoologie, Barbarastraße 11, D-49076 Osnabrück, Germany

Article info
Article history: Received 11 November 2007; Revised 25 April 2008; Accepted 13 May 2008; Available online 20 May 2008
Keywords: C factor; O/E ratio; Annelida; Saturation; rRNA; Phylogeny; Iss

Abstract
Phylogenetic reconstructions may be hampered by multiple substitutions in nucleotide positions obliterating signal, a phenomenon called saturation. Traditionally, plotting ti/tv ratios against genetic distances has been used to reveal saturation by assessing when ti/tv stabilizes at 1. However, interpretation of results and assessment of comparability between different data sets or partitions are rather subjective. Herein, we present the new C factor, which quantifies convergence of ti/tv ratios, thus allowing comparability. Furthermore, we introduce a comparative value for homoplasy, the O/E ratio, based on alterations of tree length. Simulation studies and an empirical example, based on annelid rRNA-gene sequences, show that the C factor correlates with noise, tree length and genetic distance and therefore is a proxy for saturation. The O/E ratio correlates with the C factor, which does not provide an intrinsic threshold of exclusion, and thus both together can objectively guide decisions to exclude saturated nucleotide positions. However, analyses also showed that, for reconstructing annelid phylogeny using Maximum Likelihood, an increase in numbers of positions improves tree reconstruction more than does the exclusion of saturated positions.

© 2008 Elsevier Inc. All rights reserved.

1. Introduction

Reliable reconstruction of phylogenies using molecular data is affected by several factors such as long branches, heterogeneity of base frequencies, and rate heterogeneity among positions (e.g., Kuhner and Felsenstein, 1994; Lake, 1994; Lockhart et al., 1994; Xia et al., 2003). All of these factors can result in homoplasy, which can decrease phylogenetic signal and hamper reconstruction efforts. An indicator of homoplasy in molecular data sets can be saturation of transition events relative to other mutations (e.g., Halanych and Robinson, 1999; Jördens et al., 2004; Lopez et al., 1999; Nickrent et al., 2000; Philippe and Forterre, 1999; Simon et al., 1994; Struck et al., 2002a; Swofford et al., 1996). Saturation is invoked when no further increase in transitions can be observed despite increasing genetic distance, indicating that multiple substitutions at nucleotide positions have occurred. Therefore, when a data set is saturated, phylogenetic reconstruction may be misled by homoplasious signal (e.g., Milinkovitch et al., 1996; Simon et al., 1994).

Saturation is especially problematic for taxa that radiated rapidly long ago. For example, the radiation of annelid worms within Lophotrochozoa apparently was fast and resulted in branching patterns with short internodes in the basal part of the annelid tree and with branch lengths rapidly increasing towards the tips. Such branching patterns accumulate only a few informative substitutions along the internal and more basal branches, but numerous homoplasious substitutions on terminal branches (Struck et al., 2002a). For example, in Struck et al.’s (2007) analyses of three nuclear genes, the ratio of terminal branch length to internal branch length is 4.2/1 even excluding the extremely long branch leading to the highly divergent taxon, Ophryotrocha labronica. Thus, phylogenetic information is likely to be eroded due to multiple substitutions, or homoplasy, along the terminal branches.
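The quantities behind this notion of saturation (transitions, transversions and genetic distance for each pair of taxa) can be illustrated with a minimal sketch. It is not the procedure used in this study: the function names `pairwise_stats` and `saturation_points` and the toy alignment are hypothetical, the distance is the uncorrected proportion of differing sites, and positions with gaps or ambiguity codes are simply skipped, which is an assumption.

```python
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
NUCLEOTIDES = PURINES | PYRIMIDINES


def pairwise_stats(seq_a, seq_b):
    """Return (p_distance, transitions, transversions) for two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Treat RNA and DNA alike by mapping U to T.
    seq_a = seq_a.upper().replace("U", "T")
    seq_b = seq_b.upper().replace("U", "T")
    ti = tv = compared = 0
    for a, b in zip(seq_a, seq_b):
        # Assumption: skip gaps and ambiguity codes instead of modelling them.
        if a not in NUCLEOTIDES or b not in NUCLEOTIDES:
            continue
        compared += 1
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine is a transition.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            ti += 1
        else:
            tv += 1
    p = (ti + tv) / compared if compared else float("nan")
    return p, ti, tv


def saturation_points(alignment):
    """Return (taxon_a, taxon_b, p, ti/tv) for every pair of taxa.

    The ratio is None when a pair shows no transversions.
    """
    points = []
    for (name_a, seq_a), (name_b, seq_b) in combinations(alignment.items(), 2):
        p, ti, tv = pairwise_stats(seq_a, seq_b)
        points.append((name_a, name_b, p, ti / tv if tv else None))
    return points


if __name__ == "__main__":
    # Hypothetical toy alignment, for illustration only.
    toy = {"t1": "ACGTACGTAC", "t2": "GCGTACGTAC", "t3": "CCGTACGTAT"}
    for row in saturation_points(toy):
        print(row)
```

Plotting the returned ti/tv values (or the transition counts) against p for all pairs gives the kind of scatter in which a plateau would be read as a sign of saturation.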
To avoid these problems, several authors have used conservative genes such as elongation factor 1α, 18S and/or 28S rDNA (e.g., Regier and Shultz, 1997; Struck et al., 2002a, 2007; Xia et al., 2003). However, even in these genes certain regions may be saturated and therefore mislead reconstruction. In contrast, excluding problematic regions may leave too little phylogenetic information to resolve the deep branches (Xia et al., 2003).

Molecular Phylogenetics and Evolution 48 (2008) 628–645. doi:10.1016/j.ympev.2008.05.015
* Corresponding author. Address: Universität Osnabrück, FB05 Biologie/Chemie, AG Zoologie, Barbarastraße 11, D-49076 Osnabrück, Germany. Fax: +49 541 969 2587.
E-mail addresses: struck@biologie.uni-osnabrueck.de (T.H. Struck), nesnidal@gmail.com (M.P. Nesnidal), Purschke@biologie.uni-osnabrueck.de (G. Purschke), ken@auburn.edu (K.M. Halanych).

Traditionally, saturation is shown by plotting either numbers of substitutions or the transition-to-transversion (ti/tv) ratios of all pairwise comparisons of taxa in an alignment against the genetic distances (p) between these pairs (e.g., Halanych and Robinson, 1999; Jördens et al., 2004; Struck et al., 2002a). In the latter case, convergence upon a ti/tv value of 1 is expected with the approach of