Full modeling versus summarizing gene-tree uncertainty: Method choice and species-tree accuracy

L. Lacey Knowles 1, Hayley C. Lanier 1, Pavel B. Klimov, Qixin He

Department of Ecology and Evolutionary Biology, Museum of Zoology, University of Michigan, 1109 Geddes Ave., Ann Arbor, MI 48109-1079, USA

Article info
Article history: Received 11 December 2011; Revised 6 April 2012; Accepted 8 July 2012; Available online 23 July 2012
Keywords: Bayesian inference; Coalescence; Gene tree; Maximum likelihood; Species tree

Abstract
With the proliferation of species-tree methods, empiricists now have to confront the daunting task of method choice. Such decisions might be made based on theoretical considerations alone. However, the messiness of real data means that theoretical ideals may not hold in practice (e.g., with convergence of complicated MCMC algorithms and computational times that limit analyses to small data sets). On the other hand, simplifying assumptions made by some approaches may compromise the accuracy of species-tree estimates. Here we examine the purported tradeoff between accuracy and computational simplicity for species-tree analysis, focusing on the different ways the approaches treat gene-tree uncertainty. By considering a diversity of species trees, as well as different sampling designs and total sampling efforts, we not only compare the accuracy of species-tree estimates across methods, but we also partition the variation in accuracy across factors to identify their relative importance. This analysis shows that although the method of analysis affects accuracy, other factors – namely, the history of species divergence and aspects of the sampling design – have a larger impact. Despite a full modeling of gene-tree uncertainty (e.g., using a Bayesian framework), species-tree estimates may not be accurate, particularly for recent diversification histories.
Nevertheless, we demonstrate how factors within the control of the empirical investigator (e.g., decisions about sampling) improve the accuracy of species-tree estimates, and more so than the method of analysis. Lastly, with much of the attention on species-tree analyses focused on the discord among loci arising from the coalescent, this work also highlights a previously overlooked key determinant of species-tree accuracy for recent divergences – the level of genetic variation at a locus, which has important implications for improving species-tree estimates in practice.
© 2012 Elsevier Inc. All rights reserved.

1. Introduction

The availability of multilocus sequence data and species-tree approaches that explicitly model the discord frequently observed among the gene trees of individual loci (Knowles and Kubatko, 2010) means that empiricists are increasingly faced with the question: what method to use? Methods differ with respect to the information extracted from the data. Some explicitly model all sources of variation in the data, including discordance in the genealogies across loci arising from the coalescent and uncertainties in estimates of the gene trees (Heled and Drummond, 2010; Liu, 2008), whereas others summarize aspects of the data to estimate a species tree (e.g., by summarizing gene-tree discord, or by modeling only the variation arising from the coalescent and not the mutational process; Kubatko et al., 2009; Maddison and Knowles, 2006). Despite the recent proliferation of methods for species-tree inference (Heled and Drummond, 2010; Knowles, 2009; Liu, 2008; Liu et al., 2009), detailed testing of the relative accuracy of the methods has lagged behind. This gap in information reflects two difficulties with performing such evaluations.
First, although the initial studies on species-tree estimation showed that species relationships could be estimated accurately despite discord across loci (Carstens and Knowles, 2007; Edwards et al., 2007; Maddison and Knowles, 2006), the accuracy of species-tree estimates depends strongly on the actual history of species diversification and the sampling design of the study (Chung and Ané, 2011; Eckert and Carstens, 2008; Huang et al., 2010; McCormack et al., 2009). This means that both a diversity of species trees and a diversity of sampling designs need to be considered before any general conclusions about the relative accuracy of species-tree approaches can be made. Second, the computational demands of some of the methods have slowed evaluation, especially with respect to comparing the relative accuracy across methodologies. For example, comparisons among methods that include the computationally demanding Bayesian programs are limited to specific case studies (Cranston, 2010; Flórez-Rodríguez et al., 2011; Kubatko and Gibbs, 2010; Linnen, 2010; Than and Nakhleh, 2009) or simulations under a set of

Molecular Phylogenetics and Evolution 65 (2012) 501–509
1055-7903/$ - see front matter © 2012 Elsevier Inc. All rights reserved.
http://dx.doi.org/10.1016/j.ympev.2012.07.004
Corresponding author. Fax: +1 734 763 4080.
E-mail addresses: knowlesl@umich.edu (L.L. Knowles), hclanier@umich.edu (H.C. Lanier), pklimov@umich.edu (P.B. Klimov), heqixin@umich.edu (Q. He).
1 These authors contributed equally to this work.