Syst. Biol. 50(4):525–539, 2001

Bias in Phylogenetic Estimation and Its Relevance to the Choice between Parsimony and Likelihood Methods

DAVID L. SWOFFORD,1,6 PETER J. WADDELL,2 JOHN P. HUELSENBECK,3 PETER G. FOSTER,1,7 PAUL O. LEWIS,4 AND JAMES S. ROGERS5

1 Laboratory of Molecular Systematics, National Museum of Natural History, Smithsonian Institution Museum Support Center, 4210 Silver Hill Road, Suitland, Maryland 20746, USA; E-mail: swofford@lms.si.edu
2 Institute of Molecular BioSciences, Massey University, Palmerston North, New Zealand; E-mail: waddell@onyx.si.edu
3 Department of Biology, University of Rochester, Rochester, New York 14627, USA; E-mail: johnh@brahms.biology.rochester.edu
4 Department of Ecology and Evolutionary Biology, The University of Connecticut, U-43, 75 N. Eagleville Road, Storrs, Connecticut 06269-30437, USA; E-mail: plewis@uconnvm.uconn.edu
5 Department of Biological Sciences, University of New Orleans, New Orleans, Louisiana 70148, USA; E-mail: jsrogers@uno.edu

It is now widely recognized that under relatively simple models of stochastic change, phylogenetic inference methods can actively mislead investigators attempting to estimate evolutionary trees from molecular sequences and other data. One instance of this phenomenon is “long-branch attraction,” in which some pairs of taxa have a higher probability of sharing the same character state because of parallel or convergent changes along long branches than do taxa that are more closely related because they have retained some same state from a common ancestor. Methods that systematically underestimate the actual amount of divergence may then become statistically inconsistent or “positively misleading” (Felsenstein, 1978; Hendy and Penny, 1989), estimating an incorrect tree with an increasing certainty as the amount of character data increases.
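The effect described above can be illustrated with a small numerical sketch. It is not taken from the paper: it assumes a four-taxon tree ((A,B),(C,D)), a two-state symmetric model, and hypothetical per-branch change probabilities (`p_long` for the branches leading to A and C, `q_short` for the branches leading to B and D and for the internal branch). The function names are likewise illustrative. The sketch computes the exact probability of each parsimony-informative site pattern and then simulates sites to see which split parsimony would favor.

```python
import itertools
import random

SPLITS = ("AB|CD", "AC|BD", "AD|BC")  # AB|CD is the true split in this sketch


def classify(a, b, c, d):
    """Return the split supported by a parsimony-informative site, else None."""
    if a == b and c == d and a != c:
        return "AB|CD"
    if a == c and b == d and a != b:
        return "AC|BD"
    if a == d and b == c and a != b:
        return "AD|BC"
    return None


def pattern_probs(p_long, q_short):
    """Exact probabilities of the three informative patterns (assumed model)."""
    # Change probabilities on the branches to A, B, C, D and the internal branch.
    branch_probs = (p_long, q_short, p_long, q_short, q_short)
    probs = dict.fromkeys(SPLITS, 0.0)
    for flips in itertools.product((0, 1), repeat=5):
        weight = 1.0
        for flipped, pr in zip(flips, branch_probs):
            weight *= pr if flipped else 1.0 - pr
        fa, fb, fc, fd, fi = flips
        # Ancestral state fixed at 0; patterns depend only on which branches change.
        split = classify(fa, fb, fi ^ fc, fi ^ fd)
        if split is not None:
            probs[split] += weight
    return probs


def simulate_counts(p_long, q_short, n_sites, rng):
    """Simulate n_sites binary characters and count informative patterns."""
    counts = dict.fromkeys(SPLITS, 0)
    for _ in range(n_sites):
        x = rng.randint(0, 1)                 # state at the ancestor of A and B
        y = x ^ (rng.random() < q_short)      # state at the ancestor of C and D
        a = x ^ (rng.random() < p_long)
        b = x ^ (rng.random() < q_short)
        c = y ^ (rng.random() < p_long)
        d = y ^ (rng.random() < q_short)
        split = classify(a, b, c, d)
        if split is not None:
            counts[split] += 1
    return counts


def parsimony_choice(counts):
    """For four taxa, parsimony prefers the split with the most supporting sites."""
    return max(counts, key=counts.get)


if __name__ == "__main__":
    print(pattern_probs(0.4, 0.05))
    for n in (100, 1000, 10000):
        counts = simulate_counts(0.4, 0.05, n, random.Random(1))
        print(n, counts, parsimony_choice(counts))
```

Under these assumed values the pattern grouping the two long branches (AC|BD) is expected more often than the pattern supporting the true split (AB|CD), so adding sites only strengthens support for the wrong tree. With all five change probabilities equal and small, the true split is the most probable informative pattern.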
Although usually associated with parsimony methods, long-branch attraction can also afflict maximum likelihood and distance analyses when the assumed substitution models of these methods are strongly violated (e.g., Huelsenbeck and Hillis, 1993; Huelsenbeck, 1995; Waddell, 1995:377–404; Gaut and Lewis, 1995; Chang, 1996a; Lockhart et al., 1996; Sullivan and Swofford, 1997). In this case, although the methods are explicitly designed to deal with superimposed substitutions (multiple hits), the underlying models predict fewer of these than actually occur and thus do not go far enough in correcting for the problem. Inconsistency can also arise under parsimony, even when all branches have the same length (Kim, 1996), although in this case there must still be particular imbalances in the total lengths of the paths from internal nodes to tips of the tree; “long-path attraction” would describe this phenomenon.

6 Current address: The Natural History Museum, Cromwell Road, London SW7 5BD, U.K.; E-mail: p.foster@nhm.ac.uk
7 Current address: School of Computational Science and Information Technology, Florida State University, Tallahassee, Florida, 32306-4120.

Long-branch attraction has been widely used, and abused, in justifying choices of methods and in explaining anomalous results. Critics of the relevance of long-branch attraction and related artifacts have generally taken two tacks. The first (e.g., Farris, 1983) claims that the demonstration of long-branch attraction requires simple and unrealistic models of evolutionary change. As pointed out by Kim (1996), this argument lacks force because conditions that lead to inconsistency are much more general and complex than those outlined by Felsenstein (1978); further relaxation of Felsenstein’s conditions simply exacerbates the problem.
The second line of argument (e.g., Siddall and Kluge, 1997) follows from the fact that “truth” is unknowable in science generally; because it is not possible to be certain that the analysis of a real data set has been compromised by long-branch attraction, the ability of a method to converge, in principle, to the correct solution with increasing amounts of data is irrelevant. In this view, “‘accuracy’ is rendered empty as an empirical claim” (Siddall and Kluge, 1997:318).

Proponents of model-based (or statistical) methods that seek to avoid inconsistency attributable to long-branch or long-path artifacts have not been dissuaded by this argument. They certainly appreciate