MOLECULAR PHYLOGENETICS AND EVOLUTION
Vol. 4, No. 1, March, pp. 64-71, 1995

A Frequency-Dependent Significance Test for Parsimony

MIKE STEEL,* PETER J. LOCKHART,† AND DAVID PENNY†

*Mathematics Department, University of Canterbury, Christchurch, New Zealand; and †Molecular Genetics Unit, Massey University, Palmerston North, New Zealand

Received May 19, 1994; revised September 30, 1994

We describe techniques for assessing evolutionary trees constructed by the parsimony criterion when sequences exhibit irregular base compositions. In particular, we extend a recently described frequency-dependent significance test to handle any number of taxa and describe a modification of the Kishino-Hasegawa sites test. These modifications are useful for detecting historical signals beyond those patterns that arise purely from irregular base compositions between the compared sequences. We apply the test to extend our earlier studies on chloroplast origins using 16S rDNA sequences, where a failure to compensate for irregular base compositions between the compared sequences provides statistically significant support for unjustified phylogenetic inferences. We also describe how the techniques can be modified to determine how "tree-like" data are, given independent variation in the base frequencies. © 1995 Academic Press, Inc.

One of the earliest, and still most widely used, methods for constructing phylogenetic trees is the maximum parsimony technique. Given a tree T, each of whose leaves corresponds to an aligned sequence, and a collection C of aligned sequences, the length of T for C, denoted L(C, T), is the least number of point mutations (substitutions) that needs to occur across the edges of T to account for the observed variation in the sequences.

To make this notion more precise, it is useful to regard a collection C of k parsimony sites in n aligned sequences as k functions $X_1, \ldots, X_k$, where each $X_j$ assigns sequence i (i = 1, ...
, n) one of r possible states (r = 4 for DNA sequences; r = 2 for purine/pyrimidine sequences; r = 20 for amino acid sequences). For any tree T whose leaves (degree-one vertices) are numbered 1, ..., n, let $L(X_j, T)$ be the minimal number of edges of T which must have different states assigned to their ends in order to extend the function $X_j$ to all the vertices of T (an extension which realizes this minimization is said to be minimal). The length of T for C, written L(C, T), is the sum

\[
L(C, T) = \sum_{j=1}^{k} L(X_j, T),
\]

which can be computed efficiently, indeed in O(nk) steps, using Fitch's algorithm (see Hartigan, 1973). However, minimizing this function [finding the tree that minimizes L(C, T)] is not easy, and a fast algorithm is unlikely to exist since this problem has been shown to be NP-hard (Graham and Foulds, 1982). Nevertheless, a branch-and-bound algorithm due to Hendy and Penny (1982) works acceptably fast on "good" data for values of n up to about 20.

The parsimony principle regards T as a better estimate than T' of the true evolutionary tree whenever T requires fewer mutations than T'; that is, whenever L(C, T) < L(C, T'). Consequently, the tree (or trees) that minimizes L(C, T), the maximum parsimony tree(s), is taken as the best estimate of the "true" tree. Many phylogenetic studies are based on this criterion (Stewart, 1993).

There are two problems associated with this otherwise appealing and simple scheme. First, it has been known for many years (Felsenstein, 1978) that under simple stochastic models of nucleotide substitution, parsimony can be statistically inconsistent for certain parameter choices (constituting the so-called "Felsenstein Zone"). That is, the method will tend to select an incorrect tree with a probability tending to 1 as the sequence length grows (however, as pointed out by Steel et al.
(1993b), if parsimony is applied to suitably transformed data, parsimony will be consistent for certain nucleotide substitution models). Defenders of parsimony have pointed out that the assumptions implicit in these stochastic models are overly severe and thereby unrealistic; others (notably "pattern" cladists) contend that it is better not to assume anything about the evolutionary process. While we do not agree with this second position, let us turn instead to a second problem.

A further difficulty with parsimony arises when the sequences exhibit variation in the frequency of their states, acquired independently and not due to shared ancestry. In this case parsimony will tend to group together sequences according to their base compositions. This problem has been highlighted recently (particularly for ribosomal RNA) by a number of authors [see Lockhart et al. (1992), Hasegawa and Hashimoto (1993), Olsen and Woese (1993), and Klenk et al. (1994)].
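
As an illustration of the tree length L(C, T) defined in the introduction, the following is a minimal sketch of Fitch's algorithm for a single tree. It rests on several assumptions that are not part of the paper: the tree is binary and rooted at an arbitrary point (the length does not depend on where the root is placed), it is encoded as nested Python tuples whose leaves are integers 0, ..., n-1 (rather than 1, ..., n), and each site is a string whose i-th character is the state of sequence i. The function names `fitch_site` and `parsimony_length` and the toy sites are hypothetical.

```python
def fitch_site(tree, site):
    """Fitch's bottom-up pass for one site.

    tree: an int (leaf index) or a 2-tuple of subtrees (assumed encoding).
    site: a string; site[i] is the state of sequence i at this site.
    Returns (candidate state set at this node, changes counted below it).
    """
    if isinstance(tree, int):
        # A leaf: its state is fixed by the data and costs nothing.
        return {site[tree]}, 0
    left, right = tree
    left_states, left_changes = fitch_site(left, site)
    right_states, right_changes = fitch_site(right, site)
    shared = left_states & right_states
    if shared:
        # The children can agree on a state: no extra change is needed.
        return shared, left_changes + right_changes
    # The children cannot agree: one change on an edge below this node.
    return left_states | right_states, left_changes + right_changes + 1


def parsimony_length(tree, sites):
    """L(C, T): the sum over sites of the per-site minimal change count."""
    return sum(fitch_site(tree, site)[1] for site in sites)


# Toy data for four sequences (illustrative only, not from the paper).
sites = ["AAGG", "AAGG", "ACAC", "AAAA"]
tree_a = ((0, 1), (2, 3))  # groups sequences 0 with 1, and 2 with 3
tree_b = ((0, 2), (1, 3))  # groups sequences 0 with 2, and 1 with 3

length_a = parsimony_length(tree_a, sites)  # 1 + 1 + 2 + 0 = 4
length_b = parsimony_length(tree_b, sites)  # 2 + 2 + 1 + 0 = 5
```

On this toy input `tree_a` has the smaller length, so the parsimony principle would prefer it to `tree_b`. Each site is processed with one visit per vertex, which is in line with the O(nk) bound quoted in the text. The sketch only scores a given tree; it does not search over trees.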