MOLECULAR PHYLOGENETICS AND EVOLUTION
Vol. 4, No. 1, March, pp. 64-71, 1995
A Frequency-Dependent Significance Test for Parsimony
MIKE STEEL,* PETER J. LOCKHART,† AND DAVID PENNY†
* Mathematics Department, University of Canterbury, Christchurch, New Zealand, and †Molecular Genetics Unit, Massey University,
Palmerston North, New Zealand
Received May 19, 1994; revised September 30, 1994
We describe techniques for assessing evolutionary
trees constructed by the parsimony criterion when
sequences exhibit irregular base compositions. In
particular, we extend a recently described frequency-dependent
significance test to handle any number of
taxa and describe a modification of the Kishino-
Hasegawa sites test. These modifications are useful for
detecting historical signals beyond those patterns
which arise purely from irregular base compositions
between the compared sequences. We apply the test to
extend our earlier studies on chloroplast origins using
16S rDNA sequences, where a failure to compensate for
irregular base compositions between the compared
sequences provides statistically significant support for
unjustified phylogenetic inferences. We also describe
how the techniques can be modified to determine how
"tree-like" data are, given independent variation in the
base frequencies. © 1995 Academic Press, Inc.
One of the earliest, and still most widely used,
methods for constructing phylogenetic trees is the
maximum parsimony technique. Given a collection C of
aligned sequences and a tree T, each of whose leaves
corresponds to one of these sequences, the length of T
for C, denoted L(C, T), is the least number of point
mutations (substitutions) that need to occur across
the edges of T to account for the observed variation in
the sequences.
To make this notion more precise, it is useful to
regard a collection C of k parsimony sites in n aligned
sequences as k functions $X_1, \ldots, X_k$, where each $X_j$
assigns sequence i (i = 1, ..., n) one of r possible
states (r = 4 for DNA sequences; r = 2 for purine/pyrimidine
sequences; r = 20 for amino acid sequences).
For any tree T whose leaves (degree-one vertices)
are numbered 1, ..., n, let $L(X_j, T)$ be the minimal
number of edges of T which must have different
states assigned to their ends in order to extend the
function $X_j$ to all the vertices of T (an extension which
realizes this minimization is said to be minimal). The
length of T for C, written L(C, T), is the sum

$$L(C, T) = \sum_{j=1}^{k} L(X_j, T),$$
which can be computed efficiently, indeed in O(nk)
steps, using Fitch's algorithm (see Hartigan, 1973).
However, minimizing this function [finding the tree
that minimizes L(C, T)] is not easy, and a fast algo-
rithm is unlikely to exist since this problem has been
shown to be NP-hard (Graham and Foulds, 1982). Nev-
ertheless, a branch-and-bound algorithm due to Hendy
and Penny (1982) works acceptably fast on "good" data
for values of n up to about 20.
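The per-site computation behind L(C, T) can be illustrated with a minimal sketch of Fitch's algorithm. It rests on several assumptions that are not in the text: the tree is binary and is written as nested pairs of leaf indices, rooted on an arbitrary edge (the parsimony length of a binary tree does not depend on where the root is placed), leaves are numbered from 0 rather than 1, and the function names and the four short sequences are purely illustrative.

```python
def fitch_site_length(tree, states):
    """Return (candidate_state_set, change_count) for one site.

    tree   : a leaf index (int) or a pair (left, right) of subtrees.
    states : mapping from leaf index to the state observed at this site.
    """
    if isinstance(tree, int):
        # A leaf carries exactly its observed state and needs no change.
        return {states[tree]}, 0
    left_set, left_cost = fitch_site_length(tree[0], states)
    right_set, right_cost = fitch_site_length(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change at this vertex.
        return common, left_cost + right_cost
    # The children cannot agree: count one change and keep both options.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_length(tree, sequences):
    """Sum the per-site lengths over all k aligned columns, giving L(C, T)."""
    total = 0
    for j in range(len(sequences[0])):
        column = {i: seq[j] for i, seq in enumerate(sequences)}
        total += fitch_site_length(tree, column)[1]
    return total


# Hypothetical data: n = 4 sequences, k = 4 sites.
sequences = ["AAGT", "AAGC", "CCGT", "CCTC"]

# The three binary tree shapes on four leaves, each rooted on its middle edge.
tree_a = ((0, 1), (2, 3))
tree_b = ((0, 2), (1, 3))
tree_c = ((0, 3), (1, 2))

for name, tree in [("a", tree_a), ("b", tree_b), ("c", tree_c)]:
    print(name, parsimony_length(tree, sequences))
```

For these made-up sequences the three trees have lengths 5, 6, and 7, respectively. Each site is handled in a single pass over the vertices of the tree, which is consistent with the O(nk) bound quoted above.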
The parsimony principle regards T as a better
estimate than T' of the true evolutionary tree whenever T
requires fewer mutations than T'; that is, whenever
L(C, T) < L(C, T'). Consequently, the tree (or trees)
which minimizes L(C, T), the maximum parsimony
tree(s), is taken as the best estimate of the "true" tree.
Many phylogenetic studies are based on this criterion
(Stewart, 1993).
There are two problems associated with this otherwise
appealing and simple scheme. First, it has been
known for many years (Felsenstein, 1978) that under
simple stochastic models of nucleotide substitution,
parsimony can be statistically inconsistent for
certain parameter choices (constituting the so-called
"Felsenstein Zone"). That is, the method will tend to
select an incorrect tree with a probability tending to 1
as the sequence length grows [however, as pointed out
by Steel et al. (1993b), if parsimony is applied to
suitably transformed data, it will be consistent
for certain nucleotide substitution models]. Defenders
of parsimony have pointed out that the assumptions
implicit in these stochastic models are overly severe
and thereby unrealistic; others (notably "pattern" clad-
ists) contend that it is better not to assume anything
about the evolutionary process. While we do not agree
with this second position, let us turn instead to a sec-
ond problem.
A further difficulty with parsimony arises when the
sequences exhibit variation in the frequency of their
states, acquired independently and not due to shared an-
cestry. In this case parsimony will tend to group to-
gether sequences according to their base compositions.
This problem has been highlighted recently (particu-
larly for ribosomal RNA) by a number of authors [see
Lockhart et al. (1992), Hasegawa and Hashimoto (1993),
Olsen and Woese (1993), and Klenk et al. (1994)].
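Variation in state frequencies of the kind described above can be inspected directly before any tree is built. The following is a minimal sketch only: the function name, the default four-letter DNA alphabet, and the two short example sequences are assumptions made for illustration, and the calculation is a simple diagnostic, not the significance test developed in this paper.

```python
from collections import Counter


def base_frequencies(sequence, alphabet="ACGT"):
    """Return the relative frequency of each state in one aligned sequence.

    Characters outside `alphabet` (gaps, ambiguity codes) are ignored.
    """
    counts = Counter(sequence)
    total = sum(counts[base] for base in alphabet)
    if total == 0:
        raise ValueError("sequence contains no states from the alphabet")
    return {base: counts[base] / total for base in alphabet}


# Hypothetical sequences with clearly different compositions.
gc_rich = "GGCCGCAT"
at_rich = "ATTAATGC"

for label, seq in [("gc_rich", gc_rich), ("at_rich", at_rich)]:
    print(label, base_frequencies(seq))
```

Sequences whose frequency profiles are similar for reasons other than shared ancestry are the ones that parsimony may tend to group together.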