Does random tree puzzle produce Yule–Harding trees in the many-taxon limit?

Sha Zhu, Mike Steel ⇑
Biomathematics Research Centre, University of Canterbury, Christchurch, New Zealand

Article info
Article history: Received 21 August 2012. Received in revised form 2 February 2013. Accepted 8 February 2013. Available online 19 February 2013.
Keywords: Phylogenetic tree; Tree-puzzle; Pólya urn; Centroid vertex

Abstract
It has been suggested that a random tree puzzle (RTP) process leads to a Yule–Harding (YH) distribution when the number of taxa becomes large. In this study, we formalize this conjecture, and we prove that the two tree distributions converge for two particular properties, which suggests that the conjecture may be true. However, we present statistical evidence that, while the two distributions are close, the RTP appears to converge on a different distribution than does the YH. By way of contrast, in the concluding section we show that the maximum parsimony method applied to random two-state data leads to a very different (PDA, or uniform) distribution on trees.
© 2013 Elsevier Inc. All rights reserved.

1. Introduction

The maximum likelihood (ML) approach [4,6,5] is generally considered to be a reliable way of estimating phylogenies from DNA sequences. However, ML is not always feasible for large numbers of species, because of the intensive computation required. Methods that use 'four point subsets' [3] reduce the complexity of the problem, and have assisted numerous studies [2,15,20,21]. The subtree on four points is known as a quartet tree.

Quartet puzzling (QP) [21] is an algorithm to infer a tree on n taxa by using the quartet trees derived from DNA sequences. It first computes the likelihoods of all $\binom{n}{4}$ quartets. As there are three possible topologies for any four taxa, the quartet tree with the greatest likelihood is used (any ties are broken uniformly at random). In the puzzling step, the order of inserting new leaf nodes is randomized. 
A seed tree is built from the first four elements of the ordered leaf node sequence. From this point on, leaves are attached sequentially by the following procedure: when a new leaf x is to be attached to the existing tree T, quartet trees are built from quartets formed from x and all subsets of size three that are chosen from the existing leaf set. If the ML quartet tree of $\{i, j, k, x\}$ is $ij|kx$, then weight 1 is added to the edges on the path in T connecting the two leaves i and j. This process is repeated for all such quartet trees, and x is then attached to the edge that has the minimal weight. An example is given in Fig. 1. Since the order of adding leaves is randomized, this can lead to variation in the resulting tree topologies, and so a consensus tree of numerous replicates is used as the output tree. The program Tree puzzle (TP) [16] is a parallel version of QP, which performs independent puzzling steps simultaneously.

The trees generated by either the QP or TP process depend on the biological sequences we have for the taxa. To investigate how the TP process behaves on randomized quartets, Vinh et al. [22] performed a simulation study on a so-called random tree puzzle (RTP) process. This assumes that no prior molecular information is given. Therefore, for each set of four taxa, all three quartet tree topologies are equally likely. The authors compared the empirical probabilities of tree topologies against the theoretical probabilities from the proportional to distinguishable arrangements (PDA) model and the Yule–Harding (YH) model. Table 1 of [22] reveals that the RTP's empirical probabilities are very close to the YH theoretical probabilities (indeed, there are two cases where these probabilities are identical). As it seems that the differences between the empirical and theoretical probabilities decrease as the number of taxa increases, Vinh et al. [22] suggest that the RTP process converges to the YH process as n (the number of taxa) grows. 
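The leaf-insertion procedure described above, with every quartet topology drawn uniformly at random as in the RTP process, can be sketched as follows. This is only a minimal illustration under stated assumptions, not the Tree puzzle implementation: the function names (`path_edges`, `attach_leaf`, `random_tree_puzzle`), the adjacency-dictionary representation of the unrooted tree, the uniformly random choice of the seed quartet topology, and the uniform tie-breaking among minimum-weight edges are all choices made here for illustration.

```python
import itertools
import random


def path_edges(adj, start, goal):
    """Return the edges (as frozensets) on the unique tree path start -> goal."""
    parent = {start: None}
    stack = [start]
    while stack:
        node = stack.pop()
        if node == goal:
            break
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                stack.append(nxt)
    edges = []
    node = goal
    while parent[node] is not None:
        edges.append(frozenset((node, parent[node])))
        node = parent[node]
    return edges


def attach_leaf(adj, edge, leaf, new_internal):
    """Subdivide `edge` with a new internal vertex and hang `leaf` from it."""
    u, v = tuple(edge)
    adj[u].remove(v)
    adj[v].remove(u)
    adj[new_internal] = {u, v, leaf}
    adj[u].add(new_internal)
    adj[v].add(new_internal)
    adj[leaf] = {new_internal}


def random_tree_puzzle(n, rng=None):
    """Build one unrooted binary tree on leaves 0..n-1 by a random puzzling step.

    Internal vertices are labelled n, n+1, ... (a labelling assumed here).
    """
    if n < 4:
        raise ValueError("need at least four taxa")
    rng = rng or random.Random()
    order = list(range(n))
    rng.shuffle(order)  # random insertion order of the leaves

    # Seed tree on the first four leaves; its topology is chosen uniformly
    # at random among the three possibilities (an assumption of this sketch).
    a, b, c, d = order[:4]
    partner = rng.choice([b, c, d])
    rest = [t for t in (b, c, d) if t != partner]
    u, v = n, n + 1
    adj = {a: {u}, partner: {u}, rest[0]: {v}, rest[1]: {v},
           u: {a, partner, v}, v: {rest[0], rest[1], u}}
    next_internal = n + 2
    placed = [a, b, c, d]

    for x in order[4:]:
        weight = {frozenset((p, q)): 0 for p in adj for q in adj[p]}
        for triple in itertools.combinations(placed, 3):
            # Random quartet: k is grouped with x, so the quartet is ij|kx
            # and the path between i and j receives weight 1.
            k = rng.choice(triple)
            i, j = [t for t in triple if t != k]
            for e in path_edges(adj, i, j):
                weight[e] += 1
        low = min(weight.values())
        # Ties among minimum-weight edges are broken uniformly (assumed).
        edge = rng.choice([e for e in weight if weight[e] == low])
        attach_leaf(adj, edge, x, next_internal)
        next_internal += 1
        placed.append(x)
    return adj
```

Calling `random_tree_puzzle(8, random.Random(1))` returns an adjacency dictionary in which the eight leaves have degree 1 and the six internal vertices have degree 3. The sketch enumerates all triples for each new leaf, so it is intended for small n only.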
The authors provided further evidence for their conjecture by comparing some properties of RTP trees with YH trees. Recall that a cherry in a tree is a pair of leaves that are adjacent to the same vertex. Vinh et al. [22] found that the mean and variance of the number of cherries under RTP simulation were similar to the theoretical values under the YH process [13].

Mathematical Biosciences 243 (2013) 109–116. http://dx.doi.org/10.1016/j.mbs.2013.02.003
⇑ Corresponding author. Tel.: +64 21329705. E-mail addresses: sha.joe.zhu@gmail.com (S. Zhu), mike.steel@canterbury.ac.nz (M. Steel).

Although Vinh et al. [22] provided evidence to suggest that the two distributions become very similar as n grows, they did not provide a formal statement or proof of their claim that the two distributions converge. In this project, we investigate the RTP process further using mathematical and statistical methods. Our results demonstrate that certain properties of the trees that are near the 'periphery' of the tree (i.e. near the leaves) converge