Neighbor-Joining with Interval Methods D. Levy 1 , Raazesh Sainudiin 2 , R. Yoshida 3 , and L. Pachter 1 1 University of California at Berkeley, Berkeley, CA 94720, USA; 2 Cornell University, Ithaca NY 14850; 3 Duke University, Durham, NC 27708, USA The software package MJOIN is available at http://bio.math.berkeley.edu/mjoin/ Introduction The Neighbor-Joining algorithm is a recursive procedure to reconstruct a phyloge- netic tree using a transformation of pairwise distances between leaves for identifying cherries in the tree. Pachter and Speyer showed that we can recover an n-leaf tree from the weights of m-leaf subtrees if n 2m - 1 [PS04]. We generalized the cherry picking criterion with estimates of the weights of m-leaf subtrees. We showed that a reconstructed tree from such weights is more accurate than one using pairwise distances. This leads to an improved neighbor-joining algorithm whose total running time is still polynomial in the number of taxa. Neighbor Joining with Pairwise Distances Theorem. (the cherry picking criterion) [SN87, SK88] Suppose D(ij ) is a pairwise distance between taxa i and j . Then, {i, j } is a cherry if A ij = D(ij ) - (r i + r j )/(n - 2), where r i := n k =1 D(ik ), is minimal. Idea. Initialize a star-like tree and find a cherry. Then we compute branch length from the interior node to each leaf. Repeat this process recursively until we find all cherries. 1 2 3 4 8 7 6 5 2 5 4 1 2 1 5 1 2 2 5 2 1 8 8 8 8 8 7 7 7 7 7 6 6 6 6 6 5 5 5 5 5 4 4 4 4 4 3 3 3 3 3 2 2 2 2 2 1 1 1 1 1 Y X Y X Y X X Y Y X Z Figure 1: The traditional Neighbor Joining with pairwise distances. Neighbor Joining with Subtree Weights Notation. Let [n] denote the set {1, 2, ..., n} and [n] m denote the set of all m-element subsets of [n]. Definition. A m-dissimilarity map is a function D : [n] m R 0 . In terms of phylogeny, this corresponds to the weights of m-subtree weights of a tree T . Theorem. Let D m be be an m-dissimilarity map on n leaves, D m : [n] m R 0 correspond to the weights of m-subtree weights of a tree T and we define S (ij ) := X ( [n]\{i, j } m-2 ) D m (ijX ). Then S (ij ) is a tree metric. Furthermore, if T is the additive tree corresponding to this tree metric then T and T have the same tree topology and there is an invertible linear map between their edge weights. Algorithm. (Neighbor Joining with Subtree Weights) Input: n many DNA sequences. Output: A phylogenetic tree T with n leaves. 1. Compute all m-subtree weights via the maximum likelihood. 2. Compute S (ij ) for each pair of leaves i and j . 3. Apply Neighbor Joining method with a tree metric S (ij ) and obtain additive tree T . 4.Using a linear mapping, obtain a weight of each internal edge and each leaf edge of T . Cherry Picking Theorem Theorem. Let T be a tree with n leaves and no nodes of degree 2 and let m be an integer satisfying 2 m n - 2. Let D : [n] m R 0 be the m-dissimilarity map corresponding to the weights of the subtrees of size m in T . If Q D (ab) is a minimal element of the matrix Q D (ab)= n - 2 m - 1 X ( [n]\{i, j } m-2 ) D(ijX ) - X ( [n]\{i} m-1 ) D(iX ) - X ( [n]\{j } m-1 ) D(jX ) then {a, b} is a cherry in the tree T . Note. The theorem by Saitou-Nei and Studier-Keppler is a corollary from Cherry Picking Theorem. Time Complexity If m 3, the time complexity of this algorithm is O (n m ), where n is the number of leaves of T and if m = 2, then the time complexity of this algorithm is O (n 3 ). Note: The running time complexity of the algorithm is O (n 3 ) for both m = 2 and m = 3. Interval Methods In [LYP04], Dissimilarity maps are computed via fastDNAml which implements a gradient flow algorithm with floating-point arithmetic. Instead, apply the rigorously enclosed maximum likelihood estimations [Sai04]. Dissimilarity maps computed via the rigorously enclosed MLEs are guaranteed to be enclosed. Thus, reconstructed trees via the generalized NJ method with these dissimilarity maps are more accurate. Computational Results Problem: Find the NJ tree for 21 S-locus receptor kinase (SRK) sequences [SWY + 05] involved in the self/nonself discriminating self-incompatibility system of the mustard family [Nas02]. Result: Symmetric difference (Δ) between 10, 000 trees sampled from the likeli- hood function via MCMC and the trees reconstructed by 5 methods. DNAml was used in two ways: DNAml(A) is a basic search with no global rearrange- ments, whereas DNAml(B) applies a broader search with global rearrangements and 100 jumbled inputs. Δ NRGNJ fastDNAml DNAml(A) DNAml(B) TrExML 0 0 0 2 3608 0 2 0 0 1 471 0 4 171 6 3619 5614 0 6 5687 5 463 294 5 8 4134 3987 5636 13 71 10 8 5720 269 0 3634 12 0 272 10 0 652 14 0 10 0 0 5631 16 0 0 0 0 7 References [LYP04] D Levy, R Yoshida, and L Pachter. Neighbor joining with subtree weights. preprint, 2004. [Nas02] JB Nasrallah. Recognition and rejection of self in plant reproduction. Science, 296:305–308, 2002. [PS04] L. Pachter and D. Speyer. Reconstructing trees from subtree weights. Applied Mathematics Letters, 17:615 – 621, 2004. [Sai04] R Sainudiin. Enclosing the maximum likelihood of the simplest DNA model evolving on fixed topologies: towards a rigorous framework for phylogenetic inference. Technical Report BU1653-M, Department of Biol. Stats. and Comp. Bio., Cornell University, 2004. [SK88] J. A. Studier and K. J. Keppler. A note on the neighbor-joining method of saito and nei. Mol. Biol. Evol., 5:729 – 731, 1988. [SN87] N. Saitou and M. Nei. The neighbor joining method: a new method for reconstructing phylogenetic trees. 1987. [SWY + 05]R Sainudiin, SW Wong, K Yogeeswaran, J Nasrallah, Z Yang, and R Nielsen. Detecting site- specific physicochemical selective pressures: applications to the class-I HLA of the human major histocompatibility complex and the SRK of the plant sporophytic self-incompatibility system. Journal of Molecular Evolution, in press, 2005.