Topology Improves Phylogenetic Motif Functional Site Predictions

Dukka B. KC and Dennis R. Livesay

Abstract—Prediction of protein functional sites from sequence-derived data remains an open bioinformatics problem. We have developed a phylogenetic motif (PM) functional site prediction approach that identifies functional sites from alignment fragments that parallel the evolutionary patterns of the family. In our approach, PMs are identified by comparing the tree topology of each alignment fragment to that of the complete phylogeny. Herein, we bypass the phylogenetic reconstruction step and identify PMs directly from distance matrix comparisons. In order to optimize the new algorithm, we consider three different distance matrices and 13 different matrix similarity scores. We assess the performance of the various approaches on a structurally nonredundant data set that includes three types of functional site definitions. Without exception, the original approach outperforms the distance matrix variants. While the distance matrix methods fail to improve upon the original approach, our results are important because they clearly demonstrate that the improved predictive power is based on the topological comparisons. This means that phylogenetic trees are a straightforward, yet powerful, way to improve functional site prediction accuracy. While complementary studies have shown that topology improves predictions of protein-protein interactions, this report represents the first demonstration that trees improve functional site predictions as well.

Index Terms—Phylogenetic motif, functional site prediction, phylogenetic tree, distance matrix.

1 INTRODUCTION

Predicting functionally relevant information for a newly discovered protein is one of the most challenging tasks in this postgenomic era. Currently, there are two major paradigms within this realm.
The first is related to classification of a protein into its broad functional class (e.g., gene ontology [1] or enzyme classification number [2]). While this information is clearly important, it provides little mechanistic insight. As such, the second paradigm uses computational strategies to predict functional sites that define the function and/or regulation of the protein. Not only do functional site descriptions allow for an improved understanding of how a given protein functions, they can also be used to identify functionally deleterious mutations. Consequently, there has been a steady output of new computational approaches for protein functional site prediction.

Current approaches can be classified as sequence-based (more precisely, alignment-based), structure-based, or a combination thereof [3], [4]. The most common methods are based on conservation analysis across a multiple sequence alignment, which has been used to detect functional sites in myriad settings [5], [6], [7]. Several more sophisticated approaches that attempt to detect some additional sort of "feature" conservation have also been developed [8]. Most of these approaches attempt to identify positions whose variability is dictated by the functional evolution of the family [9], [10]. For example, evolutionary trace (ET) [9] and its variants [11], [12], [13] attempt to identify trace residues, which are individual alignment positions that are conserved within functionally distinct subfamilies. Once the trace residues have been identified, the most common usage of ET is to map these positions to structure, and structural clusters of ET+ conserved mutations are put forth as functional site predictions.

Previously, we have demonstrated that alignment fragments taken from a priori identified functional sites tend to conserve the overall familial phylogeny [14].
Subsequently, we reversed this scenario in order to predict protein functional sites by screening all possible alignment fragments for this phylogenetic feature [10]. Our method uses a sliding alignment window algorithm to scan all possible fragments. A tree is generated for each fragment and compared to the overall familial tree using a partition metric algorithm. Alignment fragments that most closely reflect the overall phylogeny are put forth as functional site predictions. The resultant phylogenetic motifs (PMs) are very likely to correspond to protein functional sites [10], [15], [16], [17], [18]. The predictive power of the PM approach was most recently demonstrated in our report that used PM information to make statistically significant improvements in prediction accuracy over the raw conservation score [19]. Our PM identification algorithm has been implemented in an easy-to-use Web server called MINER [20]. (Note that MINER has been moved to http://coit-apple01.uncc.edu/MINER.)

While the predictive value of the MINER approach has been shown repeatedly, one potential source of criticism results from building trees on such small alignment fragments. This concern is partially mitigated by our results, which show (via bootstrap analysis) that PM tree stability is acceptable [10]. Nevertheless, if methods based on comparison of the underlying distance matrices (versus trees) can be shown to perform equally well, this would represent an attractive alternative. Recently, Manning et al. [21] introduced the SMERFS algorithm that uses distance matrices as a source of evolutionary information. In a manner analogous to MINER,

226 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. 1, JANUARY/FEBRUARY 2011

The authors are with the Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, 9201 University City Blvd., Charlotte, NC 28223. E-mail: {dbahadur, drlivesa}@uncc.edu.

Manuscript received 12 Mar. 2009; revised 9 June 2009; accepted 16 June 2009; published online 9 July 2009. For information on obtaining reprints of this article, please send e-mail to: tcbb@computer.org, and reference IEEECS Log Number TCBB-2009-03-0034. Digital Object Identifier no. 10.1109/TCBB.2009.60.

1545-5963/11/$26.00 © 2011 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM
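The window-scanning idea behind PM identification, and the distance matrix alternative raised in the introduction, can be illustrated with a minimal sketch. This is not the MINER or SMERFS implementation. The p-distance measure, the Pearson correlation used as the matrix similarity score, the gap handling, and all function names (`p_distance`, `distance_vector`, `pearson`, `score_windows`) are assumptions chosen only for illustration. The sketch scores each alignment window by how closely its pairwise distance matrix tracks that of the full alignment.

```python
from itertools import combinations
from math import sqrt


def p_distance(a, b):
    """Fraction of mismatched positions between two aligned sequences.

    Columns where both sequences carry a gap are skipped (an assumption).
    """
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_vector(seqs):
    """Upper triangle of the pairwise distance matrix as a flat list."""
    return [p_distance(a, b) for a, b in combinations(seqs, 2)]


def pearson(u, v):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    du = [x - mean_u for x in u]
    dv = [y - mean_v for y in v]
    denom = sqrt(sum(x * x for x in du) * sum(y * y for y in dv))
    if denom == 0:
        return 0.0
    return sum(x * y for x, y in zip(du, dv)) / denom


def score_windows(alignment, width):
    """Score every window of `width` columns against the full alignment.

    Returns a list of (start_column, similarity) tuples.
    """
    full = distance_vector(alignment)
    scores = []
    for start in range(len(alignment[0]) - width + 1):
        fragment = [s[start:start + width] for s in alignment]
        scores.append((start, pearson(distance_vector(fragment), full)))
    return scores


if __name__ == "__main__":
    # Toy alignment, invented for illustration only.
    toy = ["ACDEFGHIKL", "ACDEFGHIKM", "ACDQFGHRKM", "TCDQFGYRKM"]
    for start, score in score_windows(toy, width=4):
        print(start, round(score, 3))
```

Windows whose scores lie closest to 1 are the ones whose distance patterns most resemble the whole family. A tree-based variant would replace `pearson` with a topological comparison of trees built from each fragment.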