Tpms: a Tree Pattern-matching Utility for Querying Gene Trees Collections

Thomas BIGOT, Vincent DAUBIN and Guy PERRIÈRE

Laboratoire de Biométrie et Biologie Évolutive, UMR 5558 CNRS, Université Claude Bernard – Lyon 1, 43 bd. du 11 Novembre 1918, 69622, Villeurbanne Cedex, France
{thomas.bigot,vincent.daubin,guy.perriere}@univ-lyon1.fr

Abstract

Tpms is a portable C++ program for retrieving gene trees from large collections according to user-defined tree patterns. It can be used for purposes such as ortholog searches or the identification of horizontal gene transfers. Documentation, source code, and Linux and MacOSX binaries can be freely downloaded at ftp://pbil.univ-lyon1.fr/pub/mol_phylogeny/tpms/.

Keywords

phylogenetic trees, tree pattern-matching, orthologs, horizontal gene transfers.

1 Introduction

Comparative genomics is a common approach in sequence analysis, and many biological results have been obtained through its use. Among the programs and packages developed for comparative genomics, those that use the information contained in phylogenetic trees are of special interest. Indeed, orthology detection methods based on phylogenetic trees usually perform better than the simpler (and easier to use) methods based on best reciprocal hits of sequence similarity scores [1].

In that context we developed tpms, a program that retrieves gene trees from a tree collection according to user-defined patterns. These patterns usually include constraints such as the nature of a node (duplication or speciation) or the content of a subtree. The program can therefore be used for ortholog searches, and also for any study that needs to retrieve sets of gene families whose phylogenetic trees match given constraints (e.g., identification of gene duplications or prediction of horizontal gene transfers).
2 System and methods

The tree pattern-matching algorithm used in tpms is a C++ version of the one from the RAP program implemented in FamFetch [2]. It requires the Bio++ [3, 4] and Boost [5] libraries. This new implementation is a standalone command-line binary and is not embedded in a graphical interface. Moreover, it no longer depends on the HOVERGEN, HOGENOM and HOMOLENS gene family databases [6], and it can be used on collections built by users. Binaries for Linux and MacOSX (Intel architectures only) and the source code are available at ftp://pbil.univ-lyon1.fr/pub/mol_phylogeny/tpms/.

3 Program use

First, the user needs to build a gene tree collection in the RAP format [7]. The program then accesses this collection when performing pattern searches. The collection can be built easily with the tpms_mkdb program, distributed with tpms. This tool takes a reference species tree, also in RAP format, and a set of individual gene trees in standard Newick format. The species tree may contain unresolved nodes (multifurcations), but the individual gene trees may not.

Tree patterns have to be written in an extended Newick format. The simplest constraints that can be introduced are the taxa found on the leaves of the pattern. The labels can stand for a given species (e.g., Homo sapiens) or for a larger taxonomic group (e.g., Primates). For instance, the pattern:

((Homo sapiens,Pan troglodytes),Rodentia)