GPX: A Tool for the Exploration and Visualization of Genome Evolution
(This is a longer version of the paper published in the proceedings)

Neha Nahar, Lutz Hamel, Maria S. Poptsova, J. Peter Gogarten

Department of Computer Science and Statistics, University of Rhode Island, 9 Greenhouse Road, Kingston, RI 02881. nnahar@cs.uri.edu, hamel@cs.uri.edu

Department of Molecular and Cell Biology, University of Connecticut, 91 North Eagleville Road, Storrs, CT 06269-3125. maria.poptsova@uconn.edu, gogarten@uconn.edu

Abstract

Early life on Earth has left many traces that can be used to reconstruct the history of life. These traces take the form of fossils, geological records, and information retained in living organisms. Gene sequences are now recognized as an invaluable document of life’s history on Earth. Ever since Darwin, the Tree of Life has provided a framework to study the evolution of organisms. However, comparative genome analyses have shown that genomes are mosaics where different parts have different histories. One of the reasons for this is the exchange of genes between species. Because of this horizontal gene transfer, the Tree of Life concept is giving way to a Web of Life, in which different parts of a genome possess evolutionary histories that differ from the accepted evolutionary history of the corresponding species. Clustering gene families based on the phylogenetic information they retain makes it possible to extract a majority consensus for the genomes’ evolution and to identify genes that have a conflicting phylogeny. The latter is of interest in the context of comparative genomics of prokaryotes because these conflicts point towards possible horizontal transfers of genes and metabolic pathways between divergent organisms. We have created a web-based tool, Gene Phylogeny eXplorer (GPX), that facilitates comparative genome analysis of different species.
GPX displays results as an interactive map that allows users to explore and interpret genomic data representing gene evolution. It allows the visualization of both consensus and conflicting evolutionary histories of genes. The novel aspect of our approach is that we do not analyze DNA sequences directly; instead, we use self-organizing maps to find structure (clusters) in a space spanned by all possible evolutionary relationships between the genomes in question. Since the number of possible evolutionary trees grows factorially with the number of genomes, we use smaller quanta of phylogenetic information, namely bipartitions, to represent the evolutionary relationships between genomes. The number of possible bipartitions grows exponentially with the number of genomes, and therefore much more slowly than the number of evolutionary trees, which makes bipartitions amenable to a computational approach. The structure of the resulting clusters, and in particular the patterns of bipartition support within these clusters, provide important information on the origin of individual genes. If a strongly supported bipartition for a gene conflicts with the consensus tree, then the conflict is most probably due to a horizontal gene transfer event.

1. Introduction

The Tree of Life has provided a framework to study the evolution of organisms [1]. Phylogenetic trees are used to depict the evolution of organisms or of molecules. However, comparative genome analyses have shown that genomes are mosaics where different parts have different histories [2-5]. These findings have called into question the validity of the tree concept, especially for prokaryotic species [6, 7]. Individual genes may travel from one species to another, a core of infrequently transferred genes might represent a tree-like organismal history, genomes that had independent evolutionary histories might have fused to form a new line of descent, and highways of gene sharing [8]