SciPhy: A Cloud-based Workflow for Phylogenetic Analysis of Drug Targets in Protozoan Genomes *

Kary A. C. S. Ocaña 1, Daniel de Oliveira 1, Eduardo Ogasawara 1,2, Alberto M. R. Dávila 3, Alexandre A. B. Lima 1, and Marta Mattoso 1

1 Computer Science, COPPE, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
2 Federal Center of Technological Education, Rio de Janeiro, Brazil
3 Laboratory of Computational and Systems Biology, IOC, FIOCRUZ, Rio de Janeiro, Brazil
{kary, danielc, ogasawara, assis, marta}@cos.ufrj.br, davila@fiocruz.br

Abstract. Bioinformatics experiments are evolving rapidly as genomic projects analyze ever larger amounts of data. This growth demands high-performance computing and motivates new approaches that provide better control and performance when running experiments, including those in Phylogeny/Phylogenomics. We designed a phylogenetic scientific workflow, named SciPhy, to construct phylogenetic trees from a set of drug target enzymes found in protozoan genomes. Our contribution is the development, implementation, and testing of SciPhy in public cloud computing environments. SciPhy can be used in other Bioinformatics experiments to control a systematic execution with high performance while producing provenance data.

Keywords: Phylogeny, Protozoa, Scientific Workflow, Cloud computing

1 Introduction

Protozoan species are microscopic unicellular eukaryotes, some of which are pathogenic and cause severe illnesses in humans and animals. Of the ten diseases defined as research priorities by the World Health Organization's Special Program for Research and Training in Tropical Diseases (http://www.who.int/tdr), four are caused by protozoan parasites (malaria, leishmaniasis, Chagas disease, and African trypanosomiasis). Genomics and proteomics have created a new paradigm for the drug, vaccine, and diagnostic discovery process.
Bioinformatics plays a key role in exploiting biological data to identify potential drug targets and to gain insights into the molecular mechanisms that underlie diseases [1]. One strategy used in genomics, which can be extrapolated to drug target discovery and identification, involves analyzing the function of sequences from a given organism. Here, the function of unknown sequences can often be accurately inferred by identifying homologous sequences that have been properly annotated. Given a query set of new sequences or genes, there are at least four approaches to infer similarity and occasionally homology: (i) local sequence comparisons (BLAST), (ii) motif/signature analysis, (iii) profile-based similarity (PSSM and HMM), and (iv)

* This work was partially sponsored by CAPES, FAPERJ and CNPq.