CD-HIT Workflow Execution on Grids using Replication Heuristics

J. L. Vázquez-Poletti, E. Huedo, R. S. Montero, I. M. Llorente
Departamento de Arquitectura de Computadores y Automática
Facultad de Informática, Universidad Complutense de Madrid
28040 Madrid, Spain

Abstract

Grid Computing has proven to be a solution for large workflow execution, especially in Bioinformatics. However, the nature of the Grid itself introduces overheads that in many cases make its use infeasible when wall-time is considered. Different heuristics, such as list scheduling, agglomeration and replication, are available for optimizing workflow execution. In particular, replication heuristics have previously been used in heterogeneous environments with good results. In this work, we analyze their use for workflow scheduling on Grid infrastructures. Specifically, we study their application to an in-tree workflow generated by the distribution of the CD-HIT application. The experiments were conducted on a testbed made of resources from two different grids, and the results show a significant reduction of the workflow execution time.

1 Introduction

Workflow management systems and Grid Computing are providing solutions to problems posed by Bioinformatics. Workflow management systems [22] allow the execution of complex applications that can be divided into tasks with data dependencies. Grid Computing, on the other hand, offers applications access to a large amount of computing resources.

* This research was supported by Consejería de Educación of Comunidad de Madrid, Fondo Europeo de Desarrollo Regional (FEDER) and Fondo Social Europeo (FSE), through BioGridNet Research Program S-0505/TIC/000101, and by Ministerio de Educación y Ciencia, through research grant TIN2006-02806.
Also, this work makes use of results produced by the Enabling Grids for E-sciencE project, a project co-funded by the European Commission (under contract number INFSO-RI-031688) through the Sixth Framework Programme. EGEE brings together 91 partners in 32 countries to provide a seamless Grid infrastructure available to the European research community 24 hours a day. Full information is available at http://www.eu-egee.org/.

In a previous paper [21], we considered a Bioinformatics application, CD-HIT (Cluster Database at High Identity with Tolerance) [11], for porting to the Grid. This application performs protein clustering, which consists of removing redundant sequences from a protein database in order to generate a database containing only the representatives. Protein clustering can be applied in many activities, such as protein family classification, domain analysis, organization of large protein databases or improving database search performance. However, the Grid version of CD-HIT did not provide good performance results, even though it served to bypass memory constraints and thus process large data sets. This happened because of the nature of the Grid (dynamism, heterogeneity and high fault rate).

Because this workflow, described together with the previous work in Section 2, needs optimization, we considered well-known heuristics that have produced good results in other heterogeneous computational infrastructures. These optimization strategies are described in Section 3. Section 4 focuses on the replication strategy for optimizing this workflow, which is then evaluated through experimental results in Section 5. Finally, some conclusions and future work are presented at the end of the paper.

2 The Application

The CD-HIT application was successfully ported to the Grid [21] using the GridWay metascheduler [9].
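For reference, the redundancy removal that CD-HIT performs can be illustrated with a minimal sketch. It assumes a greedy, longest-first strategy and a toy position-wise identity measure; the function names, the threshold and the sample sequences are hypothetical, and the actual sequence comparison used by CD-HIT is more elaborate and is not reproduced here.

```python
def identity(a: str, b: str) -> float:
    """Toy similarity (an assumption for illustration only): fraction of
    matching positions, relative to the length of the shorter sequence."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / shorter


def cluster_representatives(sequences, threshold=0.9):
    """Return one representative per cluster.

    Sequences are visited from longest to shortest. A sequence that is
    similar enough to an existing representative is treated as redundant;
    otherwise it becomes the representative of a new cluster.
    """
    representatives = []
    for seq in sorted(sequences, key=len, reverse=True):
        if not any(identity(seq, rep) >= threshold for rep in representatives):
            representatives.append(seq)
    return representatives


if __name__ == "__main__":
    # Hypothetical miniature "database" with one exact duplicate
    # and one truncated copy of the first sequence.
    database = ["MKTAYIAKQR", "MKTAYIAKQ", "GGGGSSSSLL", "MKTAYIAKQR"]
    print(cluster_representatives(database))
```

In this toy run the duplicate and the truncated copy are absorbed by the first sequence, so only two representatives remain. Each sequence is compared against the representatives found so far, which hints at why large databases are costly to process on a single machine.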
Other workflow management systems, such as the Directed Acyclic Graph Manager (DAGMan) [18] provided by Condor, and Pegasus [4], were also considered. The GridWay metascheduler has previously been used with good results in many research areas, including Bioinformatics. It natively handles DAG-based workflows and allows advanced flow structures such as loops or branches. GridWay offers an implementation of both C and Java bindings