BIOINFORMATICS Vol. 19 Suppl. 1 2003, pages i225–i231
DOI: 10.1093/bioinformatics/btg1031

The discovery net system for high throughput bioinformatics

Anthony Rowe 1,∗, Dimitrios Kalaitzopoulos 1,2, Michelle Osmond 1, Moustafa Ghanem 1 and Yike Guo 1

1 Department of Computing, Imperial College, 180 Queens Gate, London, SW7 2RH, UK and 2 The Wellcome Trust Sanger Institute, Hinxton, Cambs, CB10 1SA, UK

Received on January 6, 2003; accepted on February 20, 2003

ABSTRACT

Motivation: Bioinformatics requires Grid technologies and protocols that can be used to build high-performance applications without focusing on the low-level detail of how the individual Grid components operate.

Result: The Discovery Net system is a middleware that allows service developers to integrate tools based on existing and emerging Grid standards such as web services. Once integrated, these tools can be composed into reusable workflows that can later be deployed as new services for others to use. Using the Discovery Net system and a range of different bioinformatics tools, we built a Grid-based application for Genome Annotation. This includes workflows for automatic nucleotide annotation, annotation of predicted proteins, and analysis based on metabolic profiles and text analysis.

Contact: asr99@doc.ic.ac.uk

Keywords: grid, E-Science, annotation, workflow, pipeline.

INTRODUCTION

Current research into fundamental Grid technologies, such as Globus (Foster and Kesselman, 1997), has concentrated on the provision of protocols, services and tools for creating co-ordinated, transparent and secure globally accessible computational systems. These technologies follow a service methodology for finding both computation and data services for performing computationally intensive or data-intensive tasks. The delivery of the low-level infrastructure is essential but does not aid end users in the creation of applications that use all of the services the Grid has to offer.
The Discovery Net project (Curcin et al., 2002) has developed from the need for a higher-level layer of informatics middleware to allow scientists to create meaningful data analysis processes and then execute them using an underlying Grid infrastructure without being aware of the protocol used by individual services. The Discovery Net system builds on top of the fundamental Grid technologies to provide a bridge between the end user of a Grid service and the developers of individual Grid tools.

∗ To whom correspondence should be addressed.

Using the various tools produced as part of Discovery Net, generating a reusable Grid application becomes the task of selecting the required components and services and connecting them into a process. This approach is based around an XML-based language, the Discovery Process Mark-up Language (DPML) (Syed et al., 2002). A process created in DPML is reusable and can then be encapsulated and shared as a new service on the Grid for other scientists.

As part of the Discovery Net project we have developed a number of case studies based on bioinformatics applications, including genome and protein annotation. These applications have been developed to provide working models of how Grid applications can be used and to develop an understanding of the requirements of distributed heterogeneous scientific applications working in Grid environments.

DISCOVERY NET ARCHITECTURE

The aim of the Discovery Net project is to provide middleware technology to allow users to create knowledge discovery applications that use Grid-based resources. In effect, the project aims to provide a bridge between a scientific community performing analysis, such as the molecular biology community, and the high-performance computing community who create the underlying Grid services.
The methodology behind Discovery Net is the development of data flow processes, or pipelines, that represent the transformation of data from one form to another using a variety of different services. The Discovery Net system is designed primarily to support analysis of scientific data based on a workflow or pipeline methodology. In this framework, services can be treated as black boxes with known input and output interfaces. Services are then connected together into a sequence of operations. Typical services include database access, clustering, homology searching and notification.

Bioinformatics 19(Suppl. 1) © Oxford University Press 2003; all rights reserved.
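
The pipeline methodology described above, in which black-box services with known input and output interfaces are connected into a sequence and the result is shared as a new service, can be illustrated with a minimal Python sketch. This is not DPML and not the Discovery Net API: the names `Service`, `compose`, `fetch_sequence` and `gc_content`, the interface labels, and the toy sequences are all hypothetical, and real Grid services are replaced here by local functions.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class Service:
    """A black-box step: only its name and input/output kinds are visible."""
    name: str
    input_kind: str
    output_kind: str
    run: Callable[[Any], Any]

    def __call__(self, data: Any) -> Any:
        return self.run(data)


def compose(name: str, steps: List[Service]) -> Service:
    """Connect services into a sequence and wrap the result as a new Service."""
    if not steps:
        raise ValueError("a pipeline needs at least one service")
    # Each output interface must match the input interface of the next step.
    for upstream, downstream in zip(steps, steps[1:]):
        if upstream.output_kind != downstream.input_kind:
            raise TypeError(
                f"{upstream.name} produces {upstream.output_kind!r} but "
                f"{downstream.name} expects {downstream.input_kind!r}"
            )

    def run_all(data: Any) -> Any:
        # Feed the output of each step into the next one, in order.
        for step in steps:
            data = step(data)
        return data

    # The composed pipeline exposes the first step's input and the last
    # step's output, so it can itself be used as a step elsewhere.
    return Service(name, steps[0].input_kind, steps[-1].output_kind, run_all)


# Hypothetical stand-ins for remote services (toy data, local functions).
fetch_sequence = Service(
    "fetch_sequence", "accession", "dna",
    lambda acc: {"X1": "ATGGCC", "X2": "ATGTTT"}[acc],
)
gc_content = Service(
    "gc_content", "dna", "fraction",
    lambda seq: sum(base in "GC" for base in seq) / len(seq),
)

annotate = compose("gc_annotation", [fetch_sequence, gc_content])
print(annotate.name, annotate("X1"))
```

Because `compose` returns another `Service`, a finished pipeline can be reused as a single step in a larger pipeline, which mirrors the idea of encapsulating a process and sharing it as a new service. Checking the interface labels at composition time is a design choice of this sketch, not something the paper specifies.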