Proceedings of the 2008 IEEE, CIBEC'08 978-1-4244-2695-9/08/$25.00 ©2008 IEEE
Scientific workflow systems - can one size fit all?
V. Curcin, M. Ghanem
Department of Computing, Imperial College London, 180 Queen’s Gate, London SW7 2AZ
Email: vc100@doc.ic.ac.uk, mmg@doc.ic.ac.uk
Abstract—The past decade has witnessed a growing trend in designing and using workflow systems with a focus on supporting the scientific research process in bioinformatics and other areas of life sciences. The aim of these systems is mainly to simplify access, control and orchestration of remote distributed scientific data sets using remote computational resources, such as EBI web services. In this paper we present the state of the art in the field by reviewing six such systems: Discovery Net, Taverna, Triana, Kepler, YAWL and BPEL. We provide a high-level framework for comparing the systems based on their control flow and data flow properties, with a view to both informing future research in the area by academic researchers and helping practitioners select the most appropriate system for a specific application task.
I. INTRODUCTION
Fig. 1. Workflow example [diagram; node labels include Web source, Staging, Integration, Pre-processing (data-parallel), Training set, Test set, Modelling, Validation and Storage]
Informally, a workflow (Figure 1) is an abstract description of the steps required for executing a particular real-world process, and of the flow of information between them. Each step is defined by a set of activities that need to be conducted. Within a workflow, work (e.g. data or jobs) passes through the different steps in the specified order from start to finish, and the activities at each step are executed either by people or by system functions (e.g. computer programs). Workflows are typically authored using a visual front-end or are hard-coded, and their execution is delegated to a workflow execution engine that handles the invocation of the remote applications.
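As an illustration only, the informal definition above can be sketched in a few lines of Python. The sketch is not taken from any of the systems reviewed in this paper: the step names (`load`, `normalise`, `summarise`), the `WORKFLOW` table and the `run` function are all hypothetical, and the "engine" simply calls local functions where a real engine would invoke remote applications.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical steps. Each receives a dict holding the outputs of the
# steps it depends on, and returns its own output.
def load(inputs):
    return [3.0, 1.0, 2.0]

def normalise(inputs):
    data = inputs["load"]
    peak = max(data)
    return [x / peak for x in data]

def summarise(inputs):
    data = inputs["normalise"]
    return sum(data) / len(data)

# The workflow description: step name -> (function, upstream step names).
WORKFLOW = {
    "load": (load, []),
    "normalise": (normalise, ["load"]),
    "summarise": (summarise, ["normalise"]),
}

def run(workflow):
    """Toy execution engine: run each step after all of its predecessors."""
    graph = {name: set(deps) for name, (_, deps) in workflow.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = workflow[name]
        # Pass each step only the outputs of the steps it depends on.
        results[name] = func({d: results[d] for d in deps})
    return results

results = run(WORKFLOW)
```

The separation between the `WORKFLOW` description and the `run` function mirrors the split between authoring a workflow and delegating its execution to an engine; everything else in the sketch is an assumption made for brevity.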
Traditionally, workflow systems are split into two broad families, one for control orchestration of business processes and the other for functional-style computation over data. However, the requirements of numerous applications do not fit neatly into either of those categories. This was a rationale for the evolution of scientific workflow systems, which act as middleware in the scientific research process and typically have properties of both control and data workflows. Their function is to abstract over computational and data resources and to enable collaboration between researchers, a task which requires both aspects.
The question we are interested in is whether any single workflow system (scientific or non-scientific) can be relied on to cover the scope of requirements from different domains. This paper approaches the problem by analysing leading scientific and non-scientific workflow systems, exposing their handling of control and data constructs, with a view to informing future research and also facilitating the selection of the most appropriate system for a specific application task.
To start, the Discovery Net [1] system is presented to illustrate the architectural and implementation complexity associated with a full workflow system. Then, three other main scientific workflow systems, Taverna [2], Triana [3] and Kepler [4], are described, followed by two workflow languages that aim to be a generic solution across both business and scientific domains. The first of these, YAWL [5], is a theoretical workflow system based on the Petri Net paradigm that has been designed to satisfy the full set of workflow patterns, under the assumption that this will satisfy the needs of both communities. The second, BPEL [6], is the accepted standard for business process orchestration, with several attempts being made to adapt it for use in scientific settings, most notably by the OMII initiative [7].
II. DISCOVERY NET
The Discovery Net system has been designed around a scientific workflow model for integrating distributed data sources and analytical tools within a grid computing framework. The system was originally developed as part of the UK e-Science funded project Discovery Net (2001-2005) [8] with the aim of producing a high-level application-oriented platform, focused on enabling end-user scientists to derive new knowledge from devices, sensors, databases, analysis components and computational resources that reside across the Internet or grid. Its dedicated set of components for data mining has been used as a basis for numerous cross-domain projects. These include Life Sciences applications [9], [10], Environmental Monitoring [11] and Geo-hazard Modelling [12]. Many of the research ideas developed within the system have also been incorporated within the InforSense KDE system [13], a commercial workflow management and data mining system that has been widely used for business-oriented applications. A number of extensions have been based on the research outputs of the EU-funded SIMDAT [14] project.