Proceedings of the 2008 IEEE, CIBEC'08
978-1-4244-2695-9/08/$25.00 ©2008 IEEE

Scientific workflow systems - can one size fit all?

V. Curcin, M. Ghanem
Department of Computing
Imperial College London
180 Queen’s Gate, London SW7 2AZ
Email: vc100@doc.ic.ac.uk, mmg@doc.ic.ac.uk

Abstract: The past decade has witnessed a growing trend in designing and using workflow systems with a focus on supporting the scientific research process in bioinformatics and other areas of life sciences. The aim of these systems is mainly to simplify access, control and orchestration of remote distributed scientific data sets using remote computational resources, such as EBI web services. In this paper we present the state of the art in the field by reviewing six such systems: Discovery Net, Taverna, Triana, Kepler, YAWL and BPEL. We provide a high-level framework for comparing the systems based on their control flow and data flow properties, with a view to both informing future research in the area by academic researchers and facilitating the selection of the most appropriate system for a specific application task by practitioners.

I. INTRODUCTION

[Figure 1: workflow diagram with nodes labelled Web source, RDBMS, Staging, Integration, Pre-processing (Data-parallel), Training set, Test set, Modelling, Validation and Storage]
Fig. 1. Workflow example

Informally, a workflow (Figure 1) is an abstract description of the steps required for executing a particular real-world process, and the flow of information between them. Each step is defined by a set of activities that need to be conducted. Within a workflow, work (e.g. data or jobs) passes through the different steps in the specified order from start to finish, and the activities at each step are executed either by people or by system functions (e.g. computer programs). Workflows are typically authored using a visual front-end or hard-coded, and their execution is delegated to a workflow execution engine that handles the invocation of the remote applications.
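As an illustration of the informal definition above, the following minimal Python sketch treats a workflow as a set of named steps plus the dependencies between them, and a toy execution engine that runs each step once its predecessors have produced their outputs. It is not taken from any of the systems reviewed here: the function run_workflow, the step names and the sample data are all hypothetical, steps are plain local functions rather than remote services, and ordering relies on graphlib from the Python standard library (version 3.9 or later).

```python
from graphlib import TopologicalSorter


def run_workflow(steps, depends_on, inputs):
    """Run every step after its predecessors, passing their outputs along.

    steps      -- maps a step name to a callable (hypothetical local stand-ins
                  for the remote applications an engine would invoke)
    depends_on -- maps a step name to the ordered list of steps it consumes
    inputs     -- externally supplied data, keyed by the name steps refer to
    """
    results = dict(inputs)
    # static_order() yields each node only after all of its predecessors.
    for name in TopologicalSorter(depends_on).static_order():
        if name in results:  # externally supplied input, nothing to execute
            continue
        args = [results[dep] for dep in depends_on.get(name, [])]
        results[name] = steps[name](*args)
    return results


# Illustrative steps only; the names are assumptions made for this sketch.
steps = {
    "staging": lambda raw: [r.strip() for r in raw],
    "preprocessing": lambda rows: [r.lower() for r in rows if r],
    "modelling": lambda rows: {"n_records": len(rows)},
    "storage": lambda rows, model: {"rows": rows, "model": model},
}
depends_on = {
    "staging": ["source"],
    "preprocessing": ["staging"],
    "modelling": ["preprocessing"],
    "storage": ["preprocessing", "modelling"],
}

results = run_workflow(steps, depends_on, {"source": [" A ", "", "b"]})
print(results["storage"])
```

The dependency map plays the role of the arcs in a workflow diagram, while the loop stands in for the execution engine; a real engine would additionally handle remote invocation, failures and concurrency, which this sketch deliberately omits.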
Traditionally, workflow systems are split into two broad families, one for control orchestration of business processes and the other for functional-style computation of data. However, the requirements of numerous applications do not fit neatly into either of those categories. This was a rationale for the evolution of scientific workflow systems, which act as middleware in the scientific research process and typically have properties of both control and data workflows. Their function is to abstract over computational and data resources and enable collaboration between researchers, a task which requires both aspects.

The question we are interested in is whether any single workflow system (scientific or non-scientific) can be relied on to cover the scope of requirements from different domains. This paper approaches the problem by analysing leading scientific and non-scientific workflow systems, exposing their handling of control and data constructs, with a view to informing future research and also facilitating the selection of the most appropriate system for a specific application task.

To start, the Discovery Net [1] system is presented to illustrate the architectural and implementation complexity associated with a full workflow system. Then, three other main scientific workflow systems, Taverna [2], Triana [3] and Kepler [4], are described, followed by two workflow languages aiming to be a generic solution across both business and scientific domains. The first of those, YAWL [5], is a theoretical workflow system based on the Petri Net paradigm that has been designed to satisfy the full set of workflow patterns, under the assumption that this will satisfy the needs of both communities. The second, BPEL [6], is the accepted standard for business process orchestration, with several attempts being made to adapt it for use in scientific settings, most notably by the OMII initiative [7].

II. DISCOVERY NET

The Discovery Net system has been designed around a scientific workflow model for integrating distributed data sources and analytical tools within a grid computing framework. The system was originally developed as part of the UK e-Science funded project Discovery Net (2001-2005) [8] with the aim of producing a high-level application-oriented platform, focused on enabling end-user scientists to derive new knowledge from devices, sensors, databases, analysis components and computational resources that reside across the Internet or grid.

Its dedicated set of components for data mining has been used as a basis for numerous cross-domain projects. These include Life Sciences applications [9], [10], Environmental Monitoring [11] and Geo-hazard Modelling [12]. Many of the research ideas developed within the system have also been incorporated within the InforSense KDE system [13], a commercial workflow management and data mining system that has been widely used for business-oriented applications. A number of extensions have been based on the research outputs of the EU-funded SIMDAT [14] project.