Proceedings of the 2008 IEEE, CIBEC'08 978-1-4244-2695-9/08/$25.00 ©2008 IEEE
Scientific workflow systems - can one size fit all?
V. Curcin, M. Ghanem
Department of Computing
Imperial College London
180 Queen’s Gate, London SW7 2AZ
Email: vc100@doc.ic.ac.uk, mmg@doc.ic.ac.uk
Abstract—The past decade has witnessed a growing trend in
designing and using workflow systems with a focus on supporting
the scientific research process in bioinformatics and other areas
of life sciences. The aim of these systems is mainly to simplify
access, control and orchestration of remote distributed scientific
data sets using remote computational resources, such as EBI web
services. In this paper we present the state of the art in the field
by reviewing six such systems: Discovery Net, Taverna, Triana,
Kepler, YAWL and BPEL.
We provide a high-level framework for comparing the systems
based on their control flow and data flow properties with a
view of both informing future research in the area by academic
researchers and facilitating the selection of the most appropriate
system for a specific application task by practitioners.
I. INTRODUCTION
[Figure: workflow diagram with steps labelled Web source, RDBMS, Staging, Integration, Pre-processing (data-parallel), Training set, Test set, Modelling, Validation and Storage]
Fig. 1. Workflow example
Informally, a workflow (Figure 1) is an abstract description
of steps required for executing a particular real-world process,
and the flow of information between them. Each step is
defined by a set of activities that need to be conducted.
Within a workflow, work (e.g. data or jobs) passes through
the different steps in the specified order from start to finish,
and the activities at each step are executed either by people
or by system functions (e.g. computer programs). Workflows
are typically authored using a visual front-end or are hard-coded,
and their execution is delegated to a workflow execution
engine that handles the invocation of the remote applications.
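As a minimal sketch of this definition, assuming a workflow can be modelled as a directed acyclic graph of steps, the following Python fragment shows a toy execution engine that runs each step once its upstream steps have finished. The step names (load, clean, model, store) and the run_workflow helper are hypothetical and are not taken from any of the systems reviewed here; real engines additionally handle remote invocation, failures and scheduling.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(steps, depends_on, initial):
    """Run every step after all of its upstream steps have completed.

    steps:      maps a step name to a callable (hypothetical local functions
                standing in for remote applications)
    depends_on: maps a step name to the list of steps it consumes output from
    initial:    the input handed to steps that have no upstream step
    """
    results = {}
    # static_order() yields each step only after all of its predecessors.
    order = list(TopologicalSorter(depends_on).static_order())
    for name in order:
        upstream = depends_on.get(name, [])
        inputs = [results[u] for u in upstream] if upstream else [initial]
        results[name] = steps[name](*inputs)
    return results, order


# Hypothetical four-step workflow: information flows along the dependencies.
steps = {
    "load": lambda raw: [int(x) for x in raw.split(",")],
    "clean": lambda rows: [x for x in rows if x >= 0],
    "model": lambda rows: sum(rows) / len(rows),
    "store": lambda rows, mean: {"rows": rows, "mean": mean},
}
depends_on = {
    "load": [],
    "clean": ["load"],
    "model": ["clean"],
    "store": ["clean", "model"],
}

results, order = run_workflow(steps, depends_on, "4,-1,6,2")
print(order)
print(results["store"])
```

The separation between the workflow description (steps and depends_on) and the engine (run_workflow) mirrors the split between authoring and execution described above.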
Traditionally, workflow systems are split into two broad
families, one for control orchestration of business processes
and the other for functional style computation of data. How-
ever, the requirements of numerous applications do not fit
neatly into either of those categories. This was a rationale
for the evolution of scientific workflow systems, which act as
middleware in the scientific research process and typically have
properties of both control and data workflows. Their function
is to abstract over computational and data resources and enable
collaboration between researchers, a task which requires both
aspects. The question we are interested in is whether any single
workflow system (scientific or non-scientific) can be relied on
to cover the scope of requirements from different domains.
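The distinction between the two families can be illustrated with a small Python sketch. This is only an illustration under simplifying assumptions: the function names, the threshold of 0.5 and the approval rule are all hypothetical, and neither fragment reproduces the semantics of any specific system discussed in this paper.

```python
# Data-flow style: side-effect-free functions composed so that the output of
# one step is the input of the next; ordering follows from data dependencies.
def normalise(values):
    top = max(values)
    return [v / top for v in values]


def threshold(values, cutoff=0.5):
    return [v for v in values if v >= cutoff]


def dataflow_pipeline(values):
    return threshold(normalise(values))


# Control-flow style: tasks are ordered by explicit transitions, including a
# conditional branch, and the interest is in which tasks run and when.
def control_flow_process(order):
    trace = ["receive"]
    if order["amount"] > 100:  # hypothetical business rule
        trace.append("manual_approval")
    else:
        trace.append("auto_approval")
    trace.append("notify")
    return trace


print(dataflow_pipeline([1, 2, 4]))
print(control_flow_process({"amount": 150}))
```

A scientific workflow often needs both: a pipeline like the first fragment embedded in a process that branches, loops or waits as in the second.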
This paper approaches the problem by analysing leading
scientific and non-scientific workflow systems, exposing their
handling of control and data constructs, with the view of
informing future research and also facilitating the selection
of the most appropriate system for a specific application task.
As a start, the Discovery Net [1] system will be presented
to illustrate the architectural and implementation complexity
associated with a full workflow system. Then, three other main
scientific workflow systems, Taverna [2], Triana [3] and Kepler
[4], will be described, followed by two workflow languages
aiming to be a generic solution across both business and
scientific domains. The first of those, YAWL [5], is a theoretical
workflow system based on the Petri Net paradigm that has
been designed to satisfy the full set of workflow patterns,
under the assumption that this will satisfy the needs of both
communities. The second, BPEL [6], is the accepted standard for
business process orchestration, with several attempts being
made to adapt it for use in scientific settings, most notably
by the OMII initiative [7].
II. DISCOVERY NET
The Discovery Net system has been designed around a sci-
entific workflow model for integrating distributed data sources
and analytical tools within a grid computing framework. The
system was originally developed as part of the UK e-Science
funded project Discovery Net (2001-2005) [8] with the aim of
producing a high-level application-oriented platform, focused
on enabling end-user scientists to derive new knowledge
from devices, sensors, databases, analysis components and
computational resources that reside across the Internet or grid.
Its dedicated set of components for data mining has been
used as a basis for numerous cross-domain projects. These
include Life Sciences applications [9], [10], Environmental
Monitoring [11] and Geo-hazard Modelling [12]. Many of
the research ideas developed within the system have also
been incorporated within the InforSense KDE system [13],
a commercial workflow management and data mining system
that has been widely used for business-oriented applications. A
number of extensions have been based on the research outputs
of the EU-funded SIMDAT [14] project.