Future Generation Computer Systems 27 (2011) 775–780
Provenance management in Swift
Luiz M.R. Gadelha Jr. a,b,*, Ben Clifford c, Marta Mattoso a, Michael Wilde c,d, Ian Foster c,d
a Computer and Systems Engineering Program, Federal University of Rio de Janeiro, Brazil
b National Laboratory for Scientific Computing, Brazil
c Computation Institute, University of Chicago, USA
d Mathematics and Computer Science Division, Argonne National Laboratory, USA
article info
Article history:
Received 31 December 2009
Received in revised form 5 May 2010
Accepted 7 May 2010
Available online 20 May 2010
Keywords:
Provenance
Parallel scripting languages
Scientific workflows
abstract
The Swift parallel scripting language allows for the specification, execution and analysis of large-scale
computations in parallel and distributed environments. It incorporates a data model for recording
and querying provenance information. In this article we describe these capabilities and evaluate the
interoperability with other systems through the use of the Open Provenance Model. We describe
Swift’s provenance data model and compare it to the Open Provenance Model. We also describe and
evaluate activities performed within the Third Provenance Challenge, which consisted of implementing a
specific scientific workflow, capturing and recording provenance information of its execution, performing
provenance queries, and exchanging provenance information with other systems. Finally, we propose
improvements to both the Open Provenance Model and Swift’s provenance system.
© 2011 Published by Elsevier B.V.
1. Introduction
The automation of large-scale computational scientific experiments
can be accomplished through the use of workflow management
systems [1], parallel scripting tools [2], and related systems
that allow the definition of the activities, input and output data,
and data dependencies of such experiments. Manual analysis
of the data resulting from their execution is usually not feasible,
due to the large amount of information commonly generated by
these experiments. Provenance systems can be used to facilitate
this task, since they gather details about the design [3] and
execution of these experiments, such as the data artifacts consumed and
produced by their activities. They also make it easier to reproduce
an experiment for the purpose of verification.
* Corresponding author at: Computer and Systems Engineering Program, Federal
University of Rio de Janeiro, Brazil.
E-mail addresses: lgadelha@lncc.br (L.M.R. Gadelha Jr.), benc@hawaga.org.uk
(B. Clifford), marta@cos.ufrj.br (M. Mattoso), wilde@mcs.anl.gov (M. Wilde),
foster@mcs.anl.gov (I. Foster).
The Open Provenance Model (OPM) [4] is an ongoing effort
to standardize the representation of provenance information. It
defines the entities artifact, process, and agent and the relationships
used (between an artifact and a process), wasGeneratedBy (between
a process and an artifact), wasControlledBy (between an agent
and a process), wasTriggeredBy (between two processes), and
wasDerivedFrom (between two artifacts). These relationships are
used to assert causal dependencies between the entities defined
in the model. A set of these assertions can be used to build a
provenance graph. One of the main objectives of OPM is to allow
the exchange of provenance information between systems. It also
describes valid inferences that can be made from provenance
graphs. More complex relationships between processes and
artifacts can be derived using, for instance, transitivity.
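The OPM relationships and the transitivity inference described above can be illustrated with a minimal sketch. It is not OPM's official serialization or Swift's provenance schema: the ProvenanceGraph class, its method names, and the file and process names are all hypothetical, chosen only for illustration. Each assertion is stored as an (effect, relation, cause) triple, following OPM's convention that edges point from an effect to its cause.

```python
from collections import defaultdict

# The five OPM relationships named in the text.
RELATIONS = {
    "used",             # process  -> artifact it consumed
    "wasGeneratedBy",   # artifact -> process that produced it
    "wasControlledBy",  # process  -> agent that controlled it
    "wasTriggeredBy",   # process  -> process that triggered it
    "wasDerivedFrom",   # artifact -> artifact it was derived from
}


class ProvenanceGraph:
    """Hypothetical in-memory store of OPM-style causal assertions."""

    def __init__(self):
        # Maps (relation, effect) to the set of asserted causes.
        self.edges = defaultdict(set)

    def assert_edge(self, effect, relation, cause):
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.edges[(relation, effect)].add(cause)

    def causes(self, effect, relation):
        # .get avoids creating empty entries in the defaultdict.
        return set(self.edges.get((relation, effect), ()))

    def derived_from_closure(self, artifact):
        """Artifacts reachable through one or more wasDerivedFrom edges."""
        seen = set()
        stack = [artifact]
        while stack:
            node = stack.pop()
            for cause in self.causes(node, "wasDerivedFrom"):
                if cause not in seen:
                    seen.add(cause)
                    stack.append(cause)
        return seen


# Hypothetical two-step derivation: raw.img -> aligned.img -> atlas.img
graph = ProvenanceGraph()
graph.assert_edge("align", "used", "raw.img")
graph.assert_edge("aligned.img", "wasGeneratedBy", "align")
graph.assert_edge("aligned.img", "wasDerivedFrom", "raw.img")
graph.assert_edge("atlas.img", "wasDerivedFrom", "aligned.img")

ancestors = graph.derived_from_closure("atlas.img")
```

Here the dependency of atlas.img on raw.img is never asserted directly; it is inferred by following wasDerivedFrom edges transitively, which is the kind of multi-step relationship the text refers to.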
The Swift parallel scripting system [5,2] is a successor of
the Virtual Data System (VDS) [6–8]. It allows the specification,
management and execution of large-scale scientific workflows in
parallel and distributed environments. The SwiftScript language is
used for high-level specification of computations; it has features
such as data types, data mappers, dataset iteration, conditional
branching, and procedural composition. It allows the manipulation
of datasets in terms of their logical organization. The XML Dataset
Typing and Mapping (XDTM) [9] notation is used to define mappers
between this logical organization and the actual physical structure
of the dataset. Procedures perform logical operations on input data,
without modifying them. SwiftScript also allows procedures to be
composed to define more complex computations. By analyzing the
inputs and outputs of these procedures, the system determines
data dependencies between them. This information is used to
execute procedures that have no mutual data dependencies in
parallel. Swift supports common execution managers for clustered
systems and grid environments, such as Condor [10], GRAM [11],
and PBS [12]. It also supports Falkon [13], an execution engine
that provides high job execution throughput, and SSH [14], for
executing jobs via secure remote logins. Swift logs a variety of
information about each computation. This information can be
exported using tools included in Swift to a relational database
0167-739X/$ – see front matter © 2011 Published by Elsevier B.V.
doi:10.1016/j.future.2010.05.003