Future Generation Computer Systems 27 (2011) 775–780

Provenance management in Swift

Luiz M.R. Gadelha Jr. a,b,*, Ben Clifford c, Marta Mattoso a, Michael Wilde c,d, Ian Foster c,d

a Computer and Systems Engineering Program, Federal University of Rio de Janeiro, Brazil
b National Laboratory for Scientific Computing, Brazil
c Computation Institute, University of Chicago, USA
d Mathematics and Computer Science Division, Argonne National Laboratory, USA

Article info

Article history: Received 31 December 2009; received in revised form 5 May 2010; accepted 7 May 2010; available online 20 May 2010

Keywords: Provenance; Parallel scripting languages; Scientific workflows

Abstract

The Swift parallel scripting language allows for the specification, execution and analysis of large-scale computations in parallel and distributed environments. It incorporates a data model for recording and querying provenance information. In this article we describe these capabilities and evaluate the interoperability with other systems through the use of the Open Provenance Model. We describe Swift’s provenance data model and compare it to the Open Provenance Model. We also describe and evaluate activities performed within the Third Provenance Challenge, which consisted of implementing a specific scientific workflow, capturing and recording provenance information of its execution, performing provenance queries, and exchanging provenance information with other systems. Finally, we propose improvements to both the Open Provenance Model and Swift’s provenance system.

© 2011 Published by Elsevier B.V.

1.
Introduction

The automation of large-scale computational scientific experiments can be accomplished through the use of workflow management systems [1], parallel scripting tools [2], and related systems that allow the definition of the activities, input and output data, and data dependencies of such experiments. The manual analysis of the data resulting from their execution is usually not feasible, due to the large amount of information commonly generated by these experiments. Provenance systems can be used to facilitate this task, since they gather details about the design [3] and execution of these experiments, such as data artifacts consumed and produced by their activities. They also make it easier to reproduce an experiment for the purpose of verification.

* Corresponding author at: Computer and Systems Engineering Program, Federal University of Rio de Janeiro, Brazil. E-mail addresses: lgadelha@lncc.br (L.M.R. Gadelha Jr.), benc@hawaga.org.uk (B. Clifford), marta@cos.ufrj.br (M. Mattoso), wilde@mcs.anl.gov (M. Wilde), foster@mcs.anl.gov (I. Foster).

The Open Provenance Model (OPM) [4] is an ongoing effort to standardize the representation of provenance information. It defines the entities artifact, process, and agent and the relationships used (between an artifact and a process), wasGeneratedBy (between a process and an artifact), wasControlledBy (between an agent and a process), wasTriggeredBy (between two processes), and wasDerivedFrom (between two artifacts). These relationships are used to assert causal dependencies between the entities defined in the model. A set of these assertions can be used to build a provenance graph. One of the main objectives of OPM is to allow the exchange of provenance information between systems. It also describes valid inferences that can be made from provenance graphs. More complex relationships between processes and artifacts can be derived using, for instance, transitivity.
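As an illustration of how such assertions form a graph and how transitivity yields further relationships, the following is a minimal Python sketch. It is not part of OPM or Swift: the `ProvenanceGraph` class, its method names, and the node names (`raw_image`, `align`, `atlas`, `alice`) are hypothetical. It assumes that each edge points from an effect to its cause, and it applies transitivity only to wasDerivedFrom edges.

```python
from collections import defaultdict

# The five relationships named in the text. Assumption: each edge points
# from an effect to its cause.
EDGE_TYPES = {
    "used",             # process  -> artifact
    "wasGeneratedBy",   # artifact -> process
    "wasControlledBy",  # process  -> agent
    "wasTriggeredBy",   # process  -> process
    "wasDerivedFrom",   # artifact -> artifact
}


class ProvenanceGraph:
    """Toy in-memory store of OPM-style assertions (hypothetical helper)."""

    def __init__(self):
        # edges[edge_type][effect] is the set of direct causes of `effect`.
        self.edges = {t: defaultdict(set) for t in EDGE_TYPES}

    def assert_edge(self, edge_type, effect, cause):
        """Record one causal dependency between two named nodes."""
        if edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type}")
        self.edges[edge_type][effect].add(cause)

    def derived_from_transitive(self, artifact):
        """Return all artifacts reachable via one or more wasDerivedFrom edges."""
        direct = self.edges["wasDerivedFrom"]
        seen = set()
        stack = list(direct.get(artifact, ()))
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            stack.extend(direct.get(current, ()))
        return seen


# Hypothetical example: an `align` process, controlled by agent `alice`,
# turns `raw_image` into `aligned_image`, from which `atlas` is derived.
graph = ProvenanceGraph()
graph.assert_edge("used", "align", "raw_image")
graph.assert_edge("wasGeneratedBy", "aligned_image", "align")
graph.assert_edge("wasControlledBy", "align", "alice")
graph.assert_edge("wasDerivedFrom", "aligned_image", "raw_image")
graph.assert_edge("wasDerivedFrom", "atlas", "aligned_image")

ancestors = graph.derived_from_transitive("atlas")
print(sorted(ancestors))
```

Here `atlas` is directly asserted to derive only from `aligned_image`; the dependency on `raw_image` is obtained by following the chain of wasDerivedFrom edges, which is the kind of derived relationship the text refers to.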
The Swift parallel scripting system [5,2] is a successor of the Virtual Data System (VDS) [6–8]. It allows the specification, management and execution of large-scale scientific workflows in parallel and distributed environments. The SwiftScript language is used for high-level specification of computations; it has features such as data types, data mappers, dataset iteration, conditional branching, and procedural composition. It allows the manipulation of datasets in terms of their logical organization. The XML Dataset Typing and Mapping (XDTM) [9] notation is used to define mappers between this logical organization and the actual physical structure of the dataset. Procedures perform logical operations on input data, without modifying them. SwiftScript also allows procedures to be composed to define more complex computations. By analyzing the inputs and outputs of these procedures, the system determines data dependencies between them. This information is used to execute in parallel those procedures that have no mutual data dependencies. Swift supports common execution managers for clustered systems and grid environments, such as Condor [10], GRAM [11], and PBS [12]. It also supports Falkon [13], an execution engine that provides high job execution throughput, and SSH [14], for executing jobs via secure remote logins. Swift logs a variety of information about each computation. This information can be exported, using tools included in Swift, to a relational database

0167-739X/$ – see front matter © 2011 Published by Elsevier B.V. doi:10.1016/j.future.2010.05.003