Provenance Traces of the Swift Parallel Scripting System

Luiz M. R. Gadelha Jr. (1), Michael Wilde (2,3), Marta Mattoso (4), Ian Foster (2,3,5)

(1) National Laboratory for Scientific Computing, Brazil
(2) Mathematics and Computer Science Division, Argonne National Laboratory, USA
(3) Computation Institute, Argonne National Laboratory and University of Chicago, USA
(4) Computer and Systems Engineering Program, Federal University of Rio de Janeiro, Brazil
(5) Department of Computer Science, University of Chicago, USA

lgadelha@lncc.br, wilde@mcs.anl.gov, marta@cos.ufrj.br, foster@anl.gov

ABSTRACT

In this abstract, we describe provenance traces generated from executions of scientific workflows managed by the Swift parallel scripting system. The traces follow the provenance data model used by MTCProv, the provenance management component of Swift. This model is similar to PROV: it represents most of PROV's core concepts and includes additional information about the scientific domain, computational resource consumption, and prospective provenance. We describe provenance queries that follow patterns commonly found in high performance computing and that are straightforward to support with MTCProv's built-in procedures. These queries often involve costly relational join operations and recursion, providing a relevant case for benchmarking.

1. INTRODUCTION

The Swift parallel scripting system [7] allows for specifying, executing, and analyzing scientific workflows composed of many computational tasks. Its execution engine supports environments commonly found in parallel and distributed systems. It is highly scalable and has been used for running large-scale scientific computations [6]. Swift optionally produces provenance information in its log files, and this information has been exported to relational databases using a data model [1] similar to OPM [4] and PROV [5].
MTCProv [3] extends provenance management in Swift by recording prospective provenance, annotations, and runtime resource consumption. It has a query interface containing built-in procedures that abstract commonly used provenance queries [2], and it automatically computes relational join expressions.

2. SUMMARY OF SUBMISSION

We have executed and generated provenance traces for the following scientific workflows: examples and tutorials that come in the standard distribution of Swift; demonstration scientific workflows for processing satellite images from MODIS and for ray tracing with C-Ray; and a scientific workflow for parallelizing BLAST sequence alignments through database partitioning. Table 1 summarizes our provenance trace submission, which consists of SQL statements that create a relational database containing both prospective and retrospective provenance of the scientific workflows mentioned. We use our own format for representing provenance because it stores prospective provenance, which is not modeled by PROV.

Table 1: Summary of submission.

Data format                          SQL
Data model                           Relational (MTCProv [3])
Size                                 10.4MB
Tools used for generating provenance Swift [7], MTCProv [3]
Submission group                     Swift
Contact                              swift-user@ci.uchicago.edu
License                              Creative Commons Attribution-Share Alike 3.0 Unported License

Copyright is held by the author/owner(s). EDBT/ICDT'13, Mar 18-22 2013, Genoa, Italy. Copyright 2013 ACM 978-1-4503-1599-9/13/03 ...$10.00.

3. EXPERIENCE STATEMENT

The basic process of generating the provenance traces consisted of enabling a configuration option in Swift for provenance logging, which adds lines containing provenance information to the log file associated with the execution of a scientific workflow. These log files were processed a posteriori by MTCProv to extract relevant provenance information and insert it into the provenance database. A dump of the provenance database was generated for submission.

4. APPLICATION

The provenance traces submitted can support the evaluation of provenance management systems for parallel and distributed scientific workflows. They capture core provenance information, such as prov:used and prov:wasGeneratedBy relationships, and additional information that scientists are usually interested in querying, such as science domain annotations and computational resource consumption by workflow activities. Other aspects of parallelism and distribution are also captured, such as an activity invocation possibly having many execution attempts due to failures, and the redundant submission of computational tasks to improve execution throughput. Some queries involve many relational join operations and recursion, providing a relevant case for benchmarking provenance management systems.
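
To illustrate why such queries combine joins with recursion, the following is a minimal sketch of a lineage query over a relational encoding of prov:used and prov:wasGeneratedBy. The two-table schema (`used`, `generated_by`), the toy workflow, and the function names are assumptions made for this example only; they are not the MTCProv schema or its built-in procedures. The sketch uses Python's standard `sqlite3` module and a recursive common table expression.

```python
import sqlite3


def build_toy_provenance(conn):
    """Create a tiny, hypothetical provenance database (NOT the MTCProv schema)."""
    conn.executescript("""
        CREATE TABLE used (process_id TEXT, dataset_id TEXT);
        CREATE TABLE generated_by (dataset_id TEXT, process_id TEXT);
    """)
    # Hypothetical workflow: raw -> split -> parts -> align -> hits -> merge -> result
    conn.executemany("INSERT INTO used VALUES (?, ?)", [
        ("split", "raw"),
        ("align1", "part1"),
        ("align2", "part2"),
        ("merge", "hits1"),
        ("merge", "hits2"),
    ])
    conn.executemany("INSERT INTO generated_by VALUES (?, ?)", [
        ("part1", "split"),
        ("part2", "split"),
        ("hits1", "align1"),
        ("hits2", "align2"),
        ("result", "merge"),
    ])


def ancestor_datasets(conn, dataset_id):
    """Return every dataset that the given dataset transitively derives from."""
    rows = conn.execute("""
        WITH RECURSIVE ancestors(dataset_id) AS (
            -- Base case: datasets used by the process that generated the target.
            SELECT u.dataset_id
            FROM generated_by g
            JOIN used u ON u.process_id = g.process_id
            WHERE g.dataset_id = ?
            UNION
            -- Recursive step: apply the same join one derivation level further back.
            SELECT u.dataset_id
            FROM ancestors a
            JOIN generated_by g ON g.dataset_id = a.dataset_id
            JOIN used u ON u.process_id = g.process_id
        )
        SELECT dataset_id FROM ancestors ORDER BY dataset_id
    """, (dataset_id,)).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
build_toy_provenance(conn)
lineage = ancestor_datasets(conn, "result")
print(lineage)
```

Each level of derivation costs one join between the two relations, so deep workflows translate into long chains of joins. This is the kind of cost a benchmark over the submitted traces could measure, although a real evaluation would run against the actual database dump and not this toy schema.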