Scientific Workflow Provenance Metadata Management Using an RDBMS-based RDF Store

(Technical Report TR-DB-092007-CFLF, September 2007)

Artem Chebotko, Xubo Fei, Shiyong Lu, and Farshad Fotouhi

Department of Computer Science
Wayne State University
5143 Cass Avenue, Detroit, Michigan 48202, USA
{artem, xubo, shiyong, fotouhi}@wayne.edu

Abstract. Provenance management has become increasingly important to support scientific discovery reproducibility, result interpretation, and problem diagnosis in scientific workflow environments. This paper proposes an approach to provenance management that seamlessly integrates the interoperability, extensibility, and reasoning advantages of Semantic Web technologies with the storage and querying power of an RDBMS. Specifically, we propose: i) two schema mapping algorithms to map an arbitrary OWL provenance ontology to a relational database schema that is optimized for common provenance queries; ii) three efficient data mapping algorithms to map provenance RDF metadata to relational data according to the generated relational database schema, and iii) a schema-independent SPARQL-to-SQL translation algorithm that is optimized on-the-fly by using the type information of an instance available from the input provenance ontology and the statistics of the sizes of the tables in the database. While the schema mapping and query translation and optimization algorithms are applicable to general RDF storage and query systems, the data mapping algorithms are optimized for and applicable only to scientific workflow provenance metadata. Moreover, we extend SPARQL with negation, aggregation, and set operations to support additional important provenance queries. Experimental results are presented to show that our algorithms are efficient and scalable. The comparison with existing RDF stores, Jena and Sesame, showed that our optimizations result in improved performance and scalability for provenance metadata management.
Keywords: provenance, scientific workflow, metadata management, ontology, RDF, SPARQL-to-SQL translation, query optimization, RDF store.

1 Introduction

Today, many significant scientific discoveries are achieved through complex and distributed scientific computations. More and more scientists start to use work-