Using SQL for Efficient Generation and Querying of Provenance Information

Boris Glavic¹, Renée J. Miller², and Gustavo Alonso³

¹ Illinois Institute of Technology, bglavic@iit.edu
² University of Toronto, miller@cs.toronto.edu
³ ETH Zurich, alonso@inf.ethz.ch

Abstract. In applications such as data warehousing or data exchange, the ability to efficiently generate and query provenance information is crucial to understand the origin of data. In this chapter, we review some of the main contributions of Perm, a DBMS that generates different types of provenance information for complex SQL queries (including nested and correlated subqueries and aggregation). The two key ideas behind Perm are representing data and its provenance together in a single relation and relying on query rewrites to generate this representation. Through this, Perm supports fully integrated, on-demand provenance generation and querying using SQL. Since Perm rewrites a query requesting provenance into a regular SQL query and generates easily optimizable SQL code, its performance greatly benefits from the query optimization techniques provided by the underlying DBMS.

1 Introduction

Peter Buneman was one of the first to recognize the importance of data provenance. With co-authors Khanna and Tan, he introduced two seminal models of Why- and Where-provenance [7]. Provenance, information about the creation process or the origin of data, can be used to debug queries and clean data in data warehouses, to understand and correct complex data integration transformations, for auditing, and to understand the value of data in curated databases. Provenance generation has also been used as a supporting technology for exchanging updates between heterogeneous databases [21], to provide access control based on the origin of data [31], and in modeling uncertainty in databases [35].

While provenance has many applications, these applications often place very demanding requirements on a provenance management system if it is to be useful in practice. In this chapter, we give an overview of the contributions of the Perm provenance management system [17]. Perm was designed as a scalable system for the generation and querying of provenance information over relational data. To understand the requirements for such a system, we begin with an example and then consider the foundations in provenance research on which Perm builds.