Data provenance is information about the origin and creation process of data. The provenance of data item, an atomic piece of data, may include source and intermediate data as well as the transformations involved in producing a concrete data item. In the context of relational databases, the source and intermediate data items are relations, tuples and attribute values. The transformations are SQL queries and/or functions on the relational data items. Provenance is of immense importance in a variety of application domains. It can be used to debug queries and clean data in data warehouses, to understand and correct complex data integration transformations, and to understand the value of data in curated databases. Provenance generation has also been used as supporting technology for exchanging updates between heterogeneous databases and in modeling uncertainty in databases. While provenance has many applications, these applications often place very high requirements on a provenance management system to be useful in practice.
Unfortunately, few practical relational provenance systems exist and most support only small subsets of SQL. To address this gap, we have developed Perm (Provenance Extension of the Relational Model), a DBMS that generates different types of provenance information for complex SQL queries (e.g., we support nested and correlated subqueries and aggregation). The two key ideas behind Perm are representing data and its provenance together in a single relation and relying on query rewrites to generate this representation. Through this, Perm supports fully integrated, on-demand provenance generation and querying using SQL. Since Perm rewrites a query requesting provenance into a regular SQL query and generates easily optimizable SQL code, its performance greatly benefits from the query optimization techniques provided by the underlying DBMS. We have released the system as open source on https://github.com/IITDBGroup/permGitHub.
Perm represents provenance information as relations that are generated and queried on-demand using standard SQL queries. If the users requests one of the provenance types supported by Perm for a query q using the SQL-PLE language extension, the system transforms q into a query that returns the provenance of q in addition to the regular results of q. Perm supports the provenance shown in the table below.
As an example for the type of provenance generated by Perm consider the query and database shown below (For convenience, we shown an identifier for each tuple at the left of the tuple). This query joins a Customer with an Address relation to return customers names and associated cities. Each tuple in the result of this query is produced by joining a tuple from relation Customer with a tuple from relation Address. For example, tuple \(t_1\) is derived from tuples \(c_1\) and \(a_1\). If we use Perm to generate the provenance of this query, the result would be the relation shown below. Each tuple in this relation is a tuple from the original query result augmented with tuples from its provenance. For example, \(t_1\) is paired with \(c_1\) and \(a_1\) and \(t_2\) is paired with \(c_2\) and \(a_2\).
An overview of the main features of Perm are:
- Support for different types of provenance:
- Why-Provenance: Perm-Influence Contribution Semantics (PI-CS)
- Where-Provenance: Copy Contribution Semantics (C-CS), Where-Provenance (Buneman et al.)
- How-Provenance: Provenance Polynomials
- Transformation Provenance
- Provenance generation for complex SQL Queries:
- Select-Project-Join (SPJ)
- Set Operations
- Nested and Correlated Subqueries
- Relational Representation of Provenance
- SQL Language Extension (SQL-PLE) for On-demand Provenance Generation
- Query Rewrite Based Computation of Provenance inside the DBMS
Relational Provenance Representation
Perm represents the provenance of a query q as a single relation that contains both the original query results of q and its provenance. Provenance information is attached to a query result tuple by extending the tuple with additional attributes that are used to store provenance information. Regular result tuples are duplicated if necessary to “fit in” their complete provenance. PI-CS and C-CS, the two data provenance semantics developed for Perm, represent provenance as so-called witness lists. A witness-list is a list of input tuples to a query that were used together to derive an output tuple; one from each input relation of the query (leaves of the algebra tree) or the special value \(\perp\) which indicates that no tuple from the relation at this position contributed. The relational representation of PI-CS and C-CS appends all attributes from the relations accessed by the query to the query’s result schema. The additional attributes in the provenance representation are used to extend a result tuple with all tuples from one of its witness lists. Thus, tuples with more than one witness list in their provenance are duplicated and each duplicate is paired with the relational encoding of one witness list. To distinguish between regular result attributes and provenance attributes, the later are identified by a prefix and the name of the relation they are derived from (adding a distinguishing identifier for relations that are accessed more than once by the query). The special value \(\perp\) used in witness lists is modeled as NULL values in the representation.
Consider the query and example database shown above. The PI-CS provenance is shown below. Provenance attribute names for PI-CS are given in a separate table to simplify the example. Tuple \(t_2\) in the result of the query was derived by joining tuple \(u_2\) with tuples \(c_2\) and \(c_3\). Thus, the PI-CS provenance of tuple \(t_2\) consists of two witness lists \(<u_2,c_2,\perp>\) and \(<u_2,c_3,\perp>\). These are represented as two tuples in the relational representation by duplicating \(t_2\) and pairing each duplicate with the tuples from one of the witness lists.
The figure on the right shows the transformation provenance for tuples \(t_1\) and \(t_5\). Tuple \(t_1\) is derived from the left input of the union without any influence from its right input. Therefore, the right input of the union is enclosed in a NOT tag in the transformations provenance of \(t_1\). Similarly, tuple \(t_5\) was produced by the right branch of the union. Thus, the left branch is enclosed in NOT.
Query Rewrite Techniques
The research underlying Perm has demonstrated that SQL is powerful enough to express the computation of provenance for a large subset of queries expressible in SQL. The approach supports aggregations, set operations, nested or correlated subqueries, and user-defined functions. We do not support non-deterministic functions that return different results for the same input in the scope of one query. For example, a random number generator is a non-deterministic function.
Requesting the provenance of a query \(q\) through the system's SQL extensions (see below) instructs Perm to rewrite \(q\) into a standard SQL query that returns one type of provenance for \(q\) using the provenance representation introduced above. The query rewrites for each provenance type were developed following the process shown in the Figure to the right. (1) We state a provenance type's semantics as a declarative definition and define a relational representation. This approach was chosen because correctness criteria one would intuitively expect to hold for provenance are easily stated declaratively. For instance, for data provenance, the provenance of a tuple \(t\) from the result of a query \(q\) should contain sufficient information to produce the tuple \(t\). (2) From the declarative definition we derive algebraic rewrites which transform a query into a provenance-generating query and prove their correctness. (3) A canonical translation is applied to translate the algebraic rewrites into SQL rewrites.
The seamless integration of provenance generation as an SQL language feature has many advantages. We can provide full SQL query support for provenance information. The rewrite rules are unaware of how the provenance attributes of their input were produced. Thus, they can be used to propagate provenance information that was created manually or by another provenance management system. A query over provenance data is implemented as a regular SQL query with a subquery that implements the provenance generation. Thus, we fully utilize the DBMS optimizer to speed up provenance computation by, e.g., pushing selections and projections applied by a query into the provenance generation. Since optimizing provenance generation is still in its infancy, this is a feasible approach for efficient provenance generation and querying (e.g., we can efficiently compute the PI-CS provenance of the TCP-H benchmark queries for a 1GB TCP-H instance).
SQL Language Extension (SQL-PLE)
The provenance language extension (SQL-PLE) of Perm enriches SQL with additional keywords to request provenance, control how far to trace provenance, and to inform the system about existing provenance information.
The query \(q\) shown above runs over the schema introduced in the section about provenance representation. This query returns the months during which customers with at least two credit cards exceeded their credit limit on some card. To understand from which inputs of \(q\) the result tuple \(t_2\) (Joe,Feb) is derived, a user needs access to the data provenance of the query and the ability to query this information. For example, a user may be interested in knowing if some of these over-drafts are caused by suspiciously low credit card limits. This question can be answered by running a query over the provenance of \(q\) to retrieve tuples in the result of \(q\) that depend on credit card tuples with low limit values (i.e., these credit card tuples belong to the data provenance of the tuples to be returned). Alternatively, if the user is interested in where the name attribute values in the query result have been copied from, a different type of provenance is needed that tracks copying of information instead of which inputs caused a tuple to appear in the query result.
The keyword PROVENANCE is employed in the SELECT clause of a query \(q\) to instruct Perm to compute the provenance of \(q\). An optional ON CONTRIBUTION modifier is used to choose the provenance type that is produced (PI-CS is the default). For example, the query below returns the PI-CS provenance of the query presented above.
Note that all original SQL features provided by PostgreSQL are not affected by the language extension, and even more important, they can be used in combination with provenance computation. Given the provenance representation of Perm this enables complex queries that filter provenance based on properties of the input tuples in the provenance and/or the result tuples of the query.
Assume the user expected the running example query to return less credit over-drafts. Her assumption is that some over-drafts are caused by credit card limits which have been recorded too low. The user runs the following query to determine which over-drafts are caused by (have tuples in their provenance with) suspiciously low credit card limits (say $500):
The default behaviour is to generate the provenance of a complete query by tracing which tuples in the query's input affected which tuples in the query result. Perm also supports limiting the provenance generation to parts of a query to trace the effect of intermediate query results instead of the input relations. The keyword BASERELATION is appended to an item in the FROM clause to limit how far back the provenance is computed.
Retrieving the full provenance of the running example query may return a large number of tuples, because each aggregated monthly amount (subquery monthly) can depend on a large number of individual purchases. Questions like the one from the initial example can be answered without information about the influence of each individual purchase tuple. The user can mark the subquery monthly with the BASERELATION keyword to only investigate the effect of the aggregated monthly amounts.
Perm can handle existing provenance information that was not produced by the system itself as long as (1) it is stored in additional attributes of tuples following the representation used by Perm and (2) the system is made aware of which attributes store provenance information (by appending the keyword PROVENANCE followed by a list of attribute names to the FROM-clause item).
The imports relation from the running example stores from which data sources each purchase tuple is imported. This is a kind of provenance information for the purchase relation. Joining the imports relation with the purchase relation and using the PROVENANCE keyword in the FROM clause, the user makes Perm aware of the existence of the additional provenance data. The system will treat this provenance in the same way as provenance generated by the system itself. The modified example query is shown below.
Perm is implemented as a modified PostgreSQL engine, extending its SQL dialect with provenance features. Provenance generation in Perm is light-weight and lazy: no provenance is generated unless explicitly requested. Thus, if the provenance features of Perm are not used, the system behaves like a normal Postgres server - clients will observe no overhead in runtime or storage space. The figure below shows the architecture of the system. The parser and analyzer module of PostgreSQL (extended to recognize SQL-PLE) parse incoming SQL queries and transform them into an internal tree representation. The output of the analyzer module is passed to the Perm rewrite module. This module implements the query rewrite rules as transformations on query trees. The rewritten query tree produced by the Perm module is handed over to the original Postgres optimizer. From the optimizer’s point of view the input it retrieves is a regular SQL query.
The Perm Browser is a client application that was developed for demonstration purposes. The browser is written in Java and connects to a Perm server instance though JDBC. Given an input query from the user, the browser will show the results of the query, the SQL text for the rewritten query, algebra trees for both the original and rewritten query, and statistics about the execution of the original and rewritten query (e.g., number of result tuples, execution time, ...). A screenshot of the application is shown below.
- Gustavo Alonso - Professor at ETH Zurich Systems Group
- Renée J. Miller - Professor at the University of Toronto Database Group
|||Using SQL for Efficient Generation and Querying of Provenance Information , In In search of elegance in the theory and practice of computation: a Festschrift in honour of Peter Buneman (Festschrift Peter Buneman), 2013. [bibtex] [pdf]|
|||Perm: Efficient Provenance Support for Relational Databases , (PhD Thesis), PhD thesis, University of Zurich, 2010. [bibtex] [pdf]|
|||The Perm Provenance Management System in Action , In Proceedings of the 35th ACM SIGMOD International Conference on Management of Data (Demonstration Track) (SIGMOD), 2009. [bibtex] [pdf]|
|||Provenance for Nested Subqueries , In Proceedings of the 12th International Conference on Extending Database Technology (EDBT), 2009. [bibtex] [pdf] [slides]|
|||Perm: Processing Provenance and Data on the same Data Model through Query Rewriting , In Proceedings of the 25th IEEE International Conference on Data Engineering (ICDE), 2009. [bibtex] [pdf] [slides]|