GProM is a generic provenance database middleware that computes provenance for SQL queries, updates, and transactions on demand.
Provenance tracking for database operations, i.e., automatically collecting and managing information about the origin of data, has received considerable interest from the database community in the last decade. Efficiently generating and querying provenance is essential for debugging data and queries, evaluating trust measures for data, defining new types of access control models, auditing, and as a supporting technology for data integration and probabilistic databases. The de-facto standard for database provenance is to model provenance as annotations on data and compute the provenance for the outputs of an operation by propagating annotations. Many provenance systems use a relational encoding of provenance annotations. These systems apply query rewrite techniques to transform a query q into a query that propagates input annotations to produce the result of q annotated with provenance. This approach has many advantages. It benefits from existing database technology, e.g., provenance computations are optimized by the database optimizer. Queries over provenance can be expressed as SQL queries over the relational encoding. Alternatively, we can compile a special-purpose provenance query language into SQL queries over such an encoding. In this project we advance the current state-of-the-art in several aspects:
- We have developed the first provenance model and capture mechanism for transactional updates.
- We achieve interoperability with other provenance systems by supporting import and export of provenance represented as PROV-JSON
- We have developed a cost-based optimizer for provenance instrumentation pipelines.
- We have build GProM, a generic provenance middleware that supports multiple frontend languages and backend databases.
GProM is a database middleware that adds provenance support to multiple database backends. Provenance is information about how data was produced by database operations. That is, for a row in the database or returned by a query we capture from which rows it was derived and by which operations. The system compiles declarative queries with provenance requests into SQL code and executes this SQL code on a backend database system. GProM supports provenance capture for SQL queries and transactions, and produces provenance graphs explaining existing and missing answers for Datalog queries. Provenance is captured on demand by using a compilation technique called instrumentation. Instrumentation rewrites an SQL query (or past transaction) into a query that returns rows paired with their provenance. The output of the instrumentation process is a regular SQL query that can be executed using any standard relational database. The instrumented query generated from a provenance request returns a standard relation that maps rows to their provenance. Provenance for transactions is captured retroactively using a declarative replay technique called reenactment that we have developed at IIT. GProM extends multiple frontend languages (e.g., SQL and Datalog) with provenance requests and can produce code for multiple backends (currently Oracle). The reenactment approach was developed in collaboration with Oracle as part of the the provenance for temporal databases project. Other noteworthy features of GProM include: support for multiple database backends and an optimizer for rewritten queries.
- Database independent
- Provenance for queries, updates, and transactions
- User decides when to compute and store provenance
- Supports multiple provenance models
- Retroactive on-demand provenance computation using instrumentation and reenactment
- Only requires audit log and time travel
- No additional runtime and storage overhead
- Non-invasive provenance computation using query instrumentation and annotation propagation
- Provenance import/export
- Heuristic and Cost-based optimizations for instrumented queries
Architecture and Approach
An overview of GProM's architecture is shown above. The user interacts
with the system using an extension of one of the supported
Similar to Perm (and other systems) we represent provenance information using a relational encoding of provenance annotations. This representation is flexible enough to encode typical database provenance models including PI-CS (and, thus, provenance polynomials), Where- and Why-provenance, and many others. The provenance rewriter module (2) uses provenance-type specific rules to rewrite an input query \(q\) into a query \(q^+\) that propagates annotations to produce an encoding of data annotated with provenance (we call this process instrumentation.
Supporting Past Queries, Updates, and Transactions
One unique feature of GProM is that the system can retroactively compute the provenance of queries, updates, and transactions. This feature requires that a log of database operations is available (we call this an audit log) and that the underlying database system supports time travel, i.e., querying past versions of a relation. These features are available in most database systems or can be added using extensibility mechanisms. An audit log paired with time travel functionality is sufficient for computing the provenance of past queries using simple modifications of standard provenance rewrites. Our main contribution is to demonstrate that this is also sufficient for tracking the provenance of updates and transactions. If the user requests provenance for a transaction \(T\), the transaction reenactor (3) extracts the list of SQL statements executed by \(T\) from the audit log and constructs a reenactment query \(q(T)\) that simulates the effects of these statements. This query runs over the database version valid at the time when the transaction was executed.
We use the provenance rewriter to rewrite \(q(T)\) into a query \(q(T)^+\) that computes the provenance of the reenacted transaction. Note that the construction of \(q(T)\) is independent of the provenance rewrite and \(q(T)\) is a standard relational query. This is because the reenactment query \(q(T)\) and transaction \(T\) are annotation-equivalent, i.e., they have the same result and provenance. Using this approach, we can compute any type of provenance for updates, transactions, and across transactions as long as rewrite rules for computing the provenance of queries have been implemented for this provenance type.
Optimizing Rewritten Queries
GProM includes an optimizer (7) which applies heuristic and cost-based rules to transform instrumented queries into SQL code that can be successfully optimized by the backend DBMS. This is necessary, because provenance rewrites generate queries with unusual access patterns and operator sequences. Even sophisticated database optimizers are not capable of producing reasonable plans for such queries.
Support for additional database backends can be added to GProM by implementing new parser, catalog lookup, and SQL code generator plugins. Here we benefit from our backend-independent relational algebra graph representation of queries, because all the remaining functionality, e.g., provenance computation, works on the database-independent algebraic representation of queries.Currently we support the following backend databases:
- Oracle: full support
- Postgres: no parser yet.
- SQLite: no parser yet
Provenance Language Extensions
The wiki of the github repository for GProM documents the SQL and Datalog frontend language extensions.
|2018 (to appear)|
|||Heuristic and Cost-based Optimization for Diverse Provenance Tasks In IEEE Transactions on Knowledge and Data Engineering (TKDE), 2018 (to appear) [bibtex] [pdf] [extended version] [doi]|
|||Uncertainty Annotated Databases - A Lightweight Approach for Dealing with Uncertainty Technical report, Illinois Institute of Technology, IIT/CS-DB-2018-01, 2018 [bibtex] [pdf]|
|||Using Reenactment to Retroactively Capture Provenance for Transactions In IEEE Transactions on Knowledge and Data Engineering (TKDE), volume 30, 2018 [bibtex] [pdf] [doi]|
|||GProM - A Swiss Army Knife for Your Provenance Needs In IEEE Data Engineering Bulletin (Data Eng. Bull.), volume 41, 2018 [bibtex] [pdf]|
|||Provenance-aware Query Optimization In Proceedings of the 33rd IEEE International Conference on Data Engineering (ICDE), 2017 [bibtex] [pdf]|
|||Integrating Approximate Summarization with Provenance Capture In Proceedings of the 8th USENIX Workshop on the Theory and Practice of Provenance (TaPP), 2017 [bibtex] [pdf]|
|||Debugging Transactions and Tracking their Provenance with Reenactment In Proceedings of the VLDB Endowment (Demonstration Track) (PVLDB), volume 10, 2017 [bibtex] [pdf]|
|||A SQL-Middleware Unifying Why and Why-Not Provenance for First-Order Queries In Proceedings of the 33rd IEEE International Conference on Data Engineering (ICDE), 2017 [bibtex] [pdf]|
|||Optimizing Provenance Capture and Queries - Algebraic Transformations and Cost-based Optimization Technical report, Illinois Institute of Technology, IIT/CS-DB-2016-02, 2016 [bibtex] [pdf]|
|||Provenance-aware Versioned Dataworkspaces In Proceedings of the 8th USENIX Workshop on the Theory and Practice of Provenance (TaPP), 2016 [bibtex] [pdf]|
|||Efficiently Computing Provenance Graphs for Queries with Negation Technical report, Illinois Institute of Technology, IIT/CS-DB-2016-03, 2016 [bibtex] [pdf]|
|||Implementing Unified Why- and Why-Not Provenance Through Games In Proceedings of the 8th USENIX Workshop on the Theory and Practice of Provenance (Poster) (TaPP), 2016 [bibtex] [pdf]|
|||Reenactment for Read-Committed Snapshot Isolation (long version) Technical report, Illinois Institute of Technology, 2016 [bibtex] [pdf]|
|||Reenactment for Read-Committed Snapshot Isolation In Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM), 2016 [bibtex] [pdf]|
|||Formal Foundations of Reenactment and Transaction Provenance Technical report, Illinois Institute of Technology, IIT/CS-DB-2016-01, 2016 [bibtex] [pdf]|
|||Interoperability for Provenance-aware Databases using PROV and JSON In Proceedings of the 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP), 2015 [bibtex] [pdf] [slides]|
|||Heuristic and Cost-based Optimization for Provenance Computation In Proceedings of the 7th USENIX Workshop on the Theory and Practice of Provenance (Poster) (TaPP), 2015 [bibtex] [pdf]|
|||An Efficient Implementation of Game Provenance in DBMS Technical report, Illinois Institute of Technology, IIT/CS-DB-2015-02, 2015 [bibtex] [pdf]|
|||Reenacting Transactions to Compute their Provenance Technical report, Illinois Institute of Technology, IIT/CS-DB-2014-02, 2014 [bibtex] [pdf]|
|||A Generic Provenance Middleware for Database Queries, Updates, and Transactions In Proceedings of the 6th USENIX Workshop on the Theory and Practice of Provenance (TaPP), 2014 [bibtex] [pdf] [slides]|