IIT Database Group

Fork me on GitHub

GProM

GProM is a generic provenance database middleware that computes provenance for SQL queries, updates, and transactions on demand.

Provenance tracking for database operations, i.e., automatically collecting and managing information about the origin of data, has received considerable interest from the database community in the last decade. Efficiently generating and querying provenance is essential for debugging data and queries, evaluating trust measures for data, defining new types of access control models, auditing, and as a supporting technology for data integration and probabilistic databases. The de-facto standard for database provenance is to model provenance as annotations on data and compute the provenance for the outputs of an operation by propagating annotations. Many provenance systems use a relational encoding of provenance annotations. These systems apply query rewrite techniques to transform a query q into a query that propagates input annotations to produce the result of q annotated with provenance. This approach has many advantages. It benefits from existing database technology, e.g., provenance computations are optimized by the database optimizer. Queries over provenance can be expressed as SQL queries over the relational encoding. Alternatively, we can compile a special-purpose provenance query language into SQL queries over such an encoding. In this project we advance the current state-of-the-art in several aspects:

• We have developed the first provenance model and capture mechanism for transactional updates.
• We achieve interoperability with other provenance systems by supporting import and export of provenance represented as PROV-JSON
• We have developed a cost-based optimizer for provenance instrumentation pipelines.
• We have build GProM, a generic provenance middleware that supports multiple frontend languages and backend databases.

GProM is a database middleware that adds provenance support to multiple database backends. Provenance is information about how data was produced by database operations. That is, for a row in the database or returned by a query we capture from which rows it was derived and by which operations. The system compiles declarative queries with provenance requests into SQL code and executes this SQL code on a backend database system. GProM supports provenance capture for SQL queries and transactions, and produces provenance graphs explaining existing and missing answers for Datalog queries. Provenance is captured on demand by using a compilation technique called instrumentation. Instrumentation rewrites an SQL query (or past transaction) into a query that returns rows paired with their provenance. The output of the instrumentation process is a regular SQL query that can be executed using any standard relational database. The instrumented query generated from a provenance request returns a standard relation that maps rows to their provenance. Provenance for transactions is captured retroactively using a declarative replay technique called reenactment that we have developed at IIT. GProM extends multiple frontend languages (e.g., SQL and Datalog) with provenance requests and can produce code for multiple backends (currently Oracle). The reenactment approach was developed in collaboration with Oracle as part of the the provenance for temporal databases project. Other noteworthy features of GProM include: support for multiple database backends and an optimizer for rewritten queries.

System Highlights

• Database independent
• Provenance for queries, updates, and transactions
• User decides when to compute and store provenance
• Supports multiple provenance models
• Retroactive on-demand provenance computation using instrumentation and reenactment
• Only requires audit log and time travel
• Non-invasive provenance computation using query instrumentation and annotation propagation
• Provenance import/export
• Heuristic and Cost-based optimizations for instrumented queries

Architecture and Approach

An overview of GProM's architecture is shown above. The user interacts with the system using an extension of one of the supported frontend languages (currently SQL and Datalog). Specifically, we support new language constructs for capturing and managing provenance (similar to Perm). Incoming statements are translated into a relational algebra graph representation which we call (1). Similar to intermediate code representations used by compilers, we use relational algebra as an intermediate representation of computation which is independent of the target language. If the statement does not use any provenance features, then the algebra graph is translated back into the declarative language understood by the backend, e.g., we support several native SQL dialects using vendor specific SQL code generators (7). If the input query uses one of the provenance extensions supported by our system, then several instrumentation modules may get involved to serve the user request.

Provenance Computation

Similar to Perm (and other systems) we represent provenance information using a relational encoding of provenance annotations. This representation is flexible enough to encode typical database provenance models including PI-CS (and, thus, provenance polynomials), Where- and Why-provenance, and many others. The provenance rewriter module (2) uses provenance-type specific rules to rewrite an input query $q$ into a query $q^+$ that propagates annotations to produce an encoding of data annotated with provenance (we call this process instrumentation.

Supporting Past Queries, Updates, and Transactions

One unique feature of GProM is that the system can retroactively compute the provenance of queries, updates, and transactions. This feature requires that a log of database operations is available (we call this an audit log) and that the underlying database system supports time travel, i.e., querying past versions of a relation. These features are available in most database systems or can be added using extensibility mechanisms. An audit log paired with time travel functionality is sufficient for computing the provenance of past queries using simple modifications of standard provenance rewrites. Our main contribution is to demonstrate that this is also sufficient for tracking the provenance of updates and transactions. If the user requests provenance for a transaction $T$, the transaction reenactor (3) extracts the list of SQL statements executed by $T$ from the audit log and constructs a reenactment query $q(T)$ that simulates the effects of these statements. This query runs over the database version valid at the time when the transaction was executed.

We use the provenance rewriter to rewrite $q(T)$ into a query $q(T)^+$ that computes the provenance of the reenacted transaction. Note that the construction of $q(T)$ is independent of the provenance rewrite and $q(T)$ is a standard relational query. This is because the reenactment query $q(T)$ and transaction $T$ are annotation-equivalent, i.e., they have the same result and provenance. Using this approach, we can compute any type of provenance for updates, transactions, and across transactions as long as rewrite rules for computing the provenance of queries have been implemented for this provenance type.

Optimizing Rewritten Queries

GProM includes an optimizer (7) which applies heuristic and cost-based rules to transform instrumented queries into SQL code that can be successfully optimized by the backend DBMS. This is necessary, because provenance rewrites generate queries with unusual access patterns and operator sequences. Even sophisticated database optimizers are not capable of producing reasonable plans for such queries.

Database Backends

Support for additional database backends can be added to GProM by implementing new parser, catalog lookup, and SQL code generator plugins. Here we benefit from our backend-independent relational algebra graph representation of queries, because all the remaining functionality, e.g., provenance computation, works on the database-independent algebraic representation of queries.

Provenance Language Extensions

The wiki of the github repository for GProM documents the SQL and Datalog frontend language extensions.

Publications

1. Heuristic and Cost-based Optimization for Diverse Provenance Tasks
Xing Niu, Raghav Kapoor, Boris Glavic, Dieter Gawlick, Zhen Hua Liu, Vasudha Krishnaswamy and Venkatesh Radhakrishnan
IEEE Transactions on Knowledge and Data Engineering. 31, 7 (2019) , 1267–1280.
details
Bahareh Arab
Illinois Institue of Technology.
details
3. GProM - A Swiss Army Knife for Your Provenance Needs
Bahareh Arab, Su Feng, Boris Glavic, Seokki Lee, Xing Niu and Qitian Zeng
IEEE Data Engineering Bulletin. 41, 1 (2018) , 51–62.
details
4. Using Reenactment to Retroactively Capture Provenance for Transactions
Bahareh Arab, Dieter Gawlick, Vasudha Krishnaswamy, Venkatesh Radhakrishnan and Boris Glavic
IEEE Transactions on Knowledge and Data Engineering. 30, 3 (2018) , 599–612.
details
5. Debugging Transactions and Tracking their Provenance with Reenactment
Xing Niu, Boris Glavic, Seokki Lee, Bahareh Arab, Dieter Gawlick, Zhen Hua Liu, Vasudha Krishnaswamy, Su Feng and Xun Zou
Proceedings of the VLDB Endowment (Demonstration Track). 10, 12 (2017) , 1857–1860.
details
6. A SQL-Middleware Unifying Why and Why-Not Provenance for First-Order Queries
Seokki Lee, Sven Köhler, Bertram Ludäscher and Boris Glavic
Proceedings of the 33rd IEEE International Conference on Data Engineering (2017), pp. 485–496.
details
7. Integrating Approximate Summarization with Provenance Capture
Seokki Lee, Xing Niu, Bertram Ludäscher and Boris Glavic
Proceedings of the 8th USENIX Workshop on the Theory and Practice of Provenance (2017).
details
8. Provenance-aware Query Optimization
Xing Niu, Raghav Kapoor, Boris Glavic, Dieter Gawlick, Zhen Hua Liu, Vasudha Krishnaswamy and Venkatesh Radhakrishnan
Proceedings of the 33rd IEEE International Conference on Data Engineering (2017), pp. 473–484.
details
9. Optimizing Provenance Capture and Queries - Algebraic Transformations and Cost-based Optimization
Xing Niu and Boris Glavic
Technical Report #IIT/CS-DB-2016-02
Illinois Institute of Technology.
details
10. Formal Foundations of Reenactment and Transaction Provenance
Bahareh Arab, Dieter Gawlick, Vasudha Krishnaswamy, Venkatesh Radhakrishnan and Boris Glavic
Technical Report #IIT/CS-DB-2016-01
Illinois Institute of Technology.
details
11. Provenance-aware Versioned Dataworkspaces
Xing Niu, Bahareh Arab, Dieter Gawlick, Zhen Hua Liu, Vasudha Krishnaswamy, Oliver Kennedy and Boris Glavic
Proceedings of the 8th USENIX Workshop on the Theory and Practice of Provenance (2016).
details
12. Reenactment for Read-Committed Snapshot Isolation
Bahareh Arab, Dieter Gawlick, Vasudha Krishnaswamy, Venkatesh Radhakrishnan and Boris Glavic
Proceedings of the 25th ACM International Conference on Information and Knowledge Management (2016), pp. 841–850.
details
13. Reenactment for Read-Committed Snapshot Isolation (long version)
Bahareh Arab, Dieter Gawlick, Vasudha Krishnaswamy, Venkatesh Radhakrishnan and Boris Glavic
Illinois Institute of Technology.
details
14. Efficiently Computing Provenance Graphs for Queries with Negation
Seokki Lee, Sven Köhler, Bertram Ludäscher and Boris Glavic
Technical Report #IIT/CS-DB-2016-03
Illinois Institute of Technology.
details
15. Implementing Unified Why- and Why-Not Provenance Through Games
Seokki Lee, Sven Köhler, Bertram Ludäscher and Boris Glavic
Proceedings of the 8th USENIX Workshop on the Theory and Practice of Provenance (Poster) (2016).
details
16. An Efficient Implementation of Game Provenance in DBMS
Seokki Lee, Yuchen Tang, Sven Köhler, Bertram Ludäscher and Boris Glavic
Technical Report #IIT/CS-DB-2015-02
Illinois Institute of Technology.
details
17. Heuristic and Cost-based Optimization for Provenance Computation
Xing Niu, Raghav Kapoor and Boris Glavic
Proceedings of the 7th USENIX Workshop on the Theory and Practice of Provenance (Poster) (2015).
details
18. Interoperability for Provenance-aware Databases using PROV and JSON
Xing Niu, Raghav Kapoor, Dieter Gawlick, Zhen Hua Liu, Vasudha Krishnaswamy, Venkatesh Radhakrishnan and Boris Glavic
Proceedings of the 7th USENIX Workshop on the Theory and Practice of Provenance (2015).
details
19. A Generic Provenance Middleware for Database Queries, Updates, and Transactions
Bahareh Arab, Dieter Gawlick, Venkatesh Radhakrishnan, Hao Guo and Boris Glavic
Proceedings of the 6th USENIX Workshop on the Theory and Practice of Provenance (2014).
details
20. Reenacting Transactions to Compute their Provenance
Bahareh Arab, Dieter Gawlick, Vasudha Krishnaswamy, Venkatesh Radhakrishnan and Boris Glavic
Technical Report #IIT/CS-DB-2014-02
Illinois Institute of Technology.
details