Provenance Unification through Graphs with Summarization (PUGS)
Provenance for relational queries records how results of a query depend on the query’s inputs. This type of information can be used to explain why (and how) a result is derived by a query over a given database. Recently, approaches have been developed that use provenance-like techniques to explain why a tuple (or a set of tuples described declaratively by a pattern) is missing from the query result. However, the two problems of computing provenance and explaining missing answers have been treated mostly in isolation.
Unifying Why and Why-not Provenance
In collaboration with UIUC, we have developed a unified approach for efficiently computing provenance (why) and missing answers (why-not). This approach is based on the observation that in provenance model for queries with unsafe negation, why-not questions can be translated into why questions and vice versa.
Typically only a part of the provenance, which we call explanation, is actually relevant for answering the user’s provenance question about the existence or absence of a result. We have developed an approach that tries to restrict provenance capture to what is relevant to explain the outcome of interest specified by the user.
Approximate Summarization of Provenance
While explanations are useful for reducing the size of provenance, the result may still overwhelm users with too much information and waste computational resources. In particular for why-not, where the provenance explains all failed ways of how a result could have been derived using the rules of a query, provenance graphs may be too large to compute even for small datasets. We address the computational and usability challenges of large provenance graphs by creating summaries based on structural commonalities that exist in the provenance. Importantly, our approach computes summaries based on a sample of the provenance only and, thus, avoids the computationally infeasible step of generating the full why-not provenance for a user question. We have implemented these techniques in our PUGS system.
The PUGS System
We have build a system which supports computing explanations for a user's why and why-not provenance question as a, optionally summarized, provenance graph. Our approach offloads the computation that captures relevant provenance and summarizes it to a datalog engine or relational database backend. The figure below gives an overview of the architecture of the proposed approach.
|||PUG: a framework and practical implementation for why and why-not provenance In The VLDB Journal, 2018 [bibtex] [pdf] [extended version] [doi]|
|||Provenance Summaries for Answers and Non-Answers In Proceedings of the VLDB Endowment (Demonstration Track) (PVLDB), volume 11, 2018 [bibtex] [pdf]|
|||Integrating Approximate Summarization with Provenance Capture In Proceedings of the 8th USENIX Workshop on the Theory and Practice of Provenance (TaPP), 2017 [bibtex] [pdf]|
|||A SQL-Middleware Unifying Why and Why-Not Provenance for First-Order Queries In Proceedings of the 33rd IEEE International Conference on Data Engineering (ICDE), 2017 [bibtex] [pdf]|
|||Implementing Unified Why- and Why-Not Provenance Through Games In Proceedings of the 8th USENIX Workshop on the Theory and Practice of Provenance (Poster) (TaPP), 2016 [bibtex] [pdf]|
|||Towards Constraint-based Explanations for Answers and Non-Answers In Proceedings of the 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP), 2015 [bibtex] [pdf] [slides]|