The sheer amount of data available in the Big Data age necessitates techniques such as classification and aggregation that extract meaningful information from this data so it can be consumed by humans. To assess the validity of extracted information, a human analyst needs to be able to investigate the extraction process and explore which inputs led to a particular result, i.e., to analyze the result's provenance. Beyond explaining how a result was derived and from which input data, provenance information is used for auditing, verifying results, resolving conflicts among data sources, establishing ownership of data, and evaluating data quality. The objective of this project is to make provenance usable for Big Data environments. In this context, we will study the following research questions:
- How to port provenance tracking techniques from relational databases to Big Data platforms.
- How to leverage data summarization techniques to create compact representations of provenance that are meaningful to a human, and how to create these compact descriptions without having to store all input data of the extraction process for an indefinite amount of time.
- How to identify which input data items have the most influence on a piece of extracted information.
- How to support efficient, interactive exploration of provenance through iterative refinement of condensed and approximate representations.
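To illustrate the kind of provenance tracking studied here, the following is a minimal sketch of tuple-level provenance for a grouping aggregation: alongside each aggregate value we record the set of input row identifiers that contributed to it, so an analyst can trace a result back to its inputs. All names and the data model are illustrative assumptions, not part of any system produced by this project.

```python
# Sketch: track which input rows explain each aggregated result.
from collections import defaultdict

def sum_with_provenance(rows, key, value):
    """Group `rows` by `key`, sum `value`, and record provenance:
    the ids of the input rows that produced each output value."""
    totals = defaultdict(int)
    provenance = defaultdict(set)
    for row in rows:
        totals[row[key]] += row[value]
        provenance[row[key]].add(row["id"])
    # Each output pairs the aggregate with the ids that explain it.
    return {k: (totals[k], provenance[k]) for k in totals}

rows = [
    {"id": 1, "dept": "sales", "amount": 10},
    {"id": 2, "dept": "sales", "amount": 5},
    {"id": 3, "dept": "hr", "amount": 7},
]
result = sum_with_provenance(rows, "dept", "amount")
# result["sales"] == (15, {1, 2}): rows 1 and 2 explain the sales total
```

Storing the full id set per result is exactly what becomes infeasible at Big Data scale, which motivates the compact, approximate provenance representations listed above.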
- Dieter Gawlick - Oracle
- Vasudha Krishnaswamy - Oracle
- Venkatesh Radhakrishnan
- Zhen Hua Liu - Oracle
- Oracle - Provenance for Big Data (2017 - 2018), $90,871, PI: Boris Glavic
Big Data Provenance: Challenges and Implications for Benchmarking
2nd Workshop on Big Data Benchmarking (2012), pp. 72–80.