Big Data Provenance: Challenges and Implications for Benchmarking

Authors

Boris Glavic

Materials

pdf
slides

Abstract

Data provenance is information about the origin and creation process of data. Such information is useful for debugging data and transformations, auditing, evaluating the quality of and trust in data, modelling authenticity, and implementing access control for derived data. Provenance has been studied by the database, workflow, and distributed systems communities, but provenance for Big Data, which we refer to as Big Provenance, is a largely unexplored field. This paper reviews existing approaches for large-scale distributed provenance and discusses potential challenges for Big Data benchmarks that aim to incorporate provenance data and provenance management. Furthermore, we examine how Big Data benchmarking could benefit from different types of provenance information. We argue that provenance can be used to identify and analyze performance bottlenecks, to compute performance metrics, and to test a system’s ability to exploit commonalities in data and processing.

bibtex

@inproceedings{G13,
  author = {Glavic, Boris},
  booktitle = {2nd Workshop on Big Data Benchmarking},
  isworkshop = {true},
  keywords = {Big Data; Provenance; Big Provenance},
  pages = {72--80},
  pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/G13.pdf},
  projects = {Big Provenance},
  slideurl = {http://www.slideshare.net/lordPretzel/wbdb-2012-wbdb},
  title = {Big Data Provenance: Challenges and Implications for Benchmarking},
  venueshort = {WBDB},
  year = {2012},
  bdsk-url-1 = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/G13.pdf}
}

Reference

Big Data Provenance: Challenges and Implications for Benchmarking. Boris Glavic. 2nd Workshop on Big Data Benchmarking (2012), pp. 72–80.