IIT Database Group

To not miss the forest for the trees - A holistic approach for explaining missing answers over nested data

Abstract

Query-based explanations for missing answers identify which operators of a query are responsible for the failure to return a missing answer of interest. This type of explanation has proven useful, e.g., to debug complex analytical queries. Such queries are frequent in big data systems such as Apache Spark. We present a novel approach to produce query-based explanations. It is the first to support nested data and to consider operators that modify the schema and structure of the data (e.g., nesting, projections) as potential causes of missing answers. To efficiently compute explanations, we propose a heuristic algorithm that applies two novel techniques: (i) reasoning about multiple schema alternatives for a query and (ii) re-validating at each step whether an intermediate result can contribute to the missing answer. Using an implementation on Spark, we demonstrate that our approach is the first to scale to large datasets while often finding explanations that existing techniques fail to identify.
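
To make the notion of a query-based explanation concrete, the following minimal Python sketch traces a missing answer through a small pipeline over nested data and reports the first operator after which no compatible row survives. It is only an illustration of the general idea and not the heuristic algorithm from the paper: it does not reason about schema alternatives or re-validate intermediate results, and it does not use Spark. The data, the operator names (`flatten`, `filter_recent`, `project_name_city`), and the helper `explain_missing` are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, object]
Operator = Tuple[str, Callable[[List[Row]], List[Row]]]

# Hypothetical nested input: each person carries a nested list of addresses.
people: List[Row] = [
    {"name": "Ann", "addresses": [{"city": "Chicago", "year": 2019}]},
    {"name": "Bob", "addresses": [{"city": "Stuttgart", "year": 2021}]},
]


def flatten(rows: List[Row]) -> List[Row]:
    # Unnest the address list: one output row per (person, address) pair.
    return [{"name": r["name"], **a} for r in rows for a in r["addresses"]]


def filter_recent(rows: List[Row]) -> List[Row]:
    # Selection: keep only addresses from 2020 onwards.
    return [r for r in rows if r["year"] >= 2020]


def project_name_city(rows: List[Row]) -> List[Row]:
    # Projection: drop the year attribute.
    return [{"name": r["name"], "city": r["city"]} for r in rows]


pipeline: List[Operator] = [
    ("flatten", flatten),
    ("filter_recent", filter_recent),
    ("project_name_city", project_name_city),
]


def explain_missing(
    rows: List[Row],
    operators: List[Operator],
    compatible: Callable[[Row], bool],
) -> Optional[str]:
    """Return the first operator whose output contains no compatible row.

    Returns "<input>" if the input itself has no compatible row, and None
    if a compatible row reaches the final result (nothing is missing).
    """
    if not any(compatible(r) for r in rows):
        return "<input>"
    for name, op in operators:
        rows = op(rows)
        if not any(compatible(r) for r in rows):
            return name
    return None


# Why is there no result row for Ann?
culprit = explain_missing(people, pipeline, lambda r: r.get("name") == "Ann")
print(culprit)
```

In this toy setting the selection is blamed because Ann's only address predates 2020. A tracing scheme this simple can only blame operators that drop rows; according to the abstract, the paper's approach additionally considers operators that change the schema and structure of the data, such as nesting and projections.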

BibTeX

@inproceedings{DL21,
  author = {Diestelk{\"a}mper, Ralf and Lee, Seokki and Herschel, Melanie and Glavic, Boris},
  booktitle = {Proceedings of the 46th International Conference on Management of Data},
  pages = {405--417},
  projects = {},
  pdfurl = {https://dl.acm.org/doi/pdf/10.1145/3448016.3457249},
  title = {To not miss the forest for the trees - A holistic approach for explaining missing answers over nested data},
  doi = {10.1145/3448016.3457249},
  video = {https://www.youtube.com/watch?v=q_YCcP5mGIk&list=PL3xUNnH4TdbsfndCMn02BqAAgGB0z7cwq},
  keywords = {Provenance; Missing Answers},
  venueshort = {SIGMOD},
  longversionurl = {https://arxiv.org/pdf/2103.07561},
  year = {2021}
}

Reference

To not miss the forest for the trees - A holistic approach for explaining missing answers over nested data. Ralf Diestelkämper, Seokki Lee, Melanie Herschel and Boris Glavic. Proceedings of the 46th International Conference on Management of Data (2021), pp. 405–417.