Relevance-based Data Management

Most provenance research has operated under the assumption that provenance information will be consumed by humans to support use cases such as auditing and debugging. In this project, we investigate novel applications of provenance as a supporting technology. These applications range from low-level systems aspects, e.g., improving query processing performance and reducing resource usage, to high-level user-facing functionality, e.g., assessing the value of data based on provenance. We use relevance-based data management as an umbrella term for such techniques. An overarching theme of this research is the realization that large parts of the data in databases and data lakes are dark data, i.e., data that is not used or is not relevant for answering queries. We conjecture that by identifying which data is relevant for which tasks, we can significantly reduce the size of working sets (the data needed for a task), which in turn improves not just performance but also usability (by helping users identify what data to use).

Provenance-based Data Skipping (PBDS)

A technique applied widely in query processing and optimization is to statically analyze a query to determine upfront which subset of the input data is sufficient for answering it, and then to restrict query evaluation to this subset, exploiting physical design where possible to reduce I/O. For instance, this technique underlies how database optimizers use index structures and decide which fragments of a horizontally partitioned table have to be read. A major drawback of this approach is that it relies solely on static analysis of the query, which limits its applicability mainly to selection conditions that can be enforced over base tables. It is not effective for important classes of queries that are selective (only a small fraction of the input data is needed to compute the result) but for which it is not possible to determine statically what data is relevant. Queries that fit this characterization include selections over aggregation results (e.g., a HAVING clause) and top-k queries (e.g., return the 3 departments with the most employees). Instead of relying on static analysis of query structure alone, we propose to use provenance information to determine what data is relevant for which query.
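
To make the top-k case concrete, the following minimal Python sketch runs a "k departments with the most employees" query over a toy table and then computes which rows actually contributed to the answer. The table contents, the function names, and the simplified notion of provenance (all rows of the returned departments) are assumptions made for illustration and are not taken from the project.

```python
from collections import Counter

# Hypothetical toy table of (employee, department) rows.
EMPLOYEES = [
    ("ann", "sales"), ("bob", "sales"), ("cat", "sales"),
    ("dan", "hr"),
    ("eve", "eng"), ("fay", "eng"),
    ("gus", "legal"),
]

def top_k_departments(rows, k):
    """Return the k departments with the most employees (ties broken by name)."""
    counts = Counter(dept for _, dept in rows)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]

def provenance(rows, result):
    """Simplified provenance: the rows belonging to the returned departments."""
    winners = {dept for dept, _ in result}
    return [row for row in rows if row[1] in winners]

result = top_k_departments(EMPLOYEES, k=2)
relevant = provenance(EMPLOYEES, result)

# Only 5 of the 7 rows are relevant, and re-running the query over just
# those rows gives the same answer.
print(result, len(relevant), top_k_departments(relevant, k=2) == result)
```

Nothing in the query text says that the "hr" and "legal" rows are unnecessary. That only becomes known once the query has been evaluated, which is why static analysis alone cannot prune them.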

Initially, we will study relevance-based data skipping as one instantiation of this idea. Given a query, we will determine at runtime a subset of the data that is sufficient for answering the query. This information will then be exploited to speed up subsequent executions of the same or similar queries by skipping data that is not relevant for answering them. We propose to investigate lightweight capture mechanisms for coarse-grained provenance that will serve as concise descriptors of what data is relevant for a query. Furthermore, we will extend database systems to be able to skip data based on such “provenance sketches”.
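
The capture-then-skip cycle can be illustrated with a small Python sketch. It assumes a sketch is simply the set of fragments of a range partition that contain relevant rows. The table, the partition boundaries, and all function names are hypothetical. The example also assumes that partitioning on the group-by attribute makes the sketch safe to reuse, which is a simplification of the conditions a real system would need to check.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical table of (order_id, customer_id, amount) rows.
ORDERS = [
    (1, 10, 50), (2, 10, 70), (3, 120, 20), (4, 150, 30), (5, 230, 500),
    (6, 230, 10), (7, 310, 5), (8, 390, 40), (9, 45, 15),
]

# Assumed range partition on customer_id: fragment i starts at BOUNDS[i];
# the last fragment is open-ended.
BOUNDS = [0, 100, 200, 300, 400]

def fragment_of(customer_id):
    """Map a customer_id to the number of the fragment that contains it."""
    return bisect_right(BOUNDS, customer_id) - 1

def big_spenders(rows, threshold):
    """GROUP BY customer_id HAVING sum(amount) > threshold."""
    totals = defaultdict(int)
    for _, customer_id, amount in rows:
        totals[customer_id] += amount
    return {c: t for c, t in totals.items() if t > threshold}

def capture_sketch(rows, threshold):
    """Run the query once and record which fragments hold relevant rows."""
    result = big_spenders(rows, threshold)
    return result, {fragment_of(c) for c in result}

def run_with_sketch(rows, threshold, sketch):
    """Evaluate the query over only the fragments listed in the sketch."""
    kept = [row for row in rows if fragment_of(row[1]) in sketch]
    return big_spenders(kept, threshold), len(kept)

first_result, sketch = capture_sketch(ORDERS, threshold=100)
same_result, rows_read = run_with_sketch(ORDERS, 100, sketch)
stricter_result, _ = run_with_sketch(ORDERS, 200, sketch)
print(sorted(sketch), rows_read, same_result == first_result, stricter_result)
```

Here the sketch lets the second run touch 5 of 9 rows. The filter in `run_with_sketch` still scans the list, so it only models skipping. A database system would instead use its physical design to avoid reading the skipped fragments at all. Reusing the sketch for the stricter threshold works in this example because every customer that passes the stricter threshold also passes the original one. A lower threshold could need fragments the sketch does not cover.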

In the long run, we plan to integrate PBDS techniques with a wide range of query execution and query optimization methods such as caching, ILM, and self-tuning.

Assessing the Value of Data

We envision using relevance information to assess the value of data with respect to a given query, workload, or application. We will exploit this objective measure of data value to guide users in finding data relevant to a task. In particular, this functionality will be used to recommend data to users of a data lake in a context-sensitive manner. Furthermore, by requesting feedback from users about the recommendations, we can separate dark data into data that is relevant but whose relevance has not yet been recognized and data that is simply not useful.
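
The text above does not define a specific value measure. As one deliberately naive possibility, the Python sketch below scores each fragment of a dataset by how many queries in a workload needed it, and then uses user feedback to triage the fragments that no query touched. The representation of relevance as sets of fragment numbers, the workload, the feedback format, and all names are assumptions made for illustration.

```python
from collections import Counter

def fragment_usage(sketches, num_fragments):
    """Count, per fragment, how many queries in the workload needed it."""
    usage = Counter()
    for sketch in sketches:
        usage.update(sketch)
    return [usage[f] for f in range(num_fragments)]

def split_by_relevance(usage):
    """Separate fragments used by at least one query from unused (dark) ones."""
    relevant = [f for f, n in enumerate(usage) if n > 0]
    dark = [f for f, n in enumerate(usage) if n == 0]
    return relevant, dark

def triage_dark_fragments(dark, feedback):
    """Split dark fragments using hypothetical user feedback (True = useful)."""
    overlooked = [f for f in dark if feedback.get(f, False)]
    not_useful = [f for f in dark if not feedback.get(f, False)]
    return overlooked, not_useful

# Hypothetical workload: one set of relevant fragment numbers per query.
WORKLOAD_SKETCHES = [{0, 2}, {2}, {2, 3}]

usage = fragment_usage(WORKLOAD_SKETCHES, num_fragments=5)
relevant, dark = split_by_relevance(usage)
overlooked, not_useful = triage_dark_fragments(dark, feedback={1: True})
print(usage, relevant, dark, overlooked, not_useful)
```

In this toy workload, fragment 2 is needed by every query, fragments 1 and 4 are dark, and the feedback marks fragment 1 as relevant data that had not been recognized as such.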

Collaborators

Dieter Gawlick - Oracle
Danica Porobic - Oracle
Vasudha Krishnaswamy - Oracle
Zhen Hua Liu - Oracle

Funding

Oracle - Exploiting Provenance to Enhance Data Management (2018 - 2019), $95,248, PIs: Boris Glavic
Oracle - Self-tuning Database Operations by Assessing the Importance of Data (2019 - 2020), $97,507, PIs: Boris Glavic
Oracle - Exploiting Provenance to Enhance Data Management (2020 - 2021), $25,000 (cloud credits), PIs: Boris Glavic
Oracle - Self-tuning Database Operations by Assessing the Importance of Data (extension) (2021 - 2022), $98,077, PIs: Boris Glavic
Oracle - Exploiting Provenance to Enhance Data Management (extension) (2021 - 2022), $25,000 (cloud credits), PIs: Boris Glavic
Oracle - Self-tuning Database Operations by Assessing the Importance of Data (2nd extension) (2022 - 2023), $98,077, PIs: Boris Glavic
Oracle - Exploiting Provenance to Enhance Data Management (extension) (2022 - 2023), $34,000 (cloud credits), PIs: Boris Glavic

Publications

Self-tuning Database Operations by Assessing the Importance of Data
Boris Glavic, Pengyuan Li and Ziyu Liu
Technical Report #IIT/CS-DB-2023-01
Illinois Institute of Technology.

@techreport{GL23,
  author = {Glavic, Boris and Li, Pengyuan and Liu, Ziyu},
  title = {Self-tuning Database Operations by Assessing the Importance of Data},
  institution = {Illinois Institute of Technology},
  year = {2023},
  number = {IIT/CS-DB-2023-01},
  pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/GL23.pdf},
  projects = {Relevance-based Data Management},
  keywords = {Provenance, Relevance-based Data Management},
  venueshort = {Techreport}
}

Oracle PBDS Experiments
Boris Glavic, Xing Niu, Pengyuan Li and Ziyu Liu
Technical Report #IIT/CS-DB-2022-01
Illinois Institute of Technology.

@techreport{GN22,
  author = {Glavic, Boris and Niu, Xing and Li, Pengyuan and Liu, Ziyu},
  title = {Oracle PBDS Experiments},
  institution = {Illinois Institute of Technology},
  year = {2022},
  number = {IIT/CS-DB-2022-01},
  pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/GN22.pdf},
  projects = {Relevance-based Data Management},
  keywords = {Provenance, Relevance-based Data Management},
  venueshort = {Techreport}
}

Provenance-based Data Skipping
Xing Niu, Ziyu Liu, Pengyuan Li, Boris Glavic, Dieter Gawlick, Vasudha Krishnaswamy, Zhen Hua Liu and Danica Porobic
Proceedings of the VLDB Endowment. 15, 3 (2021), 451–464.
```
@article{NL21,
  author = {Niu, Xing and Liu, Ziyu and Li, Pengyuan and Glavic, Boris and Gawlick, Dieter and Krishnaswamy, Vasudha and Liu, Zhen Hua and Porobic, Danica},
  keywords = {Provenance, Data Skipping, Relevance-based Data Management},
  title = {Provenance-based Data Skipping},
  journal = {Proceedings of the VLDB Endowment},
  projects = {Relevance-based Data Management},
  pages = {451--464},
  volume = {15},
  issue = {3},
  year = {2021},
  doi = {10.14778/3494124.3494130},
  venueshort = {{PVLDB}},
  pdfurl = {https://vldb.org/pvldb/vol15/p451-niu.pdf},
  longversionurl = {https://arxiv.org/pdf/2104.12815}
}
```
Database systems use static analysis to determine upfront which data is needed for answering a query and use indexes and other physical design techniques to speed up access to that data. However, for important classes of queries, e.g., HAVING and top-k queries, it is impossible to determine upfront what data is relevant. To overcome this limitation, we develop provenance-based data skipping (PBDS), a novel approach that generates provenance sketches to concisely encode what data is relevant for a query. Once a provenance sketch has been captured, it is used to speed up subsequent queries. PBDS can exploit physical design artifacts such as indexes and zone maps.
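
One way a captured sketch could be handed to a database engine is as an extra range predicate, so that an existing index on the partitioning attribute can do the skipping. The Python example below illustrates this with the standard-library `sqlite3` module. It is a sketch under stated assumptions rather than the rewriting used in the paper: the schema, the fragment boundaries, the helper `sketch_to_predicate`, and the sketch `{0, 2}` (taken to have been captured by an earlier run of the same query) are all hypothetical.

```python
import sqlite3

# Assumed range partition: fragment i covers [BOUNDS[i], BOUNDS[i + 1]),
# giving four fragments numbered 0 to 3.
BOUNDS = [0, 100, 200, 300, 400]

def sketch_to_predicate(column, sketch):
    """Turn a set of fragment numbers into a disjunction of range conditions.

    Illustrative only: the column name is interpolated directly, so it must
    come from trusted code, not from user input.
    """
    ranges = [
        f"({column} >= {BOUNDS[i]} AND {column} < {BOUNDS[i + 1]})"
        for i in sorted(sketch)
    ]
    return " OR ".join(ranges) if ranges else "0"

QUERY = """
SELECT customer_id, SUM(amount) AS total
FROM orders
{where}
GROUP BY customer_id
HAVING SUM(amount) > 100
ORDER BY customer_id
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount INTEGER)"
)
conn.execute("CREATE INDEX orders_cust ON orders (customer_id)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 50), (2, 10, 70), (3, 120, 20), (4, 150, 30), (5, 230, 500),
     (6, 230, 10), (7, 310, 5), (8, 390, 40), (9, 45, 15)],
)

# Baseline: evaluate the query over the whole table.
full = conn.execute(QUERY.format(where="")).fetchall()

# Same query restricted to the fragments named in the assumed sketch.
predicate = sketch_to_predicate("customer_id", {0, 2})
skipped = conn.execute(QUERY.format(where="WHERE " + predicate)).fetchall()

print(predicate)
print(full == skipped)
```

Because the added conditions are plain range predicates on an indexed column, the engine is free to answer them with the index instead of scanning the full table. Whether it actually does so depends on its optimizer.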

Integrating Provenance Management and Query Optimization
Xing Niu
Illinois Institute of Technology.

@phdthesis{N21,
  venueshort = {PhD Thesis},
  author = {Niu, Xing},
  keywords = {Provenance, Cost-based optimization, Data Skipping, Relevance-based Data Management},
  month = dec,
  pdfurl = {https://media.proquest.com/media/hms/PFT/2/W8M8M?cit%3Aauth=Niu%2C+Xing&cit%3Atitle=Integrating+Provenance+Management+and+Query+Optimization&cit%3Apub=ProQuest+Dissertations+and+Theses&cit%3Avol=&cit%3Aiss=&cit%3Apg=&cit%3Adate=2021&ic=true&cit%3Aprod=ProQuest&_a=ChgyMDIyMDgzMDIwMjQxNzgzNDoxOTYzOTYSBTg0NTY5GgpPTkVfU0VBUkNIIg0yNC4xMzYuMjQuMjIyKgUxODc1MDIKMjYyMjkzMDc1MDoNRG9jdW1lbnRJbWFnZUIBMFIGT25saW5lWgJGVGIDUEZUagoyMDIxLzAxLzAxcgoyMDIxLzEyLzMxegCCASlQLTEwMDg3NTItMjgzNzctQ1VTVE9NRVItMTAwMDAyMDUtNDM1NTkyMpIBBk9ubGluZcoBVE1vemlsbGEvNS4wIChNYWNpbnRvc2g7IEludGVsIE1hYyBPUyBYIDEwLjE1OyBydjoxMDMuMCkgR2Vja28vMjAxMDAxMDEgRmlyZWZveC8xMDMuMNIBFkRpc3NlcnRhdGlvbnMgJiBUaGVzZXOaAgdQcmVQYWlkqgIrT1M6RU1TLU1lZGlhTGlua3NTZXJ2aWNlLWdldE1lZGlhVXJsRm9ySXRlbcoCE0Rpc3NlcnRhdGlvbi9UaGVzaXPSAgFZ8gIA%2BgIBToIDA1dlYooDHENJRDoyMDIyMDgzMDIwMjQxNzgzNDo3OTIwNDY%3D&_s=oFyeSaHKMuftg4s4wmEJRSOs%2B4U%3D},
  projects = {GProM; Relevance-based Data Management},
  school = {Illinois Institute of Technology},
  title = {Integrating Provenance Management and Query Optimization},
  year = {2021}
}
