## Relevance-based Data Management

Most provenance research has operated under the assumption that provenance information will be consumed by humans to support use cases such as auditing and debugging. In this project, we investigate novel applications of provenance as a supporting technology. These new applications range from low-level systems aspects, e.g., improving the performance of query processing and reducing resource usage, to high-level user-facing functionality, e.g., assessing the value of data based on provenance. We use relevance-based data management as an umbrella term for such techniques. An overarching theme in this research is the realization that large parts of the data in databases and data lakes are dark data, i.e., data that is not used or not relevant for answering queries. We conjecture that by identifying what data is relevant for what tasks, we can significantly reduce the size of working sets (the data needed for a task), which in turn leads not just to improved performance but also to improved usability (helping users identify what data to use).

## Relevance-based Query Processing

A general technique applied widely in query processing and optimization is to statically analyze a query to determine up front which subset of the input data is sufficient for answering it, and then to restrict query evaluation to this subset, exploiting physical design where possible to reduce I/O. For instance, this technique underlies how database optimizers use index structures and decide which fragments of a horizontally partitioned table have to be read. A major drawback of this approach is that it relies solely on a static analysis of the query, which limits its applicability mainly to selection conditions that can be enforced over base tables. It is not effective for important classes of queries that are selective (only a small fraction of the input data is needed to compute the result) but for which it is not possible to determine statically what data is relevant. Queries that fit this characterization include selections over aggregation results (e.g., a HAVING clause) and top-k queries (e.g., return the 3 departments with the most employees). Instead of relying on static analysis of query structure alone, we propose to use provenance information to determine what data is relevant for which query.
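To make the top-k case concrete, the following is a minimal sketch in Python rather than a description of any actual system. The `EMPLOYEES` table and the helper names `top_k_departments` and `relevant_rows` are hypothetical and exist only for illustration. It shows that the rows needed for the answer are known only after the query has been evaluated, and that re-running the query over just those rows yields the same result.

```python
from collections import Counter

# Hypothetical employee table: (employee name, department).
EMPLOYEES = [
    ("ann", "sales"), ("bob", "sales"), ("cat", "sales"),
    ("dan", "hr"), ("eve", "hr"),
    ("fay", "it"), ("gus", "it"), ("hal", "it"), ("ian", "it"),
    ("jon", "legal"),
    ("kim", "ops"),
]


def top_k_departments(rows, k):
    """Return the k departments with the most employees as (dept, count) pairs."""
    counts = Counter(dept for _, dept in rows)
    # Sort by descending count, then by name so that ties are broken deterministically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


def relevant_rows(rows, k):
    """Return the input rows that contribute to the top-k result.

    Which departments win depends on the data, so this set cannot be
    derived from the query text alone.
    """
    winners = {dept for dept, _ in top_k_departments(rows, k)}
    return [row for row in rows if row[1] in winners]


result = top_k_departments(EMPLOYEES, 3)
needed = relevant_rows(EMPLOYEES, 3)

print(result)
print(f"{len(needed)} of {len(EMPLOYEES)} rows are relevant")
# Evaluating the query over only the relevant rows gives the same answer.
print(top_k_departments(needed, 3) == result)
```

No selection condition on the `EMPLOYEES` rows themselves identifies the relevant subset here, which is why static analysis alone does not help for this class of queries.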

Initially, we will study relevance-based data skipping as one instantiation of this idea. Given a query, we will determine at runtime a subset of the data that is sufficient for answering the query. This information will then be exploited to speed up subsequent executions of the same or similar queries by skipping data that is not relevant for answering them. We propose to investigate lightweight capture mechanisms for coarse-grained provenance, which will serve as concise descriptors of what data is relevant for a query. Furthermore, we will extend database systems to be able to skip data based on such “provenance sketches”.
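One possible reading of this idea can be sketched in Python. Everything in the sketch is an assumption made for illustration: the `ORDERS` table, the range partition given by `BOUNDS`, and the functions `capture_sketch` and `run_with_sketch` are hypothetical, and a sketch is modeled simply as the set of fragment numbers that contain relevant rows. The example query keeps customers whose total order amount exceeds a threshold, i.e., a selection over an aggregation result.

```python
from bisect import bisect_right

# Hypothetical range partition on customer_id:
# fragment i holds ids in the interval [BOUNDS[i], BOUNDS[i + 1]).
BOUNDS = [0, 100, 200, 300, 400]

# Hypothetical orders table: (customer_id, amount).
ORDERS = [
    (12, 50), (12, 70), (130, 20), (150, 10),
    (250, 400), (260, 5), (310, 15), (390, 30),
]


def fragment_of(customer_id):
    """Map a customer id to the number of the fragment that stores it."""
    return bisect_right(BOUNDS, customer_id) - 1


def big_spenders(rows, threshold):
    """Customers whose summed amount exceeds threshold (a HAVING-style query)."""
    totals = {}
    for customer_id, amount in rows:
        totals[customer_id] = totals.get(customer_id, 0) + amount
    return {cid: total for cid, total in totals.items() if total > threshold}


def capture_sketch(rows, threshold):
    """Run the query once and record which fragments hold relevant rows."""
    result = big_spenders(rows, threshold)
    return frozenset(fragment_of(cid) for cid in result)


def run_with_sketch(rows, threshold, sketch):
    """Re-run the query, skipping every fragment that is not in the sketch.

    A real system would skip fragments through its physical design
    (partitions, indexes) instead of filtering a scan as done here.
    """
    kept = [row for row in rows if fragment_of(row[0]) in sketch]
    return big_spenders(kept, threshold), len(kept)


sketch = capture_sketch(ORDERS, 100)
answer, rows_read = run_with_sketch(ORDERS, 100, sketch)
print(sorted(sketch), answer, rows_read)
```

The sketch is safe to reuse in this example because the partition is on the grouping attribute, so all rows of a customer fall into the same fragment. It is also coarse-grained: the row for customer 260 is read although it is not relevant, which is the price of a descriptor that is only a few fragment numbers in size.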

In the long run, we plan to integrate relevance-based techniques with a wide range of query execution and query optimization methods such as caching, information lifecycle management (ILM), and self-tuning.

## Assessing the Value of Data

We envision using relevance information to assess the value of data with respect to a given query, workload, or application. We will exploit this objective measure of data value to guide users in finding data relevant to a task. In particular, this functionality will be used to recommend data to users of a data lake in a context-sensitive manner. Furthermore, by requesting feedback from users about the recommendations, we can separate dark data into data that is relevant but whose relevance has not yet been recognized and data that is simply not useful.
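As a purely illustrative sketch, and not the measure the project defines, one simple way to turn relevance information into a value score is to count how often a piece of data is relevant across a workload. The code below assumes that each query in the workload comes with the set of data fragments relevant to it. The function `fragment_values`, the `WORKLOAD_SKETCHES` data, and the scoring rule itself are hypothetical.

```python
from collections import Counter


def fragment_values(sketches, num_fragments):
    """Score each fragment by the fraction of workload queries it is relevant for.

    This scoring rule is an assumption made for illustration only.
    """
    counts = Counter(f for sketch in sketches for f in sketch)
    total = len(sketches)
    return [counts[f] / total if total else 0.0 for f in range(num_fragments)]


# Hypothetical workload: one set of relevant fragment numbers per query.
WORKLOAD_SKETCHES = [{0, 2}, {2}, {0, 2, 3}, {2, 3}]

values = fragment_values(WORKLOAD_SKETCHES, 5)
# Fragments that no query needed are candidates for dark data.
dark_candidates = [f for f, value in enumerate(values) if value == 0.0]

print(values)
print(dark_candidates)
```

Under this rule, a fragment with score zero is only a candidate for dark data. Whether it is relevant but unrecognized or simply not useful is the distinction that user feedback on recommendations would have to settle.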

### Collaborators

• Dieter Gawlick - Oracle
• Danica Porobic - Oracle
• Kenny Gross - Oracle
• Vasudha Krishnaswamy - Oracle
• Zhen Hua Liu - Oracle

### Funding

• Oracle - Exploiting Provenance to Enhance Data Management (2018 - 2019), $95,248, PIs: Boris Glavic
• Oracle - Self-tuning Database Operations by Assessing the Importance of Data (2019 - 2020), $97,507, PIs: Boris Glavic