Ariadne
Managing fine-grained provenance is a critical requirement for data stream management systems (DSMS), not only to address complex applications that require diagnostic capabilities and assurance, but also for providing advanced functionality such as revision processing or query debugging. Ariadne is a novel approach that uses operator instrumentation, i.e., modifying the behavior of operators, to generate and propagate fine-grained stream provenance. In addition to applying this technique to compute provenance eagerly during query execution, we also investigated how to decouple provenance computation from query processing to reduce runtime overhead and avoid unnecessary provenance retrieval. This includes computing a concise superset of the provenance to allow lazily replaying a query network and reconstruct its provenance as well as lazy retrieval to avoid unnecessary reconstruction of provenance. We have developed stream-specific compression methods to reduce the computational and storage overhead of provenance generation and retrieval. Ariadne is implemented as an extension of the Borealis DSMS.
Features
- On-demand provenance generation for streaming queries
- Instrument parts of the query network to propagate provenance
- Temporary preservation of inputs and reconstruction for provenance retrieval
- Eager generation of provenance
- Lazy generation of provenance through replay
- Compression techniques for provenance
Operator Instrumentation
We call the main approach that Ariadne uses to generate provenance for a streaming query network operator instrumentation. The key idea behind our operator instrumentation approach is to extend each operator implementation so that the operator is able to annotate its output with provenance information based on provenance annotations on its inputs. So far, we are not aware of any system that actually implements such an approach. Under operator instrumentation, provenance annotations are processed in line with the regular data. That is, the structure of the original query network is kept as is (operators are simply replaced with their instrumented counterparts). Thus, most issues caused by non-determinism are dealt with in a rather natural way, since the execution of the original query network is traced. Furthermore, provenance for a subnetwork can be traced by instrumenting only operators in that subnetwork. The only drawback of operator instrumentation is the need to extend all operators. However, this extension can be implemented with reasonable effort.
Lazy and Eager Provenance Generation and Retrieval
With operator instrumentation, provenance can be generated either eagerly during query execution (our default approach) or lazily upon request. We support both types of generation, because their performance characteristics in terms of storage, runtime, and retrieval overhead are different. This enables the user to trade runtime-overhead on the original query network for storage cost and runtime-overhead when retrieving provenance.
The figure above shows an overview of the reduced-eager operator instrumentation approach. We temporarily store the input tuples for the instrumented parts of the network (e.g., for input streams \(S_1\) and \(S_2\), since we want provenance for the entire query network). The tuples in the output stream of the instrumented network carry the provenance annotations as described above. Recall that a provenance annotation is the set of identifiers for the tuples in the provenance of an output. Provenance is reconstructed for retrieval from the annotations using a new operator called p-join (\(\ltimes\)). For each output tuple \(t\), this operator retrieves all input tuples in the provenance using the set of identifiers from the provenance annotation and outputs all combinations of \(t\) with a tuple from its provenance. Each of these combinations is outputted as a single tuple to stream \(P\).
We call this approach Reduced-Eager, because we are eagerly propagating a reduced form of provenance (the tuple identifier sets) during query execution and lazily reconstructing provenance independent of the execution of the original network. In comparison with using sets of full tuples as annotations, this approach pays a price for storing and reconstructing tuples. However, because compressed representations can be used, this cost is offset by a significant reduction in provenance generation cost (in terms of both runtime and latency). Since reconstruction is separate from generation, we can often avoid reconstructing complete provenance tuples during provenance retrieval, e.g., if the user only requests provenance for some results (query over provenance).
Collaborators
- Kyumars Sheyk Esmaili
- Nesime Tatbul - Intel Labs
- Peter M. Fischer - University of Augsburg
Publications
-
Efficient Stream Provenance via Operator Instrumentation
Boris Glavic, Kyumars Sheykh Esmaili, Peter M. Fischer and Nesime Tatbul
Transactions on Internet Technology. 13, 1 (2014) , 7:1–7:26.@article{GE14, author = {Glavic, Boris and Esmaili, Kyumars Sheykh and Fischer, Peter M. and Tatbul, Nesime}, date-added = {2014-05-11 17:49:19 +0000}, date-modified = {2014-05-11 17:55:40 +0000}, journal = {Transactions on Internet Technology}, keywords = {Ariadne; Provenance}, number = {1}, pages = {7:1-7:26}, pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/GE14.pdf}, projects = {Ariadne}, title = {Efficient Stream Provenance via Operator Instrumentation}, venueshort = {TOIT}, volume = {13}, year = {2014}, bdsk-url-1 = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/GE14.pdf} }
Managing fine-grained provenance is a critical requirement for data stream management systems (DSMS), not only to address complex applications that require diagnostic capabilities and assurance, but also for providing advanced functionality such as revision processing or query debugging. This paper introduces a novel approach that uses operator instrumentation, i.e., modifying the behavior of operators, to generate and propagate fine-grained provenance through several operators of a query network. In addition to applying this technique to compute provenance eagerly during query execution, we also study how to decouple provenance computation from query processing to reduce run-time overhead and avoid unnecessary provenance retrieval. Our proposals include computing a concise superset of the provenance (to allow lazily replaying a query and reconstruct its provenance) as well as lazy retrieval (to avoid unnecessary reconstruction of provenance). We develop stream-specific compression methods to reduce the computational and storage overhead of provenance generation and retrieval. Ariadne, our provenance-aware extension of the Borealis DSMS implements these techniques. Our experiments confirm that Ariadne manages provenance with minor overhead and clearly outperforms query rewrite, the current state-of-the-art.
-
Ariadne: Managing Fine-Grained Provenance on Data Streams
Boris Glavic, Kyumars Sheykh Esmaili, Peter M. Fischer and Nesime Tatbul
Proceedings of the 7th ACM International Conference on Distributed Event-Based Systems (2013), pp. 291–320.@inproceedings{GE13, author = {Glavic, Boris and Esmaili, Kyumars Sheykh and Fischer, Peter M. and Tatbul, Nesime}, booktitle = {Proceedings of the 7th ACM International Conference on Distributed Event-Based Systems}, date-added = {2013-05-13 14:03:56 +0000}, date-modified = {2013-06-02 19:45:50 +0000}, keywords = {Ariadne; Provenance}, pages = {291-320}, pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/GE13.pdf}, projects = {Ariadne}, slideurl = {http://www.slideshare.net/lordPretzel/2013-debs-Ariadne}, title = {Ariadne: Managing Fine-Grained Provenance on Data Streams}, venueshort = {DEBS}, year = {2013}, bdsk-url-1 = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/GE13.pdf} }
Managing fine-grained provenance is a critical requirement for data stream management systems (DSMS), not only to address complex applications that require diagnostic capabilities and assurance, but also for providing advanced functionality such as revision processing or query debugging. This paper introduces a novel approach that uses operator instrumentation, i.e., modifying the behavior of operators, to generate and propagate fine-grained provenance through several operators of a query network. In addition to applying this technique to compute provenance eagerly during query execution, we also study how to decouple provenance computation from query processing to reduce run-time overhead and avoid unnecessary provenance retrieval. This includes computing a concise superset of the provenance to allow lazily replaying a query network and reconstruct its provenance as well as lazy retrieval to avoid unnecessary reconstruction of provenance. We develop stream-specific compression methods to reduce the computational and storage overhead of provenance generation and retrieval. Ariadne, our provenance-aware extension of the Borealis DSMS implements these techniques. Our experiments confirm that Ariadne manages provenance with minor overhead and clearly outperforms query rewrite, the current state-of-the-art.
-
Ariadne: Managing Fine-Grained Provenance on Data Streams
Boris Glavic, Kyumars Sheykh Esmaili, Peter M. Fischer and Nesime Tatbul
Technical Report #771
ETH Zürich.@techreport{GE12, author = {Glavic, Boris and Esmaili, Kyumars Sheykh and Fischer, Peter M. and Tatbul, Nesime}, date-added = {2012-12-14 18:55:49 +0000}, date-modified = {2012-12-18 17:16:08 +0000}, institution = {ETH Z{\"u}rich}, keywords = {Ariadne; Provenance}, number = {771}, pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/GE12.pdf}, projects = {Ariadne}, title = {Ariadne: Managing Fine-Grained Provenance on Data Streams}, venueshort = {Techreport}, year = {2012}, bdsk-url-1 = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/GE12.pdf} }
-
The Case for Fine-Grained Stream Provenance
Boris Glavic, Kyumars Sheykh Esmaili, Peter M. Fischer and Nesime Tatbul
Proceedings of the 1st Workshop on Data Streams and Event Processing collocated with BTW (2011), pp. 58–61.@inproceedings{GE11, author = {Glavic, Boris and Esmaili, Kyumars Sheykh and Fischer, Peter M. and Tatbul, Nesime}, bibsource = {DBLP, http://dblp.uni-trier.de}, booktitle = {Proceedings of the 1st Workshop on Data Streams and Event Processing collocated with BTW}, crossref = {DBLP:conf/btw/2011w}, date-added = {2012-12-14 18:55:49 +0000}, date-modified = {2012-12-18 17:16:17 +0000}, isworkshop = {true}, keywords = {Ariadne; Provenance}, pages = {58-61}, pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/GE11.pdf}, projects = {Ariadne}, title = {{The Case for Fine-Grained Stream Provenance}}, venueshort = {DSEP}, year = {2011}, bdsk-url-1 = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/GE11.pdf} }
The current state of the art for provenance in data stream management systems (DSMS) is to provide provenance at a high level of abstraction (such as, from which sensors in a sensor network an aggregated value is derived from). This limitation was imposed by high-throughput requirements and an anticipated lack of application demand for more detailed provenance information. In this work, we first demonstrate by means of well-chosen use cases that this is a misconception, i.e., coarse-grained provenance is in fact insufficient for many application domains. We then analyze the requirements and challenges involved in integrating support for fine-grained provenance into a streaming system and outline a scalable solution for supporting tuple-level provenance in DSMS.