## IIT Database Group

Managing fine-grained provenance is a critical requirement for data stream management systems (DSMS), not only to address complex applications that require diagnostic capabilities and assurance, but also for providing advanced functionality such as revision processing or query debugging. Ariadne is a novel approach that uses operator instrumentation, i.e., modifying the behavior of operators, to generate and propagate fine-grained stream provenance. In addition to applying this technique to compute provenance eagerly during query execution, we also investigated how to decouple provenance computation from query processing to reduce runtime overhead and avoid unnecessary provenance retrieval. This includes computing a concise superset of the provenance to allow lazily replaying a query network and reconstruct its provenance as well as lazy retrieval to avoid unnecessary reconstruction of provenance. We have developed stream-specific compression methods to reduce the computational and storage overhead of provenance generation and retrieval. Ariadne is implemented as an extension of the Borealis DSMS.

### Features

• On-demand provenance generation for streaming queries
• Instrument parts of the query network to propagate provenance
• Temporary preservation of inputs and reconstruction for provenance retrieval
• Eager generation of provenance
• Lazy generation of provenance through replay
• Compression techniques for provenance

### Operator Instrumentation

We call the main approach that Ariadne uses to generate provenance for a streaming query network operator instrumentation. The key idea behind our operator instrumentation approach is to extend each operator implementation so that the operator is able to annotate its output with provenance information based on provenance annotations on its inputs. So far, we are not aware of any system that actually implements such an approach. Under operator instrumentation, provenance annotations are processed in line with the regular data. That is, the structure of the original query network is kept as is (operators are simply replaced with their instrumented counterparts). Thus, most issues caused by non-determinism are dealt with in a rather natural way, since the execution of the original query network is traced. Furthermore, provenance for a subnetwork can be traced by instrumenting only operators in that subnetwork. The only drawback of operator instrumentation is the need to extend all operators. However, this extension can be implemented with reasonable effort.

### Lazy and Eager Provenance Generation and Retrieval

With operator instrumentation, provenance can be generated either eagerly during query execution (our default approach) or lazily upon request. We support both types of generation, because their performance characteristics in terms of storage, runtime, and retrieval overhead are different. This enables the user to trade runtime-overhead on the original query network for storage cost and runtime-overhead when retrieving provenance.

The figure above shows an overview of the reduced-eager operator instrumentation approach. We temporarily store the input tuples for the instrumented parts of the network (e.g., for input streams $$S_1$$ and $$S_2$$, since we want provenance for the entire query network). The tuples in the output stream of the instrumented network carry the provenance annotations as described above. Recall that a provenance annotation is the set of identifiers for the tuples in the provenance of an output. Provenance is reconstructed for retrieval from the annotations using a new operator called p-join ($$\ltimes$$). For each output tuple $$t$$, this operator retrieves all input tuples in the provenance using the set of identifiers from the provenance annotation and outputs all combinations of $$t$$ with a tuple from its provenance. Each of these combinations is outputted as a single tuple to stream $$P$$.

We call this approach Reduced-Eager, because we are eagerly propagating a reduced form of provenance (the tuple identifier sets) during query execution and lazily reconstructing provenance independent of the execution of the original network. In comparison with using sets of full tuples as annotations, this approach pays a price for storing and reconstructing tuples. However, because compressed representations can be used, this cost is offset by a significant reduction in provenance generation cost (in terms of both runtime and latency). Since reconstruction is separate from generation, we can often avoid reconstructing complete provenance tuples during provenance retrieval, e.g., if the user only requests provenance for some results (query over provenance).

### Publications

1. Efficient Stream Provenance via Operator Instrumentation
Boris Glavic, Kyumars Sheykh Esmaili, Peter M. Fischer and Nesime Tatbul
Transactions on Internet Technology. 13, 1 (2014) , 7:1–7:26.
details
2. Ariadne: Managing Fine-Grained Provenance on Data Streams
Boris Glavic, Kyumars Sheykh Esmaili, Peter M. Fischer and Nesime Tatbul
Proceedings of the 7th ACM International Conference on Distributed Event-Based Systems (2013), pp. 291–320.
details
3. Ariadne: Managing Fine-Grained Provenance on Data Streams
Boris Glavic, Kyumars Sheykh Esmaili, Peter M. Fischer and Nesime Tatbul
Technical Report #771
ETH Zürich.
details
4. The Case for Fine-Grained Stream Provenance
Boris Glavic, Kyumars Sheykh Esmaili, Peter M. Fischer and Nesime Tatbul
Proceedings of the 1st Workshop on Data Streams and Event Processing collocated with BTW (2011), pp. 58–61.
details