Managing fine-grained provenance is a critical requirement for data stream management systems (DSMS), not only to support complex applications that require diagnostic capabilities and assurance, but also to provide advanced functionality such as revision processing or query debugging. Ariadne is a novel approach that uses operator instrumentation, i.e., modifying the behavior of operators, to generate and propagate fine-grained stream provenance. In addition to applying this technique to compute provenance eagerly during query execution, we also investigated how to decouple provenance computation from query processing to reduce runtime overhead and avoid unnecessary provenance retrieval. This includes computing a concise superset of the provenance, which allows a query network to be replayed lazily to reconstruct its provenance, as well as lazy retrieval to avoid unnecessary reconstruction of provenance. We have developed stream-specific compression methods to reduce the computational and storage overhead of provenance generation and retrieval. Ariadne is implemented as an extension of the Borealis DSMS.
Features
On-demand provenance generation for streaming queries
Instrument parts of the query network to propagate provenance
Temporary preservation of inputs and reconstruction for provenance retrieval
Eager generation of provenance
Lazy generation of provenance through replay
Compression techniques for provenance
Operator Instrumentation
The main approach that Ariadne uses to generate provenance for a streaming query network is called operator instrumentation. The key idea is to extend each operator implementation so that the operator can annotate its output with provenance information derived from the provenance annotations on its inputs. We are not aware of any existing system that implements such an approach. Under operator instrumentation, provenance annotations are processed in line with the regular data. That is, the structure of the original query network is kept as is; operators are simply replaced with their instrumented counterparts. Because the execution of the original query network itself is traced, most issues caused by non-determinism are handled naturally. Furthermore, provenance for a subnetwork can be traced by instrumenting only the operators in that subnetwork. The only drawback of operator instrumentation is the need to extend all operators. However, this extension can be implemented with reasonable effort.
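The idea of operators that annotate their outputs based on the annotations of their inputs can be illustrated with a minimal Python sketch. It is not Ariadne's or Borealis's actual code: the names `StreamTuple`, `source`, `filter_op`, and `window_sum_op` are hypothetical, and representing provenance as a set of source tuple identifiers is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, Iterator, List


@dataclass(frozen=True)
class StreamTuple:
    tid: int                             # identifier within its own stream (hypothetical)
    value: float                         # regular payload
    prov: FrozenSet[int] = frozenset()   # ids of the source tuples this tuple derives from


def source(values: Iterable[float]) -> List[StreamTuple]:
    """Wrap raw values as source tuples; each source tuple is its own provenance."""
    return [StreamTuple(tid=i, value=v, prov=frozenset({i})) for i, v in enumerate(values)]


def filter_op(pred: Callable[[float], bool], stream: Iterable[StreamTuple]) -> Iterator[StreamTuple]:
    """Instrumented filter: a passing tuple keeps the annotation of its input."""
    for t in stream:
        if pred(t.value):
            yield t


def window_sum_op(size: int, stream: Iterable[StreamTuple]) -> Iterator[StreamTuple]:
    """Instrumented tumbling-window sum: the output annotation is the union
    of the annotations of all tuples in the window."""
    window: List[StreamTuple] = []
    out_id = 0
    for t in stream:
        window.append(t)
        if len(window) == size:
            prov = frozenset().union(*(w.prov for w in window))
            yield StreamTuple(tid=out_id, value=sum(w.value for w in window), prov=prov)
            out_id += 1
            window = []


# The network keeps its shape: filter -> window sum, with annotations flowing in line.
inputs = source([5.0, 1.0, 7.0, 2.0, 9.0, 8.0])
results = list(window_sum_op(2, filter_op(lambda v: v > 1.5, inputs)))
for r in results:
    print(r.value, sorted(r.prov))
```

In this sketch the annotations travel with the data through the same network, so no separate provenance query is needed. The last input stays in an incomplete window and therefore produces no output.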
Lazy and Eager Provenance Generation and Retrieval
With operator instrumentation, provenance can be generated either eagerly during query execution (our default approach) or lazily upon request. We support both types of generation because their performance characteristics in terms of storage, runtime, and retrieval overhead differ. This enables the user to trade runtime overhead on the original query network for storage cost and runtime overhead when retrieving provenance.
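The trade-off between the two modes can be sketched as follows. This is a simplified, hypothetical illustration rather than Ariadne's implementation: `run_query`, `EagerRunner`, and `LazyRunner` are invented names, the query is processed as a batch, the query is assumed to be deterministic so that replay reproduces the same outputs, and the lazy variant preserves all inputs instead of a concise superset of the provenance.

```python
from typing import FrozenSet, List, Optional, Sequence, Tuple

THRESHOLD = 1.5   # filter predicate: keep values above this (illustrative)
WINDOW = 2        # tumbling window size (illustrative)

Result = Tuple[float, Optional[FrozenSet[int]]]


def run_query(values: Sequence[float], instrumented: bool) -> List[Result]:
    """Filter values > THRESHOLD, then sum tumbling windows of WINDOW tuples.
    When instrumented, each output also carries the input positions it derives from."""
    out: List[Result] = []
    window: List[Tuple[int, float]] = []
    for pos, v in enumerate(values):
        if v > THRESHOLD:
            window.append((pos, v))
            if len(window) == WINDOW:
                total = sum(x for _, x in window)
                prov = frozenset(p for p, _ in window) if instrumented else None
                out.append((total, prov))
                window = []
    return out


class EagerRunner:
    """Pays the instrumentation overhead at runtime; retrieval is a lookup."""

    def __init__(self) -> None:
        self.results: List[Result] = []

    def process(self, values: Sequence[float]) -> None:
        self.results = run_query(values, instrumented=True)

    def provenance(self, i: int) -> Optional[FrozenSet[int]]:
        return self.results[i][1]


class LazyRunner:
    """Runs the plain network and preserves inputs; provenance is
    reconstructed by replaying the instrumented network on request."""

    def __init__(self) -> None:
        self.preserved: List[float] = []
        self.results: List[Result] = []

    def process(self, values: Sequence[float]) -> None:
        self.preserved = list(values)   # storage cost instead of runtime overhead
        self.results = run_query(values, instrumented=False)

    def provenance(self, i: int) -> Optional[FrozenSet[int]]:
        replayed = run_query(self.preserved, instrumented=True)
        return replayed[i][1]


data = [5.0, 1.0, 7.0, 2.0, 9.0, 8.0]
eager, lazy = EagerRunner(), LazyRunner()
eager.process(data)
lazy.process(data)
print(sorted(eager.provenance(0)), sorted(lazy.provenance(0)))
```

Both runners report the same provenance for a given output. The eager one does the annotation work for every tuple, while the lazy one does it only when provenance is actually requested, at the price of keeping inputs around and replaying.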