Scribe Notes on Evaluating Window Joins over Unbounded Streams.

by Joe Prokop (prokjos@iit)

This paper studies how best to handle several situations that arise when joining streams of data. First, it looks at which join algorithm to use and when. Traditional selection methods estimate the time to completion for each algorithm and pick the shortest, but with streaming data the input never ends, so the time to completion is infinite for every algorithm. Next, the authors study how to join streams with different data rates and how each algorithm is affected by the difference. Finally, they look at what to do when the system does not have enough computing or memory resources to keep up with the incoming streams.

Before comparing join algorithms, the authors had to create a new way to measure their cost. The traditional method of estimating the total time to perform a join doesn't work when the inputs are unbounded streams. The authors' solution is to measure the cost of joins with a unit-time-basis model, that is, as cost per unit of time. To do this they break the process of joining streams into several parts. For each data stream, three distinct things need to happen whenever a new tuple arrives. First, the window of tuples from the other data stream must be scanned for tuples to join with. Second, the tuple created by the join must be inserted in the resulting data stream. Finally, tuples in the window must be invalidated if they no longer belong in the window because of age. Performed for just one of the two streams, this process is called a one-way join; a full join of two streams consists of two one-way joins, one in each direction.
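
The three steps of a one-way join can be sketched roughly as follows. This is my own minimal illustration, not code from the paper: the names Window and one_way_join, the (timestamp, key, payload) tuple layout, and the time-based window are all assumptions. The sketch also stores the arriving tuple in its own stream's window so that the opposite one-way join can find it later, and it runs invalidation before the scan so that expired tuples cannot match; both are choices of this sketch.

```python
from collections import deque


class Window:
    """Time-based sliding window of (timestamp, key, payload) tuples.

    Hypothetical helper for illustration; tuples are assumed to arrive
    in non-decreasing timestamp order.
    """

    def __init__(self, span):
        self.span = span          # how long a tuple stays valid
        self.tuples = deque()     # oldest tuple at the left

    def invalidate(self, now):
        # Drop tuples whose age has reached the window span.
        while self.tuples and self.tuples[0][0] <= now - self.span:
            self.tuples.popleft()

    def insert(self, tup):
        self.tuples.append(tup)

    def probe(self, key):
        # Linear scan, as a nested loop join would do.
        return [t for t in self.tuples if t[1] == key]


def one_way_join(new_tuple, own_window, other_window, output):
    """Process one arriving tuple against the other stream's window."""
    now, key, _payload = new_tuple
    # Invalidate: remove tuples that are too old to participate.
    other_window.invalidate(now)
    # Scan: find join partners in the other stream's window.
    for match in other_window.probe(key):
        # Insert the joined tuple into the result stream.
        output.append((new_tuple, match))
    # Assumption of this sketch: remember the new tuple for the
    # one-way join running in the opposite direction.
    own_window.insert(new_tuple)
```

A full join would call one_way_join(t, window_a, window_b, out) for tuples arriving on stream A and one_way_join(t, window_b, window_a, out) for tuples arriving on stream B.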

The four join algorithms that the authors compare are nested loop join, hash join, B+ tree index nested loop join, and T-tree index nested loop join. Nested loop join stores the tuples in the window in a list and scans through them linearly for every arriving tuple from the other stream. This is the simplest structure and the most expensive to search, but also the least expensive to add tuples to and remove tuples from. Hash join stores the tuples in the window in a hash table, so not all tuples in the window need to be scanned for every tuple being joined with it. The B+ tree index join stores the tuples in the window in a B+ tree. Probing the B+ tree is faster than probing the hash table (for large bucket sizes), but inserting into and deleting from the tree is more expensive. The T-tree is similar to the B+ tree, except that it stores data entries at non-leaf nodes as well, whereas the B+ tree stores data entries only at leaf nodes.
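
To make the trade-off between search cost and maintenance cost concrete, here is a rough sketch of the two simplest window structures. The class names ListWindow and HashWindow and the (timestamp, key) tuple layout are my own assumptions. The hash version uses a Python dict with one bucket per distinct key, which is simpler than a hash table with a fixed number of buckets, and both versions assume tuples are inserted in non-decreasing timestamp order.

```python
from collections import defaultdict, deque


class ListWindow:
    """Nested loop side: cheap insert and expire, linear-time probe."""

    def __init__(self):
        self.items = deque()  # (timestamp, key) in arrival order

    def insert(self, ts, key):
        self.items.append((ts, key))

    def expire(self, cutoff):
        # Remove every tuple older than the cutoff timestamp.
        while self.items and self.items[0][0] < cutoff:
            self.items.popleft()

    def probe(self, key):
        # Every tuple in the window is examined.
        return [item for item in self.items if item[1] == key]


class HashWindow:
    """Hash side: probe touches one bucket, but insert and expire
    must maintain both the arrival order and the buckets."""

    def __init__(self):
        self.order = deque()               # arrival order, for expiry
        self.buckets = defaultdict(deque)  # key -> tuples with that key

    def insert(self, ts, key):
        self.order.append((ts, key))
        self.buckets[key].append((ts, key))

    def expire(self, cutoff):
        while self.order and self.order[0][0] < cutoff:
            _ts, key = self.order.popleft()
            bucket = self.buckets[key]
            # The oldest tuple overall is also the oldest in its bucket.
            bucket.popleft()
            if not bucket:
                del self.buckets[key]

    def probe(self, key):
        # Only tuples with a matching key are examined.
        return list(self.buckets.get(key, ()))
```

Both classes return the same matches for the same inputs; they differ only in how much work each operation does, which is the cost the paper's per-unit-time model adds up.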

To maximize the efficiency of these joins, the authors propose using a different join algorithm on each side of the join. Their results show that, assuming unlimited resources, if the two data streams differ greatly in tuple arrival rate, the faster stream should use a T-tree and the slower stream a simple nested loop. If the rates are uneven but not wildly different, the faster stream should use the hash join and the slower stream the T-tree. If both streams arrive at about the same rate, both sides of the join should use the hash join.
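
The rule of thumb above could be written down as a small function. This is only a sketch of the summary as I stated it: the function name choose_algorithms is hypothetical, and the two ratio thresholds are parameters the caller must supply, because these notes do not record where the paper draws the lines between "about equal", "uneven", and "wildly different".

```python
def choose_algorithms(rate_a, rate_b, moderate_ratio, large_ratio):
    """Pick a join algorithm for each side from the arrival rates.

    rate_a, rate_b: tuple arrival rates of streams A and B (positive).
    moderate_ratio, large_ratio: hypothetical cut-offs on the ratio of
    the faster rate to the slower rate, with
    1 <= moderate_ratio <= large_ratio.
    Returns a dict mapping "A" and "B" to an algorithm name.
    """
    if rate_a <= 0 or rate_b <= 0:
        raise ValueError("arrival rates must be positive")
    fast, slow = ("A", "B") if rate_a >= rate_b else ("B", "A")
    ratio = max(rate_a, rate_b) / min(rate_a, rate_b)
    if ratio >= large_ratio:
        # Large difference: T-tree on the fast side, nested loop on the slow side.
        return {fast: "T-tree", slow: "nested loop"}
    if ratio >= moderate_ratio:
        # Uneven but not wildly different: hash on the fast side, T-tree on the slow side.
        return {fast: "hash", slow: "T-tree"}
    # Roughly equal rates: hash join on both sides.
    return {fast: "hash", slow: "hash"}
```

For example, with arbitrarily chosen thresholds of 2 and 10, choose_algorithms(100, 1, 2, 10) assigns the T-tree to stream A and the nested loop to stream B.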

The authors also studied how to maximize the number of join results computed when the computer has limited resources. When the computer does not have the processing power to perform all the joins, the paper shows that the stream with the slower arrival rate should get priority. When the system has limited memory, the slower stream should likewise get priority for memory. This is because each tuple of the fast stream can then be joined with the entire window of the slower stream. If the faster stream were kept in memory instead, its window would only be joined with the few tuples that arrive on the slow stream.
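
The memory argument can be checked with a back-of-the-envelope model. The model below is my own simplification, not the paper's cost model: it assumes every arriving tuple matches each stored tuple of the other stream with the same probability (selectivity), and that a stream's window can use whatever memory it is given. The names output_rate and best_split are hypothetical.

```python
def output_rate(rate_fast, rate_slow, mem_fast, mem_slow, selectivity):
    """Expected join results per unit time under the simplified model.

    Each arriving fast tuple probes the slow stream's stored window,
    and each arriving slow tuple probes the fast stream's stored window.
    """
    return selectivity * (rate_fast * mem_slow + rate_slow * mem_fast)


def best_split(rate_fast, rate_slow, total_mem, selectivity):
    """Return how many of total_mem tuple slots the slow stream should get."""
    candidates = (
        (output_rate(rate_fast, rate_slow, total_mem - m, m, selectivity), m)
        for m in range(total_mem + 1)
    )
    # Highest output rate wins; ties go to the larger slow-stream share.
    return max(candidates)[1]
```

Because the output rate grows by rate_fast for every slot given to the slow stream but only by rate_slow for every slot given to the fast stream, the model hands all the memory to the slower stream whenever the rates differ, which agrees with the paper's conclusion.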

I had never really thought about join methods for streaming data before this, and I would never have expected some of the results. For one thing, it was surprising that the nested loop method was ever among the best performers, even in combination with another type of join. It was a pretty interesting paper.
First draft, submitted 11/19/2003, posted 11/20/2003