Storing and Accessing Large Volumes of
High-Dimensional Scientific Data
Abstract
Scientific data repositories are so massive that they are frequently maintained on tertiary storage. The data they hold are often of very high dimensionality as well. In this seminar, we will discuss retrieval and clustering techniques for large volumes of scientific data with many dimensions. One group of these techniques supports efficient manipulation of scientific data on secondary storage. The other group is designed to speed up retrieval from massive volumes of multi-dimensional data on tertiary storage. While the primary application of interest is high-energy physics, the techniques are appropriate for many other scientific fields.
These retrieval and clustering techniques build on the properties of two new space-partitioning strategies, called Gamma and Theta. Traditional space-partitioning schemes must divide a d-dimensional space into 2**d regions to split each axis at least once; the new strategies achieve the same effect with only O(d) regions. They can effectively handle the highly skewed data distributions and the partially specified search predicates that characterize advanced scientific studies. The new partitioning strategies also make it possible to cluster large volumes of high-dimensional data without costly dimensionality reduction. When applied to a large data set, the clustering technique performs only one scan over the entire set.
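
To give a sense of the gap between the two region counts, the short Python sketch below compares 2**d with a linear count for a few dimensionalities. It is only an arithmetic illustration, not an implementation of Gamma or Theta: the function names are hypothetical, and the constant c in the linear count is an assumption, since the abstract states only the O(d) bound.

```python
def cells_when_every_axis_is_split(d: int) -> int:
    """Regions produced when each of the d axes is halved once (grid-style split)."""
    if d < 1:
        raise ValueError("d must be a positive integer")
    return 2 ** d


def linear_region_count(d: int, c: int = 1) -> int:
    """Hypothetical stand-in for an O(d) scheme: c * d regions.

    The constant c is an assumption made for illustration only; the
    abstract gives the asymptotic bound, not the exact count.
    """
    if d < 1 or c < 1:
        raise ValueError("d and c must be positive integers")
    return c * d


if __name__ == "__main__":
    # Print the two counts side by side for a few dimensionalities.
    for d in (2, 10, 20, 40):
        print(d, cells_when_every_axis_is_split(d), linear_region_count(d))
```

At d = 20, splitting every axis once already yields 1,048,576 regions, while a count that grows linearly with d stays in the tens.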