IIT Database Group

Getting Involved

If you are interested in pursuing a Ph.D. in our group, see the jobs page. Otherwise, there are several ways to get involved in our research as a graduate or undergraduate student. We offer several short-term and long-term projects that give students hands-on experience in database research. This page gives a first overview of the types of projects and master's thesis topics that are currently available. Please contact Dr. Glavic for more information. As a master's student, there are three ways to do research with our group:

  • Volunteer work: Contribute to one of our projects on a voluntary basis.
  • CS 597: Do a one-semester project for credit.
  • Master's thesis: This is usually at least a two-semester commitment.

Master's Thesis

Please contact Dr. Glavic if you are interested in writing a master's thesis with our group. Many of the topics below are suitable for a master's thesis. You are welcome to suggest your own topic.

Current Master's Theses

  • Zhen Wang: Efficient Ranking of Explanations for Data Exchange Errors in Vagabond.

Open Student Projects and Thesis Topics

Below is a list of possible topics for student projects, sorted by project. This is not a comprehensive list; more topics are available.

STBenchmark 2.0

For more information on the general project see here.

  • Adding Support for a Parallel Data Generator to STBenchmark 2.0: STBenchmark 2.0 is a benchmark for data integration systems. Given a configuration file, the system generates a source and a target schema, data for the source schema, and mappings that model how to map data from the source to the target schema. In this project, a parallel data generation framework (PDGF) will be integrated with the STBenchmark 2.0 code to enable efficient data generation.
  • Improving Source and Target Schema Element Reuse in STBenchmark 2.0: STBenchmark 2.0 allows schema elements to be shared between multiple instances of mapping primitives, which allows more complex and realistic mapping scenarios to be produced. This functionality has been implemented, but many improvements are possible. For instance, a schema element can only be reused for a new primitive instance if certain requirements are met, and the current implementation is conservative in deciding whether an element can be reused. In this project you will relax these conditions (which requires significant changes to parts of the primitive generators) and come up with new ways to reuse schema elements.
  • Implementing Composition or Inversion: Two typical mapping operators are inversion (inverting the direction of a mapping to be able to produce a source instance from a target instance instead of a target instance from a source instance) and composition (creating a direct mapping from a schema S to a schema T based on a mapping between S and an intermediate schema U and another mapping from U to T). In this project you will implement a well-known composition or inversion operator in STBenchmark 2.0. This will enable more complex and realistic mappings to be produced.
  • Extend STBenchmark 2.0 with Error Generation: To be able to use the benchmark for the Vagabond project, it needs to be able to inject errors into a generated data exchange scenario and store information about these errors. The goal of this project is to implement this error generation and storage.
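The idea of mapping composition described above can be illustrated with a small, self-contained sketch. The schemas, attribute names, and data below are invented for this example (they are not part of STBenchmark 2.0); views stand in for mappings, and composition corresponds to unfolding one view definition into another so that the intermediate schema is bypassed:

```python
import sqlite3

# Hypothetical illustration of mapping composition: given a mapping from a
# source schema S to an intermediate schema U, and a mapping from U to a
# target schema T, composition derives a direct mapping S -> T.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE s(name TEXT, city TEXT)")
cur.executemany("INSERT INTO s VALUES (?, ?)",
                [("Alice", "Chicago"), ("Bob", "Boston")])

# Mapping S -> U: copy both attributes into the intermediate schema.
cur.execute("CREATE VIEW u AS SELECT name, city FROM s")
# Mapping U -> T: keep only persons from Chicago.
cur.execute("CREATE VIEW t AS SELECT name FROM u WHERE city = 'Chicago'")

# The composed mapping S -> T bypasses U entirely: unfolding the view
# definitions yields a single query over the source schema.
composed = cur.execute(
    "SELECT name FROM s WHERE city = 'Chicago'").fetchall()
via_u = cur.execute("SELECT name FROM t").fetchall()
print(composed == via_u)  # the composed mapping yields the same target
```

A real composition operator works on mapping specifications (e.g., tuple-generating dependencies) rather than SQL views, but the unfolding intuition is the same.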

Provenance using Temporal Databases (Collaboration with Oracle)

For more information on the general project see here.

  • Provenance for SQL Updates: Perm has comprehensive support for generating provenance for SQL queries. In this project we will develop an approach for generating provenance for SQL update operations (INSERT, DELETE, UPDATE). Using a temporal database, i.e., one in which past database states are preserved, you will map update operations to queries over the temporal database so that modified versions of the Perm query rewrites can be applied to compute provenance information for update operations.
  • Implementing a Query-Rewrite Based Provenance Database Middleware: As part of a larger project in collaboration with Oracle, you will participate in implementing a provenance middleware that enables computation of provenance using a standard relational database. This involves building an SQL parser, designing and implementing an internal query representation (e.g., a standard tree representation of SQL query blocks or a relational algebra representation), porting the provenance rewrites using advanced Oracle features, and implementing an SQL serializer that transforms an internal rewritten query representation into SQL code. Several projects are possible; contact Dr. Glavic for details.
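To give a feel for the query-rewrite approach used by Perm and by the middleware described above, here is a minimal sketch. The table, data, and the `prov_*` column-naming convention are assumptions made for this example only; the actual rewrite rules cover the full relational algebra, not just selection:

```python
import sqlite3

# Sketch of query-rewrite based provenance: a query is rewritten so that
# every result tuple is paired with the input tuples that produced it,
# with provenance represented as ordinary relational attributes.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE r(id INTEGER, dept TEXT, salary INTEGER)")
cur.executemany("INSERT INTO r VALUES (?, ?, ?)",
                [(1, "cs", 100), (2, "cs", 120), (3, "ee", 90)])

# Original query: departments with an employee earning more than 95.
original = "SELECT DISTINCT dept FROM r WHERE salary > 95"

# Rewritten query: the same selection, but each result tuple carries the
# contributing input tuple as extra provenance attributes.
rewritten = ("SELECT dept, id AS prov_r_id, dept AS prov_r_dept, "
             "salary AS prov_r_salary FROM r WHERE salary > 95")

rows = cur.execute(rewritten).fetchall()
print(rows)  # each 'cs' result row is paired with its witness tuple
```

Because the rewritten query is itself plain SQL, it can be executed by any standard relational engine, which is exactly what makes a middleware-only implementation possible.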

Native Database Provenance

For more information on the general project see here.

  • A Provenance Datatype: The Perm system enables generation and querying of provenance for relational databases. This functionality is implemented as an extension of PostgreSQL that uses query rewrite techniques to generate provenance. Provenance is represented as standard relational data. In this project, the system is to be extended with a new declarative data type for storing provenance information and evaluating it dynamically. That is, instead of propagating the actual provenance data of a query result across several tuples, this data type will store a query that, if executed, evaluates to the provenance data. You will design the structure and byte representation of this data type, implement functions for efficiently manipulating values of the new data type, and implement support for dynamically evaluating the queries stored as values of this data type at query runtime. The ultimate goal is to apply this data type in the query rewrites and compare the performance of the new rewrites with the existing ones.
  • Native Provenance-aware Operators for a Database Execution Engine: Query-rewrite based techniques for provenance computation (such as the ones used in Perm) have the intrinsic disadvantage that their performance is restricted by the limitations of the SQL language and the operators supported by the database system's execution engine. This disadvantage can be partially overcome by implementing new provenance-aware physical operators that are specifically tailored for efficient provenance computation and integrating these operators with the database system and provenance rewrite techniques. Preliminary work on a provenance-aware aggregation operator demonstrates that performance improvements of several orders of magnitude are possible. Extending this work, you will implement new provenance-aware operators and improve the support for provenance-aware aggregation (e.g., build a cost model for this operator, improve optimizer support, and improve compatibility of this operator with rewrites for other algebra operators).
  • Index Advisor for Provenance Queries: An index advisor proposes a set of indexes to the user of a DBMS that would (hopefully) speed up the performance of the system over a given workload (set of queries). In the first part of this project you will run experiments on Perm using a fixed workload and different indexes to figure out which indexes benefit provenance queries and normal queries. The insights from this part will then be used to design an index advisor for queries with provenance.
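The lazy-evaluation idea behind the provenance datatype above can be sketched in a few lines. The class and method names are purely illustrative (the actual project targets a PostgreSQL C-level data type, not Python), but the sketch shows the core contract: a value of the datatype stores a query, and the provenance data only materializes when that query is run on demand:

```python
import sqlite3

# Sketch of a declarative provenance datatype with dynamic evaluation:
# instead of storing the (potentially large) provenance of a result tuple
# directly, a value stores a query that evaluates to that provenance.
class LazyProvenance:
    def __init__(self, conn, query):
        self.conn = conn
        self.query = query  # the provenance is represented by this query

    def evaluate(self):
        # Dynamic evaluation: the stored query is executed only when the
        # provenance is actually requested.
        return self.conn.execute(self.query).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE r(id INTEGER, val TEXT)")
conn.executemany("INSERT INTO r VALUES (?, ?)", [(1, "a"), (2, "b")])

# A result tuple's provenance is kept as a compact query, not as data,
# avoiding the duplication caused by propagating provenance across tuples.
prov = LazyProvenance(conn, "SELECT id, val FROM r WHERE id = 1")
print(prov.evaluate())
```

The design trade-off is storage versus evaluation cost: the stored query is compact, but retrieving the provenance requires executing it at query runtime.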


Vagabond

For more information on the general project see here.

  • Vagabond Interface Performance Improvements: The Vagabond project investigates how to automatically generate explanations for errors in data exchange. The system consists of a user interface and an underlying explanation generation engine. This project aims at improving the graphical frontend to support large schemas and datasets. For example, the table viewers have to be updated to work efficiently with multi-million-row tables, and new visual exploration features are needed to deal with very large schemas that contain thousands of relations.
  • Database-side Batch Handling of Explanations and Error Sets: Vagabond retrieves data and provenance from an underlying DBMS during the generation and ranking of explanations. This can result in large amounts of data being transferred between the underlying TRAMP server and the Vagabond Java middleware. In this project you will develop techniques for partially executing operations on error sets and explanations on the database backend, which will help to reduce the amount of data shipped between server and client. Vagabond generates the explanations for a set of errors by generating all explanations for each error in the set, and each such computation requires several database queries to be executed. The second part of this project is to compute one type of explanation for a large number of errors at once by executing a fixed number of queries. This will let us benefit from the query optimization techniques of the underlying DBMS.
  • Support for Assertions about Correct Mapping Scenario Elements: The current version of Vagabond only allows the user to specify which elements (data, mappings, schema elements, ...) of a mapping scenario are incorrect. Explanations for these errors may invalidate arbitrary scenario elements as a side-effect. However, often the user will know (or assume) that certain elements in the data exchange scenario are correct. In this project you will add support for assertions about which elements of a data exchange scenario are correct. A representation of these assertions has to be designed and the explanation generation engine of Vagabond has to be modified to take these assertions into account (no explanations that invalidate the correct elements should be generated).
  • Interactive Explanation Refinement: Given the ranked sets of explanations produced by Vagabond, the goal of this project is to enable the user to give feedback on which explanations are correct. This feedback should then be incorporated into the explanations by marking all side-effects of a correct explanation as additional errors and updating the ranking accordingly.
  • Proposing Corrections Based on Explanations: Currently, Vagabond only identifies the causes of errors (the explanations). In this project you will develop methods that suggest potential fixes for a given set of causes. For example, if some attribute values in the source instance have been identified as incorrect, a potential fix is to correct these values.
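The batching idea from the database-side batch handling project above can be sketched as follows. The schema, data, and error set are hypothetical; the point is the contrast between one round trip per error and a fixed number of queries for the whole error set:

```python
import sqlite3

# Sketch of database-side batching: rather than issuing one query per
# marked error, the whole error set is shipped to the backend in a single
# query, letting the DBMS optimize one large request.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target(tid INTEGER, val TEXT)")
conn.executemany("INSERT INTO target VALUES (?, ?)",
                 [(1, "ok"), (2, "bad"), (3, "bad"), (4, "ok")])

error_ids = [2, 3]  # tuples the user marked as erroneous

# Naive approach: one round trip to the database per error.
per_error = [conn.execute(
    "SELECT tid, val FROM target WHERE tid = ?", (e,)).fetchall()
    for e in error_ids]

# Batched approach: a fixed number of queries (here, one) for the set.
placeholders = ",".join("?" * len(error_ids))
batched = conn.execute(
    f"SELECT tid, val FROM target WHERE tid IN ({placeholders})",
    error_ids).fetchall()

print(batched)  # same tuples, retrieved in a single round trip
```

In the actual project the batched queries would compute explanations rather than simple lookups, but the benefit is the same: fewer round trips and more scope for the optimizer.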