CS520 - Data Integration, Warehousing, and Provenance - 2023 Fall

Course Webpage for CS520 - 2023 Fall taught by Boris Glavic

Literature Review

Organization

Students have to form groups of 3. Each group will read a research paper, write a report, and give a 20-minute presentation about the paper.

Presentation

The presentations will be given in a single block session on 11/30 on Zoom. You can find the detailed presentation schedule here: Seminar Schedule. You have to email your slides (e.g., PowerPoint or LaTeX) to the TA or instructor the night before your talk. The schedule for talks will be given once talks are assigned. Please read the links below on how to give a good presentation.

Report

You will have to write a report that summarizes the content of the paper, explains its main ideas in a way understandable by the other students in the course (they should not have to read another 20 papers to understand what you are writing about), and gives an objective critique of the presented methods or systems. There is no page limit, but try to avoid lengthy and verbose writing as well as short and incomprehensible reports. Again, read some of the links below to get some ideas about how to write a good paper or report.

Late policies:

  • 1-3 days late: -10% points
  • 4-7 days late: -20% points
  • more than 7 days late: 0 points
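
The late policy above can be read as a lookup on the number of days late. The following is a minimal sketch, not the official grading procedure. It assumes that the percentage is deducted from the points the submission earned, that on-time submissions receive no penalty, and that lateness is counted in whole days. The function name `apply_late_penalty` is hypothetical.

```python
def apply_late_penalty(points: float, days_late: int) -> float:
    """Return the points left after applying the late policy listed above.

    Assumption: the deduction is a percentage of the earned points.
    """
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    if days_late == 0:
        deduction_percent = 0      # on time: no penalty (assumed)
    elif days_late <= 3:
        deduction_percent = 10     # 1-3 days late: -10%
    elif days_late <= 7:
        deduction_percent = 20     # 4-7 days late: -20%
    else:
        deduction_percent = 100    # more than 7 days late: 0 points
    return points * (100 - deduction_percent) / 100


if __name__ == "__main__":
    for days in (0, 2, 5, 8):
        print(days, apply_late_penalty(100, days))
```

If the deduction is instead meant as a share of the maximum achievable points, only the final `return` line would change.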

Schedule

We expect the deliverables according to the following deadlines:

  • 08/31 - Deadline: Form groups
  • 09/12 - Literature review - Select literature review paper
  • 10/10 - Literature review - Read the paper and determine the structure of the report and meet with Prof./TA to discuss structure
  • 11/14 - Literature review and Data Curation Project - First draft of slides due for both and slide review meeting with Prof./TA. Show how you solved the data quality problems.
  • 11/30 - Literature Review and Data Curation Project - In-class Presentations
  • 11/30 - Literature review and Data Curation Project - Final versions of reports due

Help for writing the report, preparing slides, and giving a talk

How to give a presentation and prepare slides:

How to write a scientific article:

  • Page on how to write a CS article. Also comments on some general writing rules.
  • Simon Peyton Jones slides and video on how to write a great research paper

Literature Review Papers

You will have until 08/31 to form groups and until 09/12 to select which paper you want to review. We will send you a link to a form for voting on papers. You can access the PDFs of the papers on Google Drive (you need to log in with your IIT Google account): https://drive.google.com/drive/folders/14mFrJDCge_JTxj56d-dJdCFD7vXQ-4yh?usp=sharing. This semester you can select from the following papers:

Data Cleaning and Curation

  • Discovering Similarity Inclusion Dependencies, Youri Kaminsky, Eduardo H. M. Pena, Felix Naumann, Proc. ACM Manag. Data 1 (1), 75:1--75:24, 2023.
  • Parallel Rule Discovery from Large Datasets by Sampling, Wenfei Fan, Ziyan Han, Yaoshu Wang, Min Xie, SIGMOD '22: International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022, pp. 384--398, 2022.
  • DQDF: Data-Quality-Aware Dataframes, Phanwadee Sinthong, Dhaval Patel, Nianjun Zhou, Shrey Shrivastava, Arun Iyengar, Anuradha Bhamidipaty, Proc. VLDB Endow. 15 (4), 949--957, 2021.
  • Fast Detection of Denial Constraint Violations, Eduardo H. M. Pena, Eduardo Cunha de Almeida, Felix Naumann, Proc. VLDB Endow. 15 (4), 859--871, 2021.
  • Efficient and Effective Data Imputation with Influence Functions, Xiaoye Miao, Yangyang Wu, Lu Chen, Yunjun Gao, Jun Wang, Jianwei Yin, Proc. VLDB Endow. 15 (3), 624--632, 2021.
  • Automated Data Cleaning Can Hurt Fairness in Machine Learning-Based Decision Making, Shubha Guha, Falaah Arif Khan, Julia Stoyanovich, Sebastian Schelter, 2023 IEEE 39th International Conference on Data Engineering (ICDE), pp. 3747--3754, 2023.
  • Schema Matching Using Pre-Trained Language Models, Yunjia Zhang, Avrilia Floratou, Joyce Cahoon, Subru Krishnan, Andreas C. Müller, Dalitso Banda, Fotis Psallidas, Jignesh M. Patel, 39th IEEE International Conference on Data Engineering, ICDE 2023, Anaheim, CA, USA, April 3-7, 2023, pp. 1558--1571, 2023.
  • Data Dependencies for Query Optimization: A Survey, Jan Kossmann, Thorsten Papenbrock, Felix Naumann, VLDB J. 31 (1), 1--22, 2022.

Integration, Matching, and Mappings

  • SANTOS: Relationship-Based Semantic Table Union Search, Aamod Khatiwada, Grace Fan, Roee Shraga, Zixuan Chen, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald, Proc. ACM Manag. Data 1 (1), 9:1--9:25, 2023.
  • Ground Truth Inference for Weakly Supervised Entity Matching, Renzhi Wu, Alexander Bendeck, Xu Chu, Yeye He, Proc. ACM Manag. Data 1 (1), 32:1--32:28, 2023.
  • Flexer: Flexible Entity Resolution for Multiple Intents, Bar Genossar, Roee Shraga, Avigdor Gal, Proc. ACM Manag. Data 1 (1), 42:1--42:27, 2023.
  • GIO: Generating Efficient Matrix and Frame Readers for Custom Data Formats by Example, Saeed Fathollahzadeh, Matthias Boehm, Proc. ACM Manag. Data 1 (2), 120:1--120:26, 2023.
  • Auto-Tables: Synthesizing Multi-Step Transformations to Relationalize Tables without Using Examples, Peng Li, Yeye He, Cong Yan, Yue Wang, Surajit Chaudhuri, CoRR abs/2307.14565, 2023.
  • JEDI: These Aren't the JSON Documents You're Looking for?, Thomas Hütter, Nikolaus Augsten, Christoph M. Kirsch, Michael J. Carey, Chen Li, SIGMOD '22: International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022, pp. 1584--1597, 2022.
  • Hierarchical Entity Resolution Using an Oracle, Sainyam Galhotra, Donatella Firmani, Barna Saha, Divesh Srivastava, SIGMOD '22: International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022, pp. 414--428, 2022.
  • Witan: Unsupervised Labelling Function Generation for Assisted Data Programming, Benjamin Denham, Edmund M.-K. Lai, Roopak Sinha, M. Asif Naeem, Proc. VLDB Endow. 15 (11), 2334--2347, 2022.
  • Entity Resolution On-Demand, Giovanni Simonini, Luca Zecchini, Sonia Bergamaschi, Felix Naumann, Proc. VLDB Endow. 15 (7), 1506--1518, 2022.
  • MATE: Multi-Attribute Table Extraction, Mahdi Esmailoghli, Jorge-Arnulfo Quiané-Ruiz, Ziawasch Abedjan, Proc. VLDB Endow. 15 (8), 1684--1696, 2022.
  • A Critical Re-Evaluation of Neural Methods for Entity Alignment, Manuel Leone, Stefano Huber, Akhil Arora, Alberto Garcia-Durán, Robert West, Proc. VLDB Endow. 15 (8), 1712--1725, 2022.
  • Windtunnel: Towards Differentiable ML Pipelines beyond a Single Model, Gyeong-In Yu, Saeed Amizadeh, Sehoon Kim, Artidoro Pagnoni, Ce Zhang, Byung-Gon Chun, Markus Weimer, Matteo Interlandi, Proc. VLDB Endow. 15 (1), 11--20, 2021.
  • Benchmarking Filtering Techniques for Entity Resolution, George Papadakis, Marco Fisichella, Franziska Schoger, George Mandilaras, Nikolaus Augsten, Wolfgang Nejdl, 39th IEEE International Conference on Data Engineering, ICDE 2023, Anaheim, CA, USA, April 3-7, 2023, pp. 653--666, 2023.

Data Provenance

  • Xinsight: Explainable Data Analysis through the Lens of Causality, Pingchuan Ma, Rui Ding, Shuai Wang, Shi Han, Dongmei Zhang, Proc. ACM Manag. Data 1 (2), 156:1--156:27, 2023.
  • Reptile: Aggregation-Level Explanations for Hierarchical Data, Zezhou Huang, Eugene Wu, CoRR abs/2103.07037, 2021.
  • HypeR: Hypothetical Reasoning with What-If and How-to Queries Using a Probabilistic Causal Approach, Sainyam Galhotra, Amir Gilad, Sudeepa Roy, Babak Salimi, SIGMOD '22: International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022, pp. 1598--1611, 2022.
  • Data Provenance for Recursive SQL Queries, Benjamin Dietrich, Tobias Müller, Torsten Grust, Proceedings of the 14th International Workshop on the Theory and Practice of Provenance, TaPP 2022, Philadelphia, Pennsylvania, 17 June 2022, pp. 9:1--9:8, 2022.
  • Computing the Shapley Value of Facts in Query Answering, Daniel Deutch, Nave Frost, Benny Kimelfeld, Mikaël Monet, SIGMOD '22: International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022, pp. 1570--1583, 2022.
  • Computing How-Provenance for SPARQL Queries via Query Rewriting, Daniel Hernández, Luis Galárraga, Katja Hose, Proc. VLDB Endow. 14 (13), 3389--3401, 2021.
  • Data Shapley: Equitable Valuation of Data for Machine Learning, Amirata Ghorbani, James Y. Zou, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pp. 2242--2251, 2019.
  • EDA4SUM: Guided Exploration of Data Summaries, Aurélien Personnaz, Brit Youngmann, Sihem Amer-Yahia, Proc. VLDB Endow. 15 (12), 3590--3593, 2022.
  • Erebus: Explaining the Outputs of Data Streaming Queries, Dimitris Palyvos-Giannas, Katerina Tzompanaki, Marina Papatriantafilou, Vincenzo Gulisano, Proc. VLDB Endow. 16 (2), 230--242, 2022.
  • Data Provenance for SHACL, Thomas Delva, Anastasia Dimou, M. Jakubowski, Jan Van den Bussche, EDBT 2023, the 26th International Conference on Extending Database Technology, 2023.
  • Towards Practical Approximate Lineage, Michael Leybovich, Oded Shmueli, Proceedings of the 14th International Workshop on the Theory and Practice of Provenance, TaPP 2022, Philadelphia, Pennsylvania, 17 June 2022, pp. 3:1--3:8, 2022.
  • Putting Things into Context: Rich Explanations for Query Answers Using Join Graphs, Chenjie Li, Zhengjie Miao, Qitian Zeng, Boris Glavic, Sudeepa Roy, Proceedings of the 46th International Conference on Management of Data, pp. 1051--1063, 2021.
  • Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V, Roee Shraga, Renée J. Miller, Proc. VLDB Endow. 16 (6), 1587--1600, 2023.
  • FEDEX: An Explainability Framework for Data Exploration Steps, Daniel Deutch, Amir Gilad, Tova Milo, Amit Mualem, Amit Somech, Proc. VLDB Endow. 15 (13), 3854--3868, 2022.
  • Computing Rule-Based Explanations by Leveraging Counterfactuals, Zixuan Geng, Maximilian Schleich, Dan Suciu, CoRR abs/2210.17071, 2022.