Literature Review

Overview

During the course you will read and summarize several research papers covering state-of-the-art techniques in Big Data processing. Each student will present one of these papers in class. Papers marked as required must be read by all students in the class.

For your presentation, you can select any paper that is not marked as required.

Please select a paper by 09/05.

The papers are available through Google Drive.

Presentation and Report

Please prepare a 20-25 minute talk with slides to present the paper you have selected. The whole presentation, including Q&A, should take 30-35 minutes. In addition, you need to write a report explaining and critiquing the presented techniques.

The schedule for presentations is shown below.

The report is due on 11/15.

Help for writing the report, preparing slides, and giving a talk

How to give a presentation and prepare slides:

How to write a scientific article:

  • Page on how to write a CS article, which also comments on some general writing rules.
  • Simon Peyton Jones's slides and video on how to write a great research paper

Presentation Schedule

Student | Paper | Presentation Date
Sharma | The Google file system | 10/05/21
Waghela | Dynamo: Amazon’s highly available key-value store | 10/05/21
Rizvi | Apache Flink: Stream and batch processing in a single engine | 10/07/21
Singh | Skipping-oriented partitioning for columnar layouts | 10/12/21
Xie | Big Data Analytics over Encrypted Datasets with Seabed | 10/12/21
Patel | Cassandra: a decentralized structured storage system | 10/14/21
Campbell | Incremental, Iterative Data Processing With Timely Dataflow | 10/14/21
Cornelius | Tracing nested data with structural provenance for big data analytics | 10/19/21
Tang | Kafka: A distributed messaging system for log processing | 11/09/21
Mohammed | Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources | 11/11/21

List of Papers

Distributed Storage & NoSQL Databases

  • (REQUIRED) LSM-Based Storage Techniques: a Survey, Chen Luo, Michael J. Carey, VLDB J., 2020

  • (REQUIRED) Bigtable: A Distributed Storage System for Structured Data, Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber, ACM Trans. Comput. Syst., 2008

  • (REQUIRED) The Hadoop distributed file system, Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 2010

  • Autopart: Automating schema design for large scientific databases using data partitioning, S. Papadomanolakis, A. Ailamaki, 2004

  • Skipping-oriented partitioning for columnar layouts, Liwen Sun, Michael J Franklin, Jiannan Wang, Eugene Wu, Proceedings of the VLDB Endowment, 2016

  • Optimal Bloom Filters and Adaptive Merging for LSM-Trees, Niv Dayan, Manos Athanassoulis, Stratos Idreos, ACM Trans. Database Syst., 2018

  • Cassandra: a decentralized structured storage system, Avinash Lakshman, Prashant Malik, ACM SIGOPS Operating Systems Review, 2010

  • Dynamo: Amazon’s highly available key-value store, G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, W. Vogels, ACM SIGOPS Operating Systems Review, 2007

  • The Google file system, Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, SIGOPS Oper. Syst. Rev. (5), 29–43, 2003

  • OctopusFS: A Distributed File System with Tiered Storage Management, Elena Kakoulli, Herodotos Herodotou, Proceedings of the 2017 ACM International Conference on Management of Data, 2017

Distributed Batch Processing

  • (REQUIRED) MapReduce: simplified data processing on large clusters, Jeffrey Dean, Sanjay Ghemawat, Proceedings of the 6th Symposium on Operating Systems Design & Implementation - Volume 6, 2004

  • (REQUIRED) Spark: cluster computing with working sets, Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, Ion Stoica, Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, 2010

  • (REQUIRED) A comparison of join algorithms for log processing in MapReduce, Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao, Eugene J. Shekita, Yuanyuan Tian, Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10, 2010, pp. 975–986, 2010

  • All roads lead to Rome: optimistic recovery for distributed iterative data processing, Sebastian Schelter, Stephan Ewen, Kostas Tzoumas, Volker Markl, Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, 2013

  • Incremental, Iterative Data Processing With Timely Dataflow, Derek Gordon Murray, Frank McSherry, Michael Isard, Rebecca Isaacs, Paul Barham, Martin Abadi, Commun. ACM, 2016

  • Distributed Join Algorithms on Thousands of Cores, Claude Barthels, Gustavo Alonso, Torsten Hoefler, Timo Schneider, Ingo Mueller, Proc. VLDB Endow., 2017

  • Runtime Code Generation in Cloudera Impala, Skye Wanderman-Milne, Nong Li, IEEE Data Eng. Bull., 2014

  • RIOS: Runtime Integrated Optimizer for Spark, Youfu Li, Mingda Li, Ling Ding, Matteo Interlandi, Proceedings of the ACM Symposium on Cloud Computing, SoCC 2018, Carlsbad, CA, USA, October 11-13, 2018, 2018

  • Hyracks: A flexible and extensible foundation for data-intensive computing, Vinayak Borkar, Michael Carey, Raman Grover, Nicola Onose, Rares Vernica, 2011 IEEE 27th International Conference on Data Engineering (ICDE), 2011

  • A practical scalable distributed B-tree, Marcos K. Aguilera, Wojciech Golab, Mehul A. Shah, Proc. VLDB Endow., 2008

  • AsterixDB: A scalable, open source BDMS, Sattam Alsubaiee, Yasser Altowim, Hotham Altwaijry, Alexander Behm, Vinayak Borkar, Yingyi Bu, Michael Carey, Inci Cetindil, Madhusudan Cheelangi, Khurram Faraaz, et al., Proceedings of the VLDB Endowment, 2014

  • SharedDB: killing one thousand queries with one stone, Georgios Giannikis, Gustavo Alonso, Donald Kossmann, Proceedings of the VLDB Endowment, 2012

  • In-Memory Performance for Big Data, Goetz Graefe, Haris Volos, Hideaki Kimura, Harumi A. Kuno, Joseph Tucek, Mark Lillibridge, Alistair C. Veitch, Proc. VLDB Endow., 2014

High-level Dataflow & Query Languages

  • (REQUIRED) Spark SQL: Relational data processing in Spark, Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, Matei Zaharia, Proceedings of the ACM International Conference on Management of Data (SIGMOD), 2015

  • Efficient Control Flow in Dataflow Systems: When Ease-of-Use Meets High Performance, Gabor E. Gevay, Tilmann Rabl, Sebastian Bress, Lorand Madai-Tahy, Jorge-Arnulfo Quiane-Ruiz, Volker Markl, 37th IEEE International Conference on Data Engineering, ICDE 2021, Chania, Greece, April 19-22, 2021, 2021

  • Representations and Optimizations for Embedded Parallel Dataflow Languages, Alexander Alexandrov, Georgi Krastev, Volker Markl, ACM Trans. Database Syst., 2019

  • Spinning fast iterative data flows, Stephan Ewen, Kostas Tzoumas, Moritz Kaufmann, Volker Markl, Proceedings of the VLDB Endowment, 2012

  • Big Data Analytics with Datalog Queries on Spark, Alexander Shkapsky, Mohan Yang, Matteo Interlandi, Hsuan Chiu, Tyson Condie, Carlo Zaniolo, SIGMOD. ACM, 2016

  • Tracing nested data with structural provenance for big data analytics, Ralf Diestelkämper, Melanie Herschel, Proceedings of the 23rd International Conference on Extending Database Technology, EDBT 2020, Copenhagen, Denmark, March 30 - April 02, 2020, 2020

  • Orca: a modular query optimizer architecture for big data, Mohamed A Soliman, Lyublena Antova, Venkatesh Raghavan, Amr El-Helw, Zhongxian Gu, Entong Shen, George C Caragea, Carlos Garcia-Alvarado, Foyzur Rahman, Michalis Petropoulos, et al., Proceedings of the 2014 ACM SIGMOD international conference on Management of data, 2014

  • Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources, Edmon Begoli, Jesus Camacho-Rodriguez, Julian Hyde, Michael J. Mior, Daniel Lemire, Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, 2018

Distributed Stream Processing & Publish-Subscribe

  • (REQUIRED) Kafka: A distributed messaging system for log processing, Jay Kreps, Neha Narkhede, Jun Rao, et al., Proceedings of the NetDB, 2011

  • Apache Flink: Stream and batch processing in a single engine, Paris Carbone, Stephan Ewen, Seif Haridi, Asterios Katsifodimos, Volker Markl, Kostas Tzoumas, Data Engineering, 2015

  • Storm@twitter, Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthikeyan Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, Dmitriy V. Ryaboy, International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014, 2014

  • TelegraphCQ: continuous dataflow processing, S. Chandrasekaran, O. Cooper, A. Deshpande, M.J. Franklin, J.M. Hellerstein, W. Hong, S. Krishnamurthy, S.R. Madden, F. Reiss, M.A. Shah, Proceedings of the 2003 ACM SIGMOD international conference on Management of data, 2003

Distributed Transaction Processing and Consensus

  • (REQUIRED) A New Presumed Commit Optimization for Two Phase Commit, Butler W. Lampson, David B. Lomet, 19th International Conference on Very Large Data Bases, August 24-27, 1993, Dublin, Ireland, Proceedings, 1993

  • (REQUIRED) In Search of an Understandable Consensus Algorithm, Diego Ongaro, John K Ousterhout, USENIX Annual Technical Conference, 2014

  • (REQUIRED) H-store: a high-performance, distributed main memory transaction processing system, Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew Pavlo, Alexander Rasin, Stanley Zdonik, Evan PC Jones, Samuel Madden, Michael Stonebraker, Yang Zhang, et al., Proceedings of the VLDB Endowment, 2008

  • Calvin: fast distributed transactions for partitioned database systems, Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, Daniel J Abadi, Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 2012

  • The MemSQL Query Optimizer: A Modern Optimizer for Real-Time Analytics in a Distributed Database, Jack Chen, Samir Jindel, Robert Walzer, Rajkumar Sen, Nika Jimsheleishvilli, Michael Andrews, Proc. VLDB Endow., 2016

  • Spanner: Google’s Globally-Distributed Database, J.C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, JJ Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, et al., OSDI, 2012

  • Highly Available Transactions: Virtues and Limitations, Peter Bailis, Aaron Davidson, Alan D. Fekete, Ali Ghodsi, Joseph M. Hellerstein, Ion Stoica, Proc. VLDB Endow., 2013

  • Distributed snapshot isolation: global transactions pay globally, local transactions pay locally, Carsten Binnig, Stefan Hildenbrand, Franz Färber, Donald Kossmann, Juchang Lee, Norman May, The VLDB Journal - The International Journal on Very Large Data Bases, 2014