Data Curation Project
You will apply the techniques learned in class to clean and integrate one or more real-world datasets. The data curation project will be done in the same groups as the paper review. You will have to:
- Acquire or extract one or more real-world datasets for a domain of your choice.
- Gain an understanding of the data and identify data quality issues.
- Research tools that are suited to the data cleaning, integration, and extraction tasks that you need to apply.
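As a starting point for the second step above, a first pass over a dataset might simply count missing values, distinct values, and exact duplicate rows per column. The following is a minimal sketch using only the Python standard library, not a prescribed method. The function name `profile_rows`, the list of tokens treated as missing, and the sample data are all assumptions made for illustration.

```python
import csv
import io
from collections import Counter

# Tokens treated as "missing". This list is an assumption; adjust it per dataset.
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "?"}


def profile_rows(rows):
    """Count missing values, distinct values, and exact duplicate rows."""
    rows = list(rows)
    columns = list(rows[0].keys()) if rows else []
    missing = Counter()
    distinct = {col: set() for col in columns}
    for row in rows:
        for col in columns:
            # DictReader yields None for fields absent from a short row.
            value = (row[col] or "").strip()
            if value.lower() in MISSING_TOKENS:
                missing[col] += 1
            else:
                distinct[col].add(value)
    # A row seen n times contributes n - 1 duplicates.
    row_counts = Counter(tuple(row[col] for col in columns) for row in rows)
    duplicates = sum(n - 1 for n in row_counts.values() if n > 1)
    return {
        "rows": len(rows),
        "missing": {col: missing[col] for col in columns},
        "distinct": {col: len(distinct[col]) for col in columns},
        "duplicate_rows": duplicates,
    }


# Hypothetical sample data, made up for this example.
SAMPLE_CSV = """ticket_id,street,fine
1,State St,50
2,,75
3,Lake St,N/A
3,Lake St,N/A
"""

report = profile_rows(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(report)
```

For a real dataset, the `io.StringIO` sample would be replaced by an opened file handle. Counts like these only flag candidates for closer inspection; whether a missing or repeated value is actually an error depends on the domain.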
We will use Vizier, an open-source notebook system that is similar to Jupyter or Apache Zeppelin but provides additional functionality not found in these tools. Vizier is available at https://vizierdb.info/.
- Vizier exported notebook (and other code) committed to your group's git repository.
- The notebook also serves as a report describing your data curation project.
Source Code and Dataset Management
The project will be completed in groups of three students. Groups will be determined in the first days of class. The groups will do both the project and paper reviews together.
Once groups are finalized, you will receive an invitation to collaborate on a shared GitHub repository named cs520-s22-groupnumber. All your work in this class will be submitted via your shared private repository. For large files, consider using cloud storage. If the dataset is publicly available, then including a file with links in the repository is sufficient.
The presentations will be given in a single block session on 04/29 on Zoom. You have to email your slides (e.g., PowerPoint or LaTeX) to the TA or instructor the night before your talk. The schedule for talks will be given once talks are assigned. You can find the detailed presentation schedule here: Seminar Schedule. The presentation will be 10 minutes long and should cover the following:
- Introduce the dataset(s), why you have chosen them, and how you have acquired them: What is the domain (e.g., Chicago parking data)? What are the characteristics (dataset size, number of attributes, data format)? …
- Give an overview of the data quality problems you have identified in the data and the methodology and tools used to identify them.
- Explain how you have overcome (or tried to overcome) these problems, what tools you used, and what challenges you encountered.
Ideas for datasets
- Open government initiatives, e.g., City of Chicago Data Portal
- Extract from the web or web services, e.g., Twitter