CS-422 - Assignment 1 (5%)

Exploring Data

Due by: February 14, 2013

Assignment Specifications

In this assignment you will practice data exploration, pre-processing, and visualization.

  1. Download the ``wine-quality'' data set from the UCI repository: http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/. Make sure to download and examine the winequality.names file.

    1. Load the winequality-red.csv file into the workspace.

      1. Separate the wine data into a low quality class (quality $\leq 5$) and a high quality class (quality $> 5$). Find the mean and standard deviation of each attribute for the two classes. Based on these statistics, describe the properties that differ most between low quality and high quality red wines.
      2. Plot the correlation between ``residual sugar'', ``total sulfur dioxide'' and ``alcohol'' for all the red wines. Use red to draw the low quality class (quality $\leq 5$) and blue to draw the high quality class (quality $> 5$).
    2. Without quitting R, load the winequality-white.csv file into the workspace.

      1. Merge the red wine data with the white wine data and show the histogram of ``quality'' for all the wines. Change the bin size to 2 and observe the difference.
      2. Create a data frame from the first 50 records of red wines and the first 50 records of white wines. Use a parallel coordinates plot to draw this data using the following four attributes: ``citric acid'', ``residual sugar'', ``density'', and ``quality''.

  2. Download the ``adult'' data set from the UCI repository: http://archive.ics.uci.edu/ml/machine-learning-databases/adult/. Make sure to download and examine the ``adult.names'' file.

    1. Load the training set data ``adult.data'' into the workspace.

      1. Show a histogram of ``race'' for people whose native country is ``United-States''.
      2. Show a box plot of ``education-num'', ``capital-gain'' and ``hours-per-week''.
      3. Convert ``workclass'' and ``race'' to be numerical attributes (integer). Use a three dimensional plot to show the relationship between ``workclass'', ``race'', and ``hours-per-week''.
    2. More data exploration:

      1. Select any two attributes and show their joint histogram. What conclusion can you draw from the histogram?
      2. Select any three attributes and plot their relationship using a 2D scatter plot, with one of the selected attributes as the color code. What can you say about the correlation of these attributes?
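
The wine tasks above rely on a few basic R operations: labeling records by a quality threshold, computing per-class statistics, coloring a plot by class, and controlling histogram bins. The following is only a minimal sketch of those operations on a small synthetic data frame, not a solution. The synthetic values and the names `red`, `attrs`, `class_mean` and `class_sd` are illustrative assumptions, and reading ``bin size'' as bin width is also an assumption.

```r
# Synthetic stand-in for a few columns of winequality-red.csv (made-up values).
# The real file is semicolon-separated, so it would be read with something like
# read.csv("winequality-red.csv", sep = ";"), assuming it is in the working directory.
set.seed(1)
red <- data.frame(
  residual.sugar = runif(20, 1, 10),
  alcohol        = runif(20, 8, 14),
  quality        = rep(3:8, length.out = 20)
)

# Label each record as low (quality <= 5) or high (quality > 5).
red$class <- ifelse(red$quality <= 5, "low", "high")

# Per-class mean and standard deviation of the selected attributes.
attrs <- c("residual.sugar", "alcohol")
class_mean <- aggregate(red[attrs], by = list(class = red$class), FUN = mean)
class_sd   <- aggregate(red[attrs], by = list(class = red$class), FUN = sd)
print(class_mean)
print(class_sd)

# Scatter-plot matrix colored by class: red for low, blue for high.
pairs(red[attrs], col = ifelse(red$class == "low", "red", "blue"), pch = 19)

# Histogram of quality with bins of width 1, then bins of width 2.
hist(red$quality, breaks = seq(2.5, 8.5, by = 1), main = "Bin width 1")
hist(red$quality, breaks = seq(2.5, 8.5, by = 2), main = "Bin width 2")
```

When run with Rscript, the plots are written to a PDF file rather than shown on screen. In an interactive session they appear in the graphics window.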

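The adult tasks above involve filtering records, counting a categorical attribute, converting categories to integers, and tabulating two attributes jointly. The following is a minimal sketch of those operations on a tiny synthetic data frame, not a solution. The rows are made up, and the column names and the variables `us`, `race_counts` and `joint` are illustrative assumptions.

```r
# Synthetic stand-in for a few adult.data columns (made-up rows).
# The real file has no header row and pads values with a space, so it would be
# read with something like read.csv("adult.data", header = FALSE, strip.white = TRUE)
# and the columns named by hand from adult.names.
adult <- data.frame(
  workclass      = c("Private", "State-gov", "Private", "Self-emp-inc", "Private", "State-gov"),
  race           = c("White", "Black", "White", "Asian-Pac-Islander", "Black", "White"),
  native.country = c("United-States", "United-States", "Cuba", "United-States", "India", "United-States"),
  hours.per.week = c(40, 13, 40, 60, 45, 38),
  stringsAsFactors = FALSE
)

# Keep only the records whose native country is United-States.
us <- adult[adult$native.country == "United-States", ]

# For a categorical attribute, a bar plot of counts serves as the histogram.
race_counts <- table(us$race)
barplot(race_counts, main = "Race (native country: United-States)")

# Convert categorical attributes to integer codes (one code per factor level).
adult$workclass.num <- as.integer(factor(adult$workclass))
adult$race.num      <- as.integer(factor(adult$race))

# Box plot of a numeric attribute.
boxplot(adult$hours.per.week, main = "hours-per-week")

# Joint counts of two categorical attributes (a two-way contingency table).
joint <- table(adult$workclass, adult$race)
print(joint)
```

The integer codes follow the sorted order of the factor levels, so they carry no natural ordering of the categories. Keep this in mind when interpreting any plot drawn from them.
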
General comments

Electronic Submission Instructions

Please follow this submission procedure:

  1. The programs you write must be written using R.

  2. Direct all questions/comments regarding the assignment to: $cs.iit.edu$

  3. On or before the due date, upload your assignment submission to Blackboard as a zip file. Please do not send assignments via email.

    Note: We must be able to view your report and execute your program in order to grade it.

  4. The organization of the submitted material should be as follows:

  5. Do not submit a paper copy of your report. You will be contacted by email if some material is missing or if you need to meet with the TA.

  6. If your submission is late, upload it to Blackboard and send an email to $cs.iit.edu$ indicating that you have uploaded a late submission. The number of ``late days'' will be determined by the date of your email.



Gady Agam 2013-01-31