CS-422 - Assignment 1 (5%)
Exploring Data
Due by: February 14, 2013
In this assignment you will practice data exploration, pre-processing,
and visualization.
- Download the ``wine-quality'' data set from the UCI repository: http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/.
Make sure to download and examine the winequality.names file.
- Load the winequality-red.csv file into the workspace.
- Separate the wine data into a low quality class (quality )
and a high quality class (quality ). Find the mean and standard
deviation of each attribute for the two classes. Based on these
statistics, describe the properties that differ most between
low quality red wines and high quality red wines.
- Plot the correlation between ``residual sugar'', ``total sulfur
dioxide'' and ``alcohol'' for all the red wines. Use red to draw
the low quality class (quality ) and blue to draw the high quality
class (quality ).
- Without quitting R, load the winequality-white.csv
file into the workspace.
- Merge the red wine data with the white wine data and show
the histogram of ``quality'' for all the wines. Change the bin
size to 2 and observe the difference.
- Create a data frame from the first 50 records of red wines and
the first 50 records of white wines. Use a parallel coordinates plot
to draw this data using the following four attributes: ``citric
acid'', ``residual sugar'', ``density'', and ``quality''.
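The wine steps above can be sketched in base R roughly as follows. This is an illustration under stated assumptions, not a reference solution: it runs on small synthetic data frames (built by a hypothetical helper, make_wine) instead of the real CSV files, and the value of cutoff is a placeholder because the quality thresholds are missing from this copy of the assignment. The column names mirror what read.csv produces for the real files (spaces become dots).

```r
# Sketch only: synthetic stand-ins for winequality-red.csv and
# winequality-white.csv so the example runs without the real files.
# With the real data one would use, e.g.:
#   red <- read.csv("winequality-red.csv", sep = ";")
library(MASS)                   # parcoord(); MASS ships with standard R installs
if (!interactive()) pdf(NULL)   # create no plot file when run as a batch script

set.seed(1)
make_wine <- function(n) {      # hypothetical helper, not part of the assignment
  data.frame(
    citric.acid          = runif(n, 0, 1),
    residual.sugar       = runif(n, 1, 15),
    total.sulfur.dioxide = runif(n, 6, 290),
    density              = runif(n, 0.99, 1.01),
    alcohol              = runif(n, 8, 15),
    quality              = sample(3:8, n, replace = TRUE)
  )
}
red   <- make_wine(200)
white <- make_wine(300)

# ASSUMPTION: the cutoffs are missing from this copy of the assignment;
# 5 is a placeholder, not the value the assignment specifies.
cutoff    <- 5
red_class <- factor(ifelse(red$quality <= cutoff, "low", "high"),
                    levels = c("low", "high"))

# Mean and standard deviation of every attribute, per class.
attrs      <- setdiff(names(red), "quality")
class_mean <- aggregate(red[attrs], by = list(class = red_class), FUN = mean)
class_sd   <- aggregate(red[attrs], by = list(class = red_class), FUN = sd)
print(class_mean)
print(class_sd)

# Pairwise scatter plots: red points = low quality, blue points = high quality.
pairs(red[c("residual.sugar", "total.sulfur.dioxide", "alcohol")],
      col = c("red", "blue")[as.integer(red_class)], pch = 20)

# Merge red and white wines, then compare histogram bin widths 1 and 2.
wines <- rbind(red, white)
hist(wines$quality, breaks = seq(2.5, 9.5, by = 1), main = "Quality, bin width 1")
hist(wines$quality, breaks = seq(2, 10, by = 2),    main = "Quality, bin width 2")

# Parallel coordinates for the first 50 red and first 50 white records.
first100 <- rbind(head(red, 50), head(white, 50))
parcoord(first100[c("citric.acid", "residual.sugar", "density", "quality")],
         col = rep(c("darkred", "goldenrod"), each = 50))
```

Passing explicit breaks to hist() fixes the bin width directly, whereas a single number for breaks is only treated as a suggestion.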
- Download the ``adult'' data set from the UCI repository http://archive.ics.uci.edu/ml/machine-learning-databases/adult/.
Make sure to download and examine the ``adult.names'' file.
- Load the training set data ``adult.data'' into the workspace.
- Show a histogram of ``race'' for people whose native country
is ``United-States''.
- Show a box plot of ``education-num'', ``capital-gain'' and
``hours-per-week''.
- Convert ``workclass'' and ``race'' to numerical (integer)
attributes. Use a three dimensional plot to show the relationship
between ``workclass'', ``race'', and ``hours-per-week''.
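One possible shape for the adult data steps is sketched below. It makes several assumptions: the data frame is a small synthetic stand-in with only the columns needed here, the dotted column names are a naming choice (adult.data has no header row, so names must be supplied from adult.names), a bar plot of counts is used as the "histogram" of the categorical ``race'' attribute, and lattice::cloud is just one of several ways to draw a 3D scatter plot in R.

```r
# Sketch only: synthetic stand-in for adult.data so the example runs
# without the real file. With the real data one would use, e.g.:
#   adult <- read.csv("adult.data", header = FALSE, strip.white = TRUE)
# and then assign column names taken from adult.names.
library(lattice)                # cloud(); lattice ships with standard R installs
if (!interactive()) pdf(NULL)   # create no plot file when run as a batch script

set.seed(2)
n <- 300
adult <- data.frame(
  workclass      = sample(c("Private", "Self-emp-inc", "State-gov"), n, replace = TRUE),
  race           = sample(c("White", "Black", "Asian-Pac-Islander"), n, replace = TRUE),
  education.num  = sample(1:16, n, replace = TRUE),
  capital.gain   = sample(c(0, 0, 0, 2174, 14084), n, replace = TRUE),
  hours.per.week = sample(20:60, n, replace = TRUE),
  native.country = sample(c("United-States", "Mexico"), n, replace = TRUE)
)

# "Histogram" of a categorical attribute: count the values, then bar-plot them.
us          <- subset(adult, native.country == "United-States")
race_counts <- table(us$race)
barplot(race_counts, main = "Race (native country: United-States)")

# Box plots side by side; separate panels because the scales differ widely.
op <- par(mfrow = c(1, 3))
for (v in c("education.num", "capital.gain", "hours.per.week")) {
  boxplot(adult[[v]], main = v)
}
par(op)

# Categorical -> integer codes (1, 2, ...) in alphabetical order of the labels.
adult$workclass.num <- as.integer(factor(adult$workclass))
adult$race.num      <- as.integer(factor(adult$race))

# 3D scatter plot; lattice plots must be printed explicitly inside scripts.
p <- cloud(hours.per.week ~ workclass.num * race.num, data = adult,
           xlab = "workclass", ylab = "race", zlab = "hours")
print(p)
```

Going through factor() before as.integer() works whether the column was read as character strings or as a factor.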
- More data exploration:
- Select any two attributes and show their joint histogram.
What conclusion can you draw from the histogram?
- Select any three attributes and plot their relationship
using a 2D scatter plot, with one of the selected attributes as the
color code. What can you say about the correlation of these
attributes?
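As a rough illustration of these two plots, the sketch below bins two attributes and cross-tabulates the bins to get a joint histogram, then draws a scatter plot colored by a third attribute. The data are synthetic, and the choice of ``alcohol'', ``density'' and ``quality'', along with the built-in negative relationship between the first two, is an assumption made only so the example has something to show.

```r
# Sketch only: synthetic data with an artificial alcohol/density relationship.
if (!interactive()) pdf(NULL)   # create no plot file when run as a batch script

set.seed(3)
n       <- 500
alcohol <- runif(n, 8, 15)
wine <- data.frame(
  alcohol = alcohol,
  density = 1.01 - 0.0015 * alcohol + rnorm(n, sd = 0.001),
  quality = sample(3:8, n, replace = TRUE)
)

# Joint histogram: bin each attribute, then count records per pair of bins.
alcohol_bin <- cut(wine$alcohol, breaks = seq(8, 15, by = 1), include.lowest = TRUE)
quality_bin <- factor(wine$quality, levels = 3:8)
joint       <- table(alcohol_bin, quality_bin)
print(joint)

# Draw the counts as a heat map; x values are the alcohol bin midpoints.
image(x = seq(8.5, 14.5, by = 1), y = 3:8,
      z = matrix(as.numeric(joint), nrow = nrow(joint)),
      xlab = "alcohol", ylab = "quality", main = "Joint histogram")

# 2D scatter plot of two attributes, colored by the third (quality 3..8).
pal <- colorRampPalette(c("blue", "red"))(6)
plot(wine$alcohol, wine$density, col = pal[wine$quality - 2], pch = 19,
     xlab = "alcohol", ylab = "density")
legend("topright", legend = 3:8, col = pal, pch = 19, title = "quality")
```

The default plot type draws points only, so unsorted points are not connected by line segments.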
- Save the commands you execute for the assignment in a program
file. Make sure to document the program using comments.
- Do not include in your submission the large datasets that we
provided.
- When plotting unsorted points, do not connect them with line
segments.
- Do not include repetitive results. Show only results that have a
purpose.
Please follow this submission procedure:
- The programs you write must be written using R.
- Direct all questions/comments regarding the assignment to:
- On or before the due date, upload the assignment submission to
Blackboard as a zip file. Please do not send assignments via email.
- Report: prepared in a PDF or PS file. The report should
contain a summary of program design issues, description of specific
problems you faced and the way in which you solved them, and sample
input/output results (text/graphic). The report needs to be
sufficiently detailed. Please do not submit MS-Word DOC files.
- Code: all the source code files that are necessary to execute
your program. Please do not submit data files of saved results.
Note: we must be able to view your report and execute your
program in order to grade it.
- The organization of the submitted material should be as follows:
- Create a directory called first_last_ass#, where ``first''/``last''
is your name and ``#'' is the assignment number.
- Inside this directory create three sub-directories called code,
report, and data. Place in these directories the files you need to
submit.
- Please do not use spaces inside file/directory names.
- Do not submit a paper copy of your report. You will be contacted
by email if some material is missing or if you need to meet with
the TA.
- If your submission is late, upload it to Blackboard and
send an email to
indicating that you have uploaded a late submission. ``Late days'' will
be determined by your email date.
Gady Agam
2013-01-31