CS-422 - Homework 2 (5%)
Decision Trees
Due by: March 5, 2013
In this assignment you will implement techniques for decision tree classifiers,
and apply them to classify a sampled spam email data set. Sample data
files are available at http://archive.ics.uci.edu/ml/datasets/Spambase.
The grade for this assignment will be based on your implementation of the algorithms, the thoroughness
of your evaluation of the algorithm and the results you obtain, and the clarity
of your report.
1. Test the decision tree classifier provided by R (the rpart function in the
   rpart package).
   (a) Load the data and check its attributes to get an idea of the
       complexity of the problem.
   (b) Use the first 80% of the data for training and the remaining 20%
       for testing.
   (c) Use the rpart function to create a tree from the training data.
   (d) Use the predict function to apply the generated tree to the training
       and testing data, respectively. Produce the confusion matrix and
       calculate accuracy, precision, and recall.
   (e) Use the prune function to prune the generated tree, and repeat step
       (d).
   (f) Plot the generated (pruned) tree using the plot function or another
       function of your choice.
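
The split and evaluation steps above do not depend on rpart, so they can be sketched independently of the classifier. The assignment recommends R but does not require it; the sketch below uses Python. The function names (split_first_fraction, confusion_counts, evaluate) and the choice of label 1 as the positive (spam) class are assumptions made for illustration, not part of the assignment.

```python
from collections import Counter


def split_first_fraction(rows, train_fraction=0.8):
    """Return (train, test): the first train_fraction of rows, then the rest."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem.

    Assumption: `positive` is the label treated as the positive class.
    """
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts["tp"], counts["fp"], counts["fn"], counts["tn"]


def evaluate(actual, predicted, positive=1):
    """Compute accuracy, precision, and recall from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(actual, predicted, positive)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


if __name__ == "__main__":
    train, test = split_first_fraction(list(range(10)))
    print(len(train), len(test))
    print(evaluate([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

Calling evaluate once on the training predictions and once on the testing predictions gives the two sets of figures that step (d) asks for.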
2. Implement your own decision tree induction algorithm and prune function.
   (a) Make your tree induction function support different impurity measures
       (Gini index, entropy, misclassification error).
   (b) Use the same training data as above to create a decision tree using
       your implementation.
   (c) Since your implementation may return a data structure different from
       the one that the predict function (in the rpart package) uses,
       implement your own function to apply your tree to classify data (the
       predict step) and return the class label.
   (d) Produce the confusion matrix and calculate the accuracy, precision,
       and recall on the training data and testing data, respectively.
   (e) Your prune function can be based on the pessimistic error or the
       minimum description length measure.
   (f) Prune the generated tree using your prune function, and compare the
       pruned tree with the one produced using the rpart package.
   (g) Generate trees using different impurity measures and observe the
       differences between the generated trees (pruned using the same
       criterion).
   (h) Evaluate different levels of pruning and compare the results.
   (i) Plot the generated (pruned) tree using the plot function or another
       function of your choice.
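
As a starting point for the impurity measures and the pessimistic error named above, the following is a minimal Python sketch, not a reference solution. The function names, the measure argument, and the default penalty of 0.5 per leaf are assumptions; the pessimistic error is taken here as (training errors + penalty x number of leaves) / number of training records, which is one common formulation, so check it against the definition used in the course.

```python
import math
from collections import Counter


def impurity(labels, measure="gini"):
    """Impurity of the class labels at one node.

    measure is "gini", "entropy", or "error" (misclassification error).
    An empty node is treated as pure (impurity 0.0), an assumption made
    so that degenerate splits do not raise.
    """
    if not labels:
        return 0.0
    n = len(labels)
    props = [count / n for count in Counter(labels).values()]
    if measure == "gini":
        return 1.0 - sum(p * p for p in props)
    if measure == "entropy":
        return sum(-p * math.log2(p) for p in props)
    if measure == "error":
        return 1.0 - max(props)
    raise ValueError("unknown impurity measure: " + measure)


def split_impurity(left, right, measure="gini"):
    """Size-weighted average impurity of the two children of a binary split."""
    n = len(left) + len(right)
    if n == 0:
        return 0.0
    return (len(left) * impurity(left, measure)
            + len(right) * impurity(right, measure)) / n


def pessimistic_error(training_errors, n_leaves, n_train, penalty=0.5):
    """Training error rate with a fixed complexity penalty per leaf."""
    return (training_errors + penalty * n_leaves) / n_train


if __name__ == "__main__":
    node = [1, 1, 0, 0]
    for name in ("gini", "entropy", "error"):
        print(name, impurity(node, name))
    print(split_impurity([1, 1], [0, 0], "gini"))
    print(pessimistic_error(training_errors=4, n_leaves=7, n_train=24))
```

A tree induction function can choose, at each node, the candidate split with the lowest split_impurity. A prune function based on the pessimistic error can replace a subtree with a leaf whenever doing so does not increase the estimate.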
- You are advised (but not required) to use R for implementing the
decision tree algorithms.
- Write your code in a modular way using functions and make sure to
document it.
- Do not include in your submission large datasets that we provided.
- Do not include repetitive results. Show only results that have a purpose.
- Follow the electronic submission instructions of assignment 1.
Gady Agam
2013-02-21