CS-422 - Homework 2 (5%)

Decision Trees

Due by: March 5, 2013

Assignment Specifications

In this assignment you will implement techniques for decision tree classifiers, and apply them to classify a sampled spam email data set. Sample data files are available at http://archive.ics.uci.edu/ml/datasets/Spambase. The grade for this assignment will be based on your implementation of the algorithms, the thoroughness of your evaluation of the algorithm and the results you obtain, and the clarity of your report.

  1. Test the decision tree classifier provided by R (rpart provided in the rpart package).

    1. Load the data and inspect its attributes to get an idea of the complexity of the problem.
    2. Choose the first 80% of the data for training and the remaining 20% data for testing.
    3. Use the rpart function to create a tree from the training data.
    4. Use the predict function to apply the generated tree to the training and testing data, respectively. Produce the confusion matrix and calculate accuracy, precision, and recall.
    5. Use the prune function to prune the generated tree, and repeat step 4.
    6. Plot the generated (pruned) tree using the plot function or another function of your choice.
  2. Implement your own decision tree induction algorithm and prune function.

    1. Make your tree induction function support different impurity measures (Gini index, entropy, misclassification error).
    2. Use the same training data as above to create a decision tree using your implementation.
    3. Since your implementation may return a data structure different from the one that the predict function (in the rpart package) uses, implement your own function that applies your tree to classify data (the predict step) and returns the class label.
    4. Produce the confusion matrix and calculate the accuracy, precision and recall on the training data and testing data, respectively.
    5. Your prune function can be based on either the pessimistic error or the minimum description length measure.
    6. Prune the generated tree using your prune function, and compare the pruned tree with the one produced using the rpart package.
    7. Generate trees using different impurity measures and observe the differences between the generated trees (pruned using the same standard).
    8. Evaluate different levels of pruning and compare the results.
    9. Plot the generated (pruned) tree using the plot function or another function of your choice.
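
The rpart workflow in part 1 (split, fit, predict, evaluate, prune, plot) might be sketched as follows. This is only an illustration under stated assumptions: the data frame `toy` is synthetic stand-in data rather than Spambase, the helper `evaluate` and the choice of "spam" as the positive class are hypothetical, and picking the complexity parameter with the lowest cross-validated error is just one possible pruning rule.

```r
library(rpart)

# Synthetic stand-in for the spam data (NOT Spambase): two numeric
# features and a binary class label, with rows in random order.
set.seed(1)
n <- 500
x1 <- runif(n)
x2 <- runif(n)
noise <- rnorm(n, sd = 0.1)
toy <- data.frame(
  x1 = x1,
  x2 = x2,
  label = factor(ifelse(x1 + x2 + noise > 1, "spam", "ham"),
                 levels = c("ham", "spam"))
)

# First 80% of the rows for training, remaining 20% for testing.
n_train <- floor(0.8 * nrow(toy))
train <- toy[seq_len(n_train), ]
test  <- toy[(n_train + 1):nrow(toy), ]

# Fit a classification tree on the training data.
fit <- rpart(label ~ ., data = train, method = "class")

# Hypothetical helper: confusion matrix plus accuracy, precision,
# and recall with respect to one (assumed) positive class.
evaluate <- function(model, data, positive = "spam") {
  pred <- predict(model, newdata = data, type = "class")
  cm <- table(actual = data$label, predicted = pred)
  tp <- cm[positive, positive]
  list(
    confusion = cm,
    accuracy  = sum(diag(cm)) / sum(cm),
    precision = tp / sum(cm[, positive]),
    recall    = tp / sum(cm[positive, ])
  )
}

train_eval <- evaluate(fit, train)
test_eval  <- evaluate(fit, test)

# Prune at the complexity parameter with the lowest cross-validated error.
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp)
pruned_eval <- evaluate(pruned, test)

# Draw the pruned tree (guarded in case it collapsed to a single node).
if (nrow(pruned$frame) > 1) {
  plot(pruned, margin = 0.1)
  text(pruned, use.n = TRUE)
}
```

Comparing `train_eval` with `test_eval`, and `test_eval` with `pruned_eval`, gives a rough picture of how much the unpruned tree overfits.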

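The three impurity measures named in part 2 can be computed from the class proportions at a node. The sketch below is one possible formulation, not a required interface: the function names `impurity` and `split_gain`, their arguments, and the four-label example are all assumptions made for illustration.

```r
# Impurity of a node, given the vector of class labels that reach it.
impurity <- function(labels, measure = c("gini", "entropy", "error")) {
  measure <- match.arg(measure)
  p <- as.numeric(table(labels)) / length(labels)  # class proportions
  p <- p[p > 0]                                    # treat 0 * log2(0) as 0
  switch(measure,
    gini    = 1 - sum(p^2),
    entropy = -sum(p * log2(p)),
    error   = 1 - max(p)
  )
}

# Impurity reduction from splitting a node into left and right children:
# parent impurity minus the size-weighted impurity of the children.
split_gain <- function(labels, go_left, measure = "gini") {
  left  <- labels[go_left]
  right <- labels[!go_left]
  if (length(left) == 0 || length(right) == 0) return(0)
  weighted <- (length(left)  * impurity(left, measure) +
               length(right) * impurity(right, measure)) / length(labels)
  impurity(labels, measure) - weighted
}

labels <- c("spam", "spam", "ham", "ham")
impurity(labels, "gini")     # 0.5
impurity(labels, "entropy")  # 1
impurity(labels, "error")    # 0.5
split_gain(labels, c(TRUE, TRUE, FALSE, FALSE))  # 0.5: a perfect split
```

Passing the measure as an argument keeps the induction code itself unchanged when switching between Gini index, entropy, and misclassification error.
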
General comments



Gady Agam 2013-02-21