CS-422 Data Mining

Spring 2013 (Tue, Thu 5:00pm - 6:15pm, LS-111)


Administrative information

Instructor

Teaching assistant

Grading

name           comment                                            weight
participation  up to 4 unjustified missed classes => full credit  5%
assignment 1   exploring data                                     5%
assignment 2   decision trees                                     5%
assignment 3   classification                                     5%
assignment 4   association analysis                               5%
assignment 5   clustering                                         5%
assignment 6   anomaly detection                                  5%
assignment 7   mining network data                                5%
midterm exam   open notes (one paper notebook/binder)             15%
final exam     open notes (one paper notebook/binder)             45%
total                                                             100%

Notes:

  1. There is an additional mandatory assignment (assignment 0) which does not carry any credit. There is a penalty of 5% for not submitting this assignment.
  2. A certain percentage of the students may be invited to discuss their assignments.
  3. Late days: there is a total of 4 "late days" for all the assignments. After those are used, each additional late day carries a 10% penalty. Late days do not include weekends and university holidays. The final project cannot be late. Assignments cannot be submitted after classes end.
  4. Each member of this course bears responsibility for maintaining the highest standards of academic integrity. All breaches of academic integrity must be reported immediately. Copying of programs from any source (e.g. other students or the web) is considered to be a serious breach of academic integrity.

Course outline

What to expect from this course

Data mining can be covered at different levels. This course focuses on understanding the algorithms and techniques used in data mining. Students are expected to write computer programs implementing the techniques taught. The course requires a mathematical background and some programming experience. It is not intended to teach how to use a specific software application. While software APIs will be used in the course, learning them is by no means its primary goal.

Objectives

Outline

  1. Introduction: overview of data mining, data mining tasks, data mining software (TSK ch. 1)

  2. Processing and visualizing data: data types, data quality, data preprocessing, measures of similarity, visualization (TSK ch. 2-3)

  3. Decision trees: decision tree induction, overfitting, evaluating performance, comparing classifiers (TSK ch. 4)

  4. Classification: rule-based classifiers, nearest-neighbor classifiers, Bayesian classifiers, support vector machines, neural networks, ensemble methods (TSK ch. 5)

  5. Association analysis: frequent itemset generation, rule generation, compact representation, evaluation, categorical attributes (TSK ch. 6-7)

  6. Cluster analysis: K-means, agglomerative hierarchical clustering, DBSCAN, evaluation, self-organizing maps (TSK ch. 8-9)

  7. Anomaly detection: causes, statistical approaches, proximity methods, density methods, clustering methods (TSK ch. 10)

  8. Network data (time permitting): centrality and prestige, citation analysis, PageRank, authorities and hubs.

Required text

  1. Introduction to Data Mining. P.-N. Tan, M. Steinbach, and V. Kumar. Pearson Education, 2006.

Additional references

  1. Data Mining: Practical Machine Learning Tools and Techniques. I. H. Witten, E. Frank, and M. A. Hall. Morgan Kaufmann, 2011.
  2. Data Mining: Concepts and Techniques. J. Han, M. Kamber, and J. Pei. Morgan Kaufmann, 2011.

Tentative schedule


class  date   topic                               assignment
1      01/15  Introduction to data mining         AS0
2      01/17  data
3      01/22
4      01/24  exploring data                      AS1
5      01/29
6      01/31
7      02/05  No class
8      02/07  No class
9      02/12  decision trees                      AS2
10     02/14
11     02/19
12     02/21  classification
13     02/26                                      AS3
14     02/28
15     03/05
16     03/07  Midterm
17     03/12  association analysis                AS4
18     03/14
19     03/19  No class (spring break)
20     03/21  No class (spring break)
21     03/26
22     03/28                                      AS5
23     04/02  clustering
24     04/04
25     04/09
26     04/11                                      AS6
27     04/16  anomaly detection
28     04/18
29     04/23
30     04/25  graph data
31     04/30                                      AS7
32     05/02
33     05/07  Final exam: 5:00pm-7:00pm (LS-111)


Gady Agam 2013-01-16