CS-422 Data Mining

Spring 2013 (Tue, Thu 5:00pm - 6:15pm, LS-111)

Administrative information

Instructor

Gady Agam, x7-5834
Office hours (SB-237e): Tue, Thu, 6:30pm - 7:30pm

Teaching assistant

Lin Gan, x7-5705
Office hours (SB-115): Mon 4:30pm - 6:00pm, Wed 11:00am - 12:30pm

Grading

name	comment	weight
participation	full credit with up to 4 unjustified missed classes	5%
assignment 1	exploring data	5%
assignment 2	decision trees	5%
assignment 3	classification	5%
assignment 4	association analysis	5%
assignment 5	clustering	5%
assignment 6	anomaly detection	5%
assignment 7	mining network data	5%
midterm exam	open notes (one paper notebook/binder)	15%
final exam	open notes (one paper notebook/binder)	45%
total		100%
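
The weights above combine into a course grade as a weighted sum. The following is a minimal sketch of that arithmetic, not an official grading tool. The function name `course_grade` and the 0-100 score scale are assumptions made for illustration; only the weights come from the table.

```python
# Weights copied from the grading table (percent of the course grade).
WEIGHTS = {
    "participation": 5,
    "assignment 1": 5,
    "assignment 2": 5,
    "assignment 3": 5,
    "assignment 4": 5,
    "assignment 5": 5,
    "assignment 6": 5,
    "assignment 7": 5,
    "midterm exam": 15,
    "final exam": 45,
}


def course_grade(scores):
    """Return the weighted course grade on a 0-100 scale.

    `scores` maps each component name to a score in [0, 100] (assumed scale).
    Components missing from `scores` count as 0.
    """
    return sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS) / 100.0


if __name__ == "__main__":
    example = {name: 80.0 for name in WEIGHTS}
    example["final exam"] = 90.0
    print(course_grade(example))
```

Because the final exam carries 45% of the grade, a 10-point change on the final moves the course grade by 4.5 points, while the same change on a single assignment moves it by 0.5 points.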

Notes:

There is an additional mandatory assignment (assignment 0) which does not carry any credit. There is a penalty of 5% for not submitting this assignment.
A certain percentage of the students may be invited to discuss their assignments.
Late days: there is a total of 4 "late days" for all the assignments combined. After these are used, each additional late day costs 10%. Late days do not include weekends and university holidays. The final project cannot be late. Assignments cannot be submitted after classes end.
Each member of this course bears responsibility for maintaining the highest standards of academic integrity. All breaches of academic integrity must be reported immediately. Copying of programs from any source (e.g. other students or the web) is considered to be a serious breach of academic integrity.
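
The late-day rule in the notes above can be sketched in code. This is one possible reading of the policy, not an official calculator. It assumes that lateness is counted in whole calendar days, that the 4 free days are spent in the order assignments are submitted, and that the 10% is deducted from the affected assignment. The names `late_days` and `penalties`, and the `holidays` parameter, are hypothetical.

```python
from datetime import date, timedelta

FREE_LATE_DAYS = 4    # shared budget across all assignments (from the notes)
PENALTY_PER_DAY = 10  # percent deducted per late day beyond the budget


def late_days(due, submitted, holidays=frozenset()):
    """Count the days after `due` up to and including `submitted`,
    skipping weekends and any dates listed in `holidays`."""
    count = 0
    day = due + timedelta(days=1)
    while day <= submitted:
        # weekday() is 0-4 for Monday-Friday, 5-6 for the weekend.
        if day.weekday() < 5 and day not in holidays:
            count += 1
        day += timedelta(days=1)
    return count


def penalties(days_late_per_assignment):
    """Return the percent penalty for each assignment, spending the free
    late-day budget in list order (an assumed ordering)."""
    remaining = FREE_LATE_DAYS
    result = []
    for days in days_late_per_assignment:
        free = min(days, remaining)
        remaining -= free
        result.append((days - free) * PENALTY_PER_DAY)
    return result


if __name__ == "__main__":
    # Due on a Thursday, submitted the following Monday: Friday and Monday count.
    print(late_days(date(2013, 1, 24), date(2013, 1, 28)))
    print(penalties([1, 2, 3, 0]))
```

Under these assumptions, a student who is 1, 2, and 3 days late on three assignments uses the free budget on the first two and one day of the third, leaving 2 penalized days (20%) on the third.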

Course outline

What to expect from this course

Data mining can be covered at different levels. This course focuses on understanding the algorithms and techniques used in data mining. Students are expected to write computer programs implementing the techniques taught in the course. The course requires a mathematical background and some programming experience. It is not intended to teach the use of a specific software application. While software APIs will be used in the course, learning them is by no means its primary goal.

Objectives

Provide an overview of data mining.
Provide an understanding of the mathematical concepts and algorithms used in data mining.
Provide programming experience for developing and implementing data mining applications.
Exercise communication skills via written assignment reports.
Note: Students will be required to submit programming assignments which include code implementation and written reports. The grade for assignments will be based on the quality of the program implementation and the quality of the written reports.

Outline

Introduction: overview of data mining, data mining tasks, data mining software (TSK ch. 1)
Processing and visualizing data: data types, data quality, data preprocessing, measures of similarity, visualization (TSK ch. 2-3)
Decision trees: decision tree induction, overfitting, evaluating performance, comparing classifiers (TSK ch. 4)
Classification: rule-based classifiers, nearest-neighbor classifiers, Bayesian classifiers, support vector machines, neural networks, ensemble methods (TSK ch. 5)
Association analysis: frequent itemset generation, rule generation, compact representation, evaluation, categorical attributes (TSK ch. 6-7)
Cluster analysis: K-means, agglomerative hierarchical clustering, DBSCAN, evaluation, self-organizing maps (TSK ch. 8-9)
Anomaly detection: causes, statistical approaches, proximity methods, density methods, clustering methods (TSK ch. 10)
Network data (time permitting): centrality and prestige, citation analysis, PageRank, authorities and hubs.

Required text

Introduction to Data Mining. P.-N. Tan, M. Steinbach, and V. Kumar. Pearson Education, 2006.

Additional references

Data Mining: Practical Machine Learning Tools and Techniques. I. H. Witten, E. Frank, and M. A. Hall. Morgan Kaufmann, 2011.
Data Mining: Concepts and Techniques. J. Han, M. Kamber, and J. Pei. Morgan Kaufmann, 2011.

Tentative schedule

class	date	topic	assignment

1	01/15	Introduction to data mining	AS0
2	01/17	data
3	01/22
4	01/24	exploring data	AS1
5	01/29
6	01/31
7	02/05	No class
8	02/07	No class
9	02/12	decision trees	AS2
10	02/14
11	02/19
12	02/21	classification
13	02/26		AS3
14	02/28
15	03/05
16	03/07	Midterm
17	03/12	association analysis	AS4
18	03/14
19	03/19	No class (spring break)
20	03/21	No class (spring break)
21	03/26
22	03/28		AS5
23	04/02	clustering
24	04/04
25	04/09
26	04/11		AS6
27	04/16	anomaly detection
28	04/18
29	04/23
30	04/25	graph data
31	04/30		AS7
32	05/02
33	05/07	Final exam: 5:00pm-7:00pm (LS-111)

Gady Agam 2013-01-16