36350 Data Mining, Fall 2002
www.stat.cmu.edu/~minka/courses/36350/
Instructor:
Tom Minka, Statistics Dept, Baker Hall 228D, minka@stat.cmu.edu
Teaching Assistant:
Fang Chen, Baker Hall A60D, fangc@stat.cmu.edu
Lectures: Monday and Wednesday, 10:30-11:20, CFA 211
Computer labs: Friday, 10:30-11:20, Baker 140F
Overview
Data mining is the conversion of data into knowledge. Advances in
computing technology have led to great opportunities in collecting and
analyzing data, which is now a major part of science, medicine, business,
and government. The purpose of this course is to help you take advantage of
these opportunities.
Data mining overlaps significantly with statistics and machine learning,
but differs in its procedure. Statistics and machine learning
methods provide powerful microscopes for examining specific phenomena.
Data mining is the systematic use of these microscopes to find "nuggets" of
value in a mountain of data.
Course Objectives
The aim of the course is to provide you with a comprehensive introduction
to contemporary data mining practice and principles.
You will learn to:

Determine which method to apply in a given situation,

Program statistical software to carry out the method, and

Communicate the results in terms relevant to
science, business, etc.
Schedule

Searching for similar objects
 Searching the web, document collections, and image databases

Visualizing and exploring data
 Multivariate geometry, projection, parallel plots

Clustering and segmentation
 Customer profiling, market segmentation, changepoints in time

Predictive modeling
 Modeling prices, predicting sales

Characterizing subgroups
 Profitable customers, fraudulent activity, junk email

Finding patterns and rules
 Market basket analysis, demographic associations
Format

The course is taught in a lecture format on Monday and Wednesday and
via hands-on practice in a Friday computer lab.

Lectures will contain the material needed to complete the homeworks and lab
assignments. There is no text for the course.
Consequently, attendance and participation in class are
critical for learning.

In the computer labs, you will learn how to use different data mining
methods on a dataset. Computer labs are mandatory. Each Friday, a LAB
ASSIGNMENT will be handed out in class (and posted on the web) that must be
turned in (or emailed to us) at the end of class.
Labs can only be missed for the reasons described in the student handbook
(medical, university event, or religious holiday).
Computer labs will use a free software package called R, which is similar to S-Plus.
Unlike Data Desk or Minitab, R is a full-fledged programming language, so
it is not limited to a fixed menu of operations.

HOMEWORK will be assigned weekly. The purpose of these assignments is to
improve your understanding of the methods and their results.
They are also scheduled to encourage you to keep up with the class.
Homework will involve answering questions
related to the lectures and lab assignment.
While the lab mainly involves
generating data summaries and statistics, homework is focused on
interpretation.
Homework will be posted on the web
and can be handed in or emailed to us.
Reading and homework should take about six hours per week.

There will be a comprehensive FINAL EXAMINATION for the course,
during the final exam period.
Final grade breakdown:

Homework: 30%

Labs: 30%

Final exam: 40%
Late homework will not be accepted without a written medical excuse.
Each homework assignment will be worth 100 points. These points will be
divided approximately equally among the parts of the assignment.
The lowest homework grade will be dropped, unless it is the last
assignment of the semester, which is mandatory. The remaining homework
grades will be used to compute the homework average.
The same procedure is used for computer lab grades.
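As an illustration of the grading arithmetic above, here is a minimal sketch in Python (not the course software). It rests on several assumptions that the syllabus does not state: that lab assignments are also scored out of 100, that the exam score is on a 0-100 scale, and that when the lowest grade is the mandatory last assignment, no grade is dropped at all. The function names are hypothetical.

```python
def category_average(scores):
    """Average a list of 0-100 scores, given in the order assigned.

    The lowest score is dropped unless it belongs to the last
    assignment, which is mandatory. Assumption: if the last
    assignment alone holds the lowest score, nothing is dropped.
    """
    if not scores:
        raise ValueError("need at least one score")
    lowest = min(scores)
    # Positions holding the lowest score, excluding the last assignment.
    droppable = [i for i, s in enumerate(scores[:-1]) if s == lowest]
    if droppable:
        i = droppable[0]
        kept = scores[:i] + scores[i + 1:]
    else:
        kept = list(scores)
    return sum(kept) / len(kept)


def final_grade(homework, labs, exam):
    """Combine the three components with the 30/30/40 weights above."""
    return (0.30 * category_average(homework)
            + 0.30 * category_average(labs)
            + 0.40 * exam)


# Made-up scores, for illustration only.
print(final_grade([80, 60, 90, 100], [100, 70, 90], 85))
```

With these made-up scores, the 60 and the 70 are dropped, so the homework and lab averages are 90 and 95 before weighting.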
All work and computer code must be your own.
Sharing code or results will lead to zero credit and a letter to your dean.
See the CMU Student Handbook on Cheating and Plagiarism.