36-350: Data Mining, Fall 2003
www.stat.cmu.edu/~minka/courses/36-350/
Instructor: Tom Minka, Statistics Dept, Baker Hall 228D, minka@stat.cmu.edu
Teaching Assistant: Fang Chen, Baker Hall A60D, fangc@stat.cmu.edu
Lectures: Monday and Wednesday, 10:30-11:20, CFA 211
Computer labs: Friday, 10:30-11:20, Baker 140F
Overview
Data mining is the conversion of data into knowledge. Advances in
computing technology have led to great opportunities in collecting and
analyzing data, which is now a major part of science, medicine, business,
and government. The purpose of this course is to help you take advantage of
these opportunities.
Data mining has significant overlap with statistics and machine learning,
but differs from them in its procedure. Statistics and machine learning
methods provide powerful microscopes for examining specific phenomena.
Data mining is the systematic use of these microscopes to find "nuggets" of
value in a mountain of data.
Course Objectives
The aim of the course is to provide you with a comprehensive introduction
to contemporary data mining practice and principles.
You will learn to:
- Determine which method to apply in a given situation,
- Program statistical software to carry out the method, and
- Communicate the results in terms relevant to science, business, etc.
Schedule
- Searching for similar objects
  - Searching the web, document collections, and image databases
- Clustering and segmentation
  - Organizing multimedia (documents and images),
    market segmentation (houses and cars)
- Visualizing and exploring data
  - Multivariate geometry, projection, parallel plots
- Predictive modeling
  - Modeling prices, predicting sales
- Characterizing subgroups
  - Profitable customers, good investments, junk e-mail
- Modeling time trends
  - The environment, economic trends, individual customer purchases
Format
- The course combines lectures on Monday and Wednesday with
  hands-on practice in a Friday computer lab.
- Lectures will contain the material needed to complete the homework and lab
  assignments. There is no text for the course.
  Consequently, attendance and participation in class are
  critical for learning.
- There will be a weekly office hour
  where you can meet one-on-one with the teaching assistant or instructor.
  Its schedule will be determined at the beginning of the semester.
- In the computer labs, you will learn how to use different data mining
  methods on a dataset. Computer labs are mandatory. Each Friday, a lab
  assignment will be handed out that must be
  completed during the lab period.
  There is nothing to hand in for the assignment; instead,
  you must get the attention of the lab assistants,
  who will check your results and give you credit for the lab.
  If you can't finish in time,
  you will get partial credit for what you have completed.
  Computer labs will use a free software package called R, which is similar to S-PLUS.
  Unlike Data Desk or Minitab, R is a full-fledged programming language, so
  it is not restricted to a fixed set of built-in operations.
- Homework will be assigned weekly. The purpose of these assignments is to
  improve your understanding of the methods and their results.
  They are also scheduled to encourage you to keep up with the class.
  Homework will involve answering questions
  related to the lectures and lab assignment.
  It will be posted on the web
  and can be handed in or emailed to us.
  Reading and homework should take about six hours per week.
- There will be a comprehensive final examination for the course
  during the final exam period.
Final grade breakdown:
- Homework: 30%
- Labs: 30%
- Final exam: 40%
Each homework assignment will be worth 100 points. These points will be
divided approximately equally among the parts of the assignment, with
adjustments for difficulty.
The lowest homework grade will be dropped, unless it is the last
assignment of the semester, which is mandatory. The remaining homework
grades will be used to compute the homework average.
The same procedure is used for computer lab grades.
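The grading rules above can be illustrated with a short sketch. This is only an illustration, not the instructor's actual grading script. It assumes that lab and exam scores are on the same 0-100 scale as homework, and that when the last assignment is itself the lowest score, nothing is dropped (the policy does not say what happens in that case). The function names are hypothetical.

```python
def average_with_drop(scores):
    """Average the scores after dropping the lowest one.

    Assumption: the last score is mandatory and never dropped. If the
    lowest score is the last one (and no earlier score ties it),
    nothing is dropped.
    """
    if not scores:
        raise ValueError("need at least one score")
    lowest = min(scores)
    # Earlier assignments (all but the last) that tie for the lowest score.
    droppable = [i for i, s in enumerate(scores[:-1]) if s == lowest]
    kept = list(scores)
    if droppable:
        del kept[droppable[0]]  # drop one lowest, non-final score
    return sum(kept) / len(kept)


def final_grade(homework, labs, exam):
    """Weighted course grade: 30% homework, 30% labs, 40% final exam."""
    hw_avg = average_with_drop(homework)
    lab_avg = average_with_drop(labs)  # same procedure as homework
    return 0.30 * hw_avg + 0.30 * lab_avg + 0.40 * exam


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    print(final_grade(homework=[80, 60, 90, 100], labs=[100, 70, 100], exam=85))
```

With these made-up scores, the 60 in homework and the 70 in labs are dropped, giving averages of 90 and 100 before the weights are applied.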
Extensions:
- The standard extensions
(medical, university event, or religious holiday)
must be accompanied by an official form as described in the student handbook.
- One unofficial homework/lab extension may be taken
during the term.
- Other late homework will have points deducted, at the discretion of the
instructor.
All work and computer code must be your own.
Sharing code or results will lead to zero credit and a letter to your dean.
See the CMU Student Handbook on Cheating and Plagiarism.