What is Data Mining?

Data mining is the process of discovering previously unknown, actionable and profitable information from large consolidated databases, often to support business decisions as well as scientific and medical discoveries.

How does Data Mining differ from statistics?

Data mining overlaps significantly with statistics and machine learning, but differs in its procedure. Statistics and machine learning provide powerful microscopes for examining specific phenomena. Data mining is the systematic use of these microscopes to find "nuggets" of value in a mountain of data.

Unfortunately, the loose structure of data mining goals has often been accompanied by a lack of structure and precision in data mining methodology, with the result that only the most glaring features of the data are found.

In this course we will take a middle ground, emphasizing visualization and nonparametric techniques while also employing parametric models as tools to aid visualization (e.g., by removing dominant effects so that smaller features become visible). The alternation between nonparametric and parametric techniques is a key component of our strategy.
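As a rough illustration of this alternation, the sketch below fits a straight line (the parametric step) to remove a dominant trend, then applies a running median (a nonparametric step) to the residuals to reveal a small feature the trend was hiding. The data are synthetic, and the names `remove_linear_trend` and `running_median` are hypothetical helpers chosen for this example, not part of the course material.

```python
import numpy as np

def remove_linear_trend(x, y):
    """Parametric step: fit y ~ a + b*x by least squares, return (a, b, residuals)."""
    b, a = np.polyfit(x, y, deg=1)  # polyfit returns the highest-degree coefficient first
    residuals = y - (a + b * x)
    return a, b, residuals

def running_median(values, window=7):
    """Nonparametric step: median over a sliding window (shorter windows at the edges)."""
    half = window // 2
    n = len(values)
    return np.array([
        np.median(values[max(0, i - half):min(n, i + half + 1)])
        for i in range(n)
    ])

# Synthetic data (an assumption for illustration): a strong linear trend
# that visually swamps a small bump near x = 7.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 101)
bump = 0.5 * np.exp(-((x - 7.0) ** 2) / 0.5)
y = 3.0 * x + 2.0 + bump + rng.normal(scale=0.05, size=x.size)

a, b, residuals = remove_linear_trend(x, y)
smooth = running_median(residuals, window=7)

# With the trend removed, the bump dominates the smoothed residuals.
peak_x = x[np.argmax(smooth)]
print(f"fitted slope: {b:.2f}, residual peak near x = {peak_x:.1f}")
```

Plotting `y` against `x` would show little besides the line, while plotting `smooth` against `x` makes the bump easy to see. The linear model here is only a tool for clearing the view, not a claim about how the data were generated.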

Frequently the emphasis in statistics and machine learning is on finding the simplest, most automated tool in our bag of tricks to achieve the goal (classify objects correctly, show that a difference is significant or that two variables are dependent, etc.). In this process, one ignores certain variables, outlying cases, or unusual cases as necessary. We will also learn to do all of those things, but only to temporarily defocus aspects of the data so that we may focus on something else. In data mining, the outlying cases may be exactly the cases we are interested in.
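One way to "defocus" outlying cases without discarding them is to flag them and keep both views of the data. The sketch below is a minimal example under stated assumptions: it uses a median/MAD-based robust score with the commonly used cutoff of 3.5, which is one convention among several and not a method prescribed by this course. The function name `flag_outliers` and the sample values are hypothetical.

```python
import numpy as np

def flag_outliers(values, threshold=3.5):
    """Return a boolean mask marking values far from the median.

    Uses a robust z-score based on the median absolute deviation (MAD).
    The 0.6745 scaling and the 3.5 cutoff are a common convention,
    assumed here for illustration.
    """
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # No spread around the median: nothing can be scored reliably.
        return np.zeros(values.shape, dtype=bool)
    robust_z = 0.6745 * (values - median) / mad
    return np.abs(robust_z) > threshold

# Made-up measurements with one unusual case.
data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 25.0, 10.0, 9.7])
mask = flag_outliers(data)

bulk = data[~mask]    # study the typical cases with the outlier set aside
unusual = data[mask]  # keep the set-aside cases; they may be the real find

print("bulk:", bulk)
print("unusual:", unusual)
```

Nothing is deleted: `bulk` supports the usual summaries and models, while `unusual` remains available for closer inspection later.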

Tom Minka
Last modified: Mon Aug 27 15:07:57 EDT 2001