Statistical Learning Overview

Examples of statistical learning applied at the Media Lab:
User interfaces
modeling subjectivity and affect, intelligent agents, transduction (input from camera, microphone, or fish sensor)

Recovering visual models
face recognition, model-based video, avatars

Dynamical systems
speech recognition, visual tracking, gesture recognition, virtual instruments

Probabilistic modeling
image compression, low bandwidth teleconferencing, texture synthesis


All can be placed somewhere in this taxonomy, ordered from specific to general:
Function learning

Learn a mathematical function from (x, y) pairs, e.g. the best fit line. Also known as regression. Classification is the special case where y is nominal (the values have no intrinsic meaning or ordering).
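
For concreteness, here is a minimal Python sketch of line fitting by least squares; the data are made up.

    # Learn the best-fit line y = a*x + b from (x, y) pairs by least squares.
    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # example inputs (made up)
    y = np.array([0.9, 3.1, 5.2, 6.8, 9.1])   # example outputs (made up)

    A = np.vstack([x, np.ones_like(x)]).T     # design matrix [x, 1]
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    print("slope %.2f, intercept %.2f" % (a, b))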

Training may be based on reinforcement, as is common in real life, rather than on explicit examples. However, reinforcement can be reduced to the case of explicit examples using dynamic programming.
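
A minimal sketch of that dynamic-programming reduction, on an invented two-state, two-action problem:

    # Value iteration: turn delayed reinforcement into explicit value targets.
    # T[s][a] = (next_state, reward); all numbers are invented.
    T = {0: {0: (0, 0.0), 1: (1, 1.0)},
         1: {0: (0, 0.0), 1: (1, 2.0)}}
    gamma = 0.9                                   # discount factor
    V = {s: 0.0 for s in T}
    for _ in range(100):
        V = {s: max(r + gamma * V[s2] for (s2, r) in T[s].values()) for s in T}
    # The converged values yield explicit (state -> best action) training examples.
    print(V)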

Regression can be reduced to surface learning or conditional probability learning, which are therefore harder.
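
For example, a joint Gaussian fitted to the (x, y) pairs already contains the regression line as the conditional mean E[y | x]; a sketch, reusing the made-up data above:

    # Reduce regression to probability learning: estimate the joint density of
    # (x, y) as a Gaussian, then read off the regression line as E[y | x].
    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # same made-up data as above
    y = np.array([0.9, 3.1, 5.2, 6.8, 9.1])

    cov = np.cov(x, y)                         # 2x2 covariance of the joint Gaussian
    slope = cov[0, 1] / cov[0, 0]              # E[y|x] = mean_y + slope * (x - mean_x)
    intercept = y.mean() - slope * x.mean()
    print("slope %.2f, intercept %.2f" % (slope, intercept))   # matches the direct fit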

Surface learning

Learn a mathematical relation from (x, y) pairs, e.g. the best fit circle. An x may be associated with zero, one, or multiple y's.
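
A minimal sketch of circle fitting by algebraic least squares (one of several possible criteria), on synthetic data:

    # Best-fit circle: solve x^2 + y^2 + D*x + E*y + F = 0 in the least-squares sense.
    import numpy as np

    rng = np.random.default_rng(0)
    t = np.linspace(0, 2 * np.pi, 30)
    x = 2.0 + 1.5 * np.cos(t) + 0.05 * rng.normal(size=30)   # noisy points on a circle
    y = -1.0 + 1.5 * np.sin(t) + 0.05 * rng.normal(size=30)  # (made-up example)

    A = np.column_stack([x, y, np.ones_like(x)])
    b = -(x ** 2 + y ** 2)
    (D, E, F), *_ = np.linalg.lstsq(A, b, rcond=None)
    cx, cy = -D / 2, -E / 2
    r = np.sqrt(cx ** 2 + cy ** 2 - F)
    print("center (%.2f, %.2f), radius %.2f" % (cx, cy, r))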

Can be reduced to unconditional probability learning, which is therefore harder.

Probability learning

Learn a conditional or unconditional PMF or PDF from occurrences, e.g. guess the bias of a coin from the outcomes of a series of flips. Also known as unsupervised learning. Conditional PDFs can be represented by unconditional PDFs and vice versa. Common techniques are to use a mixture (sum) of simple PDFs of the same type, to multiply simple PDFs, or to convolve simple PDFs (add the corresponding random variables).
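
A sketch of the coin example and of building a mixture density; the flips and mixture parameters are made up:

    # Maximum-likelihood guess of a coin's bias, and a mixture (sum) of two
    # simple PDFs.
    import numpy as np

    flips = np.array([1, 0, 1, 1, 0, 1, 1, 1])     # 1 = heads
    bias_hat = flips.mean()                        # ML estimate of P(heads)

    def normal_pdf(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    x = np.linspace(-4.0, 6.0, 400)
    mixture = 0.3 * normal_pdf(x, 0.0, 1.0) + 0.7 * normal_pdf(x, 3.0, 0.5)
    print("estimated bias %.2f" % bias_hat)
    print("mixture integrates to about %.3f" % (mixture.sum() * (x[1] - x[0])))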

This is the hardest of all. Consider that you never directly observe a probability. Probability learning is also the most lucrative, since it subsumes the rest. The learning process itself can be modeled with probabilities (so-called Bayesian learning).
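
For instance, a Beta prior on the coin's bias becomes a Beta posterior after observing the flips; a sketch, assuming a uniform prior:

    # Bayesian learning of the coin's bias: start from a uniform Beta(1,1) prior
    # and update it with the observed counts (6 heads, 2 tails, as above).
    heads, tails = 6, 2
    a0, b0 = 1.0, 1.0                      # Beta(1,1): an assumed uniform prior
    a, b = a0 + heads, b0 + tails          # posterior is Beta(a, b)
    print("posterior mean of the bias: %.2f" % (a / (a + b)))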


Two fundamental questions in statistical learning:
Incorporating prior knowledge

For example, how can a learning algorithm take advantage of the fact that the functions in the domain are periodic? quasi-periodic? smooth? How can we represent and visualize such knowledge?
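
As one possible answer to the first question: if the period P is known, that knowledge can be built in by regressing onto sinusoids of period P. A sketch with made-up data:

    # Encode the prior knowledge "f has period P" via periodic basis functions.
    import numpy as np

    rng = np.random.default_rng(0)
    P = 2.0                                        # assumed known period
    x = np.linspace(0.0, 10.0, 50)
    y = 1.5 * np.sin(2 * np.pi * x / P) + 0.1 * rng.normal(size=50)   # toy data

    features = np.column_stack([np.sin(2 * np.pi * x / P),
                                np.cos(2 * np.pi * x / P),
                                np.ones_like(x)])
    coeffs, *_ = np.linalg.lstsq(features, y, rcond=None)
    print("coefficients (sin, cos, constant):", np.round(coeffs, 2))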

Recovering prior knowledge

Can learning algorithms help design or improve themselves? That is, can they recover the important qualities of the domain and change themselves to exploit these qualities? How can they evaluate competing domain models? Do we need to answer the previous question first?

Note the similarity with compression of information. The first issue is how to design an encoder/decoder pair given knowledge of the information's distribution; this was solved by Huffman and arithmetic coding. The second issue is how to discover this distribution automatically; this is still unsolved in general, though Lempel-Ziv does a good job in practice. Many coding algorithms, especially image coders, unfortunately do not distinguish these two issues; the same is true of many learning algorithms. I think learning research should strive to recover the distinction and embrace probabilistic models as the representation of a domain.
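
To make the analogy concrete: once the distribution is known, the ideal code length for a symbol is -log2 of its probability, the bound that Huffman and arithmetic coding approach. A sketch with an assumed distribution:

    # Given the distribution, the ideal code length for symbol s is -log2 p(s).
    import math

    p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}   # assumed known distribution
    for s in p:
        print("%s: %.0f bits" % (s, -math.log2(p[s])))
    entropy = -sum(ps * math.log2(ps) for ps in p.values())
    print("average (entropy): %.2f bits per symbol" % entropy)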


How the reading group papers fit in:
Incomplete data
How to deal with incomplete examples during supervised learning. This arises in meta-learning and in vision (occlusion). An example of reducing regression to conditional density estimation, and that in turn to unconditional density estimation.

Model-based interpolation
Applying a surface or probability density once you've estimated it.

Family learning
An example of meta-learning. Meta-learning as a density estimation problem. Meta-learning via regularization. Reducing a surface to a probability density.

Canonical Distortion Measure
Analyzing the relationship between the parameters of a learning algorithm and the assumed distribution of functions (fundamental issue #1). How a distance metric can encode such a distribution. How to learn a distance metric directly, without a density estimation step (fundamental issue #2). How the hierarchical structure and sharing used in meta-learning is useful in base-level learning.

Style and Content
Applying a simple, general-purpose probabilistic model. Taxonomy of meta-learning tasks. Meta-learning by linear combination. Applying incomplete data algorithms to meta-learning. How to relax the same-distribution assumption.

Factorial learning
Different ways to do unsupervised learning. Why unsupervised learning is unconditional density estimation with latent variables, via a compression argument. The autoencoder approach to recovering latent variables. Representations of unconditional densities: undirected networks (the Boltzmann machine) and directed Bayesian networks.
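
As a small illustration of the latent-variable view, here is a linear stand-in for the autoencoder idea: find a low-dimensional code from which the data can be reconstructed. In the linear case the solution is available in closed form via the SVD; an actual autoencoder learns nonlinear encoder and decoder networks. The data are made up.

    # Recover a 2-d latent code from 5-d data by minimizing reconstruction error
    # (the linear case has a closed-form solution via the SVD).
    import numpy as np

    rng = np.random.default_rng(0)
    Z = rng.normal(size=(200, 2))                  # hidden causes
    X = Z @ rng.normal(size=(2, 5))                # observed 5-d data

    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    code = Xc @ Vt[:2].T                           # "encoder": project onto top 2 directions
    Xhat = code @ Vt[:2] + X.mean(axis=0)          # "decoder": reconstruct from the code
    print("mean squared reconstruction error: %.6f" % np.mean((X - Xhat) ** 2))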


Thomas Minka
Last modified: Fri Oct 3 12:42:39 EDT 1997