A Statistical Learning/Pattern Recognition Glossary
Welcome to my glossary. It is inspired by Brian Ripley's glossary in "Pattern
Recognition for Neural Networks" (and the need to save time explaining
things).
Fundamental concepts
- Statistical Learning/Pattern Recognition
- An approach to machine intelligence which is based on statistical
modeling of data. With a statistical model in hand, one applies
probability theory and decision theory to get an algorithm.
This is opposed to using training data merely to select among different
algorithms or using heuristics/"common sense" to design an algorithm.
- Features
- The measurements which represent the data.
The statistical model one uses is crucially dependent on the choice of
features. Hence it is useful to consider alternative representations of
the same measurements (i.e. different features). For example, different
representations of the color values in an image.
General techniques for finding new representations include discriminant
analysis, principal component analysis, and clustering.
- Classification
- Assigning a class to a measurement, or equivalently, identifying the
probabilistic source of a measurement. The only statistical model
that is needed is the conditional model of the class variable given
the measurement. This conditional model can be obtained from a joint
model or it can be learned directly. The former approach is
generative since it models the measurements in each class. It
is more work, but it can exploit more prior knowledge, needs less
data, is more modular, and can handle missing or corrupted data.
Methods include mixture models and Hidden Markov Models. The latter
approach is discriminative since it focuses only on
discriminating one class from another. It can be more efficient once
trained and requires fewer modeling assumptions. Methods include logistic
regression, generalized linear classifiers, and nearest-neighbor.
See "Discriminative vs Informative Learning".
- Regression
- Predicting the value of random variable y from measurement
x. For example, predicting engine efficiency based on
oil pressure.
Regression generalizes classification since y can be any
quantity, including a class index.
Many classification algorithms can be understood as thresholding the output
of a regression.
Like classification, the conditional model of y can be obtained
from a joint model (which includes a model of x) or learned
directly. Curve fitting is the common special case where
y is assumed to be a deterministic function of x,
plus additive noise (usually Gaussian). Methods for curve fitting include
radial basis functions, feed-forward neural networks, and mixtures of
experts.
- Nonparametric regression/density estimation
- An approach to regression/density estimation that requires little
prior knowledge but a large amount of data. For regression, it includes
nearest-neighbor, weighted average, and locally weighted regression. For
density estimation, it includes histograms, kernel smoothing, and
nearest-neighbor.
- Parameter Estimation
- Density estimation when the density is assumed to be in a
specific parametric family.
Special cases include maximum likelihood, maximum a posteriori,
unbiased estimation, and predictive estimation.
See the section on Parameter estimation techniques.
- Model selection
- Choosing the parametric family to use for density estimation.
This is harder than parameter estimation since you have to
take into account every member of each family in order to choose the best
family. Considering only the best member of each family is not sufficient
(one would tend to choose the biggest family).
See the section on Model selection techniques.
- Independence Diagram
- A graphical way of expressing the conditional independence
relationships among a set of random variables.
Such diagrams cannot encode every possible form of conditional independence,
but they go a long way toward this end. They are also called "Bayesian
networks."
See "Independence Diagrams",
A Brief Introduction to Graphical Models and Bayesian Networks,
Course Notes on Bayesian Networks,
and
Pearl.
- Active learning
- Determining the optimal measurements to make under a cost constraint.
A measurement is "optimal" when it is expected to give the most new information
about the parameters of a model.
Active learning is thus an application of decision theory to the process of
learning.
It is also known as experiment design.
See
"Employing
EM in Pool-Based Active Learning for Text Classification",
"Selective
sampling using the Query by Committee algorithm",
"Reinforcement
Learning: A Survey",
Box&Draper,
and
Raiffa&Schlaifer.
- Reinforcement learning
- Learning how to act optimally in a given environment, especially with
delayed and nondeterministic rewards. It is equivalent to adaptive
control. There are two interleaved tasks: modeling the environment and
making optimal decisions based on the model. The first task is a
statistical modeling problem and is handled using the techniques listed in
this glossary. The second task is a decision theory problem: converting
the expectation of delayed reward into an immediate action. Since
reinforcement learning requires exploration, it is often combined with
active learning, though this is not essential.
Most learning problems that humans face are reinforcement learning problems,
e.g. deciding which melon to buy, which coat to wear outside today, or which
friends to have.
See "Reinforcement Learning: A Survey" and
"Reinforcement Learning: A Tutorial".
- No free lunch
- The point that all statistical models are necessarily biased in one
way or another, and that no single bias is globally optimal.
Mitchell, and later Wolpert, emphasized this point in order to stop
useless comparisons between learning algorithms that were using different
priors (like Euclidean nearest neighbor vs. axis-parallel decision trees).
The real way to evaluate algorithms is how well they can utilize
prior knowledge given to them, i.e. how well they can approximate Bayesian
learning.
See
"The Need for Biases in Learning Generalizations" in Readings in Machine Learning,
"The lack of a priori distinctions between learning algorithms",
"Bayesian regression filters and the issue of priors" (the "issue of priors" part),
and Cross-validation.
- Discriminant Analysis
- Constructing new features via linear combination so that
classification is easier. The notion of "easier" is quantified by
Fisher's criterion, which applies exactly to Gaussian classes with
equal covariance and approximately to other models. Variants like
Flexible discriminant analysis consider nonlinear combinations
as well. See Duda&Hart,
"Eigenfaces
vs. Fisherfaces", and "Flexible
Discriminant and Mixture Models".
- Principal Component Analysis
- Constructing new features which are the principal components of a data
set. The principal components are random variables of maximal
variance constructed from linear combinations of the input features.
Equivalently, they are the projections onto the principal component
axes, which are lines that minimize the average squared distance
to each point in the data set. To ensure uniqueness, all of the
principal component axes must be orthogonal. PCA is a
maximum-likelihood technique for linear regression in the presence of
Gaussian noise on both inputs and outputs. In some cases,
PCA corresponds to a Fourier transform, such as the DCT used in JPEG
image compression. See "Eigenfaces for recognition"
(Turk&Pentland, J Cognitive Neuroscience 3(1), 1991),
Bishop,
"Probabilistic Principal Component Analysis",
and
"Automatic choice of dimensionality for PCA".
- Principal curve
- A nonlinear principal component axis. Principal curves are smooth
curves that minimize the average squared orthogonal distance to each
point in a data set. Fitting a principal curve is a
maximum-likelihood technique for nonlinear regression in the presence
of Gaussian noise on both inputs and outputs. Principal
points are individual points that minimize the average distance to
each point in a data set (they are the output of k-means). See the
Principal Curve Website.
- Factor analysis
- A generalization of PCA which is based explicitly on
maximum-likelihood. Like PCA, each data point is assumed to arise
from sampling a point in a subspace and then perturbing it with
full-dimensional Gaussian noise. The difference is that factor
analysis allows the noise to have an arbitrary diagonal covariance
matrix, while PCA assumes the noise is spherical. In addition to
estimating the subspace, factor analysis estimates the noise
covariance matrix. See
"The EM
Algorithm for Mixtures of Factor Analyzers".
- Independent Component Analysis
- Constructing new features which are the independent components of a
data set. The independent components are random variables of
minimum entropy constructed from linear combinations of the input features.
The entropy is normalized by the variance of the component, so absolute scale
doesn't matter.
It is a fact of information theory that such variables will be as
independent as possible. This feature extraction technique is closely
related to exploratory
projection pursuit, commonly used for visualization. See
"Survey on Independent Component Analysis",
"Fast and Robust Fixed-Point Algorithms for Independent Component Analysis" (with software),
"Independent Component Analysis: A flexible non-linearity and decorrelating manifold approach"
and the
ICA Web site.
- Clustering
- Grouping similar objects in a multidimensional space. It is useful
for constructing new features which are abstractions of the existing
features. Some algorithms, like k-means, simply partition the feature space. Other
algorithms, like single-link agglomeration, create nested
partitionings which form a taxonomy. Another possibility is to learn
a graph structure between the clusters, as in the Growing
Neural Gas. The quality of the clustering depends crucially on
the distance metric in the space. Most techniques are very sensitive
to irrelevant features, so they should be combined with feature
selection. See Duda&Hart,
and, for an interesting application, "Clustering
sequences with hidden Markov models".
- K-means
- A parametric algorithm for clustering data into exactly k
clusters. First, define some initial cluster parameters. Second, assign
data points to clusters. Third, recompute better cluster parameters, given
the data assignment. Iterate back to step two. It is related to
the Expectation-Maximization algorithm for fitting a mixture of Gaussians.
See "Hard and Soft Assignment Methods for Clustering", and
Duda&Hart.
- Jarvis&Patrick clustering
- A generalization of single-link agglomeration that can be more robust to noise.
Unlike k-means, it can find clusters of arbitrary shape.
See
on-line description.
- Feature selection
- Not extracting new features but rather removing features which seem
irrelevant for modeling. This is a combinatorial optimization problem.
The direct approach (the "wrapper" method) retrains and re-evaluates a
given model for many different feature sets.
An approximation (the "filter" method) instead optimizes simple criteria which
tend to improve performance.
The two simplest optimization methods are forward selection (keep
adding the best feature) and backward elimination (keep removing
the worst feature).
See
Statistical models
(Each has parameters which can be estimated by a variety of methods.)
- Linear regression
- A conditional statistical model of random vector y given
measurement vector x, where y is a linear
transformation of x followed by additive noise, typically
Gaussian. That is, y = Ax + e. The matrix A is the
parameter of the model. See
"Bayesian linear regression".
- Basis function regression
- Linear regression applied to a new feature set, i.e. a new basis.
If f_i(x) is basis function i, then the
conditional model is y = sum_i a_i f_i(x) + e.
The parameters of the model are the weights a_i.
Polynomial regression and spline regression are special cases.
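For example, polynomial regression is ordinary least squares on the basis
f_i(x) = x^i. A sketch (names are mine):

    import numpy as np

    def fit_poly(x, y, degree):
        F = np.vander(x, degree + 1)               # columns are the basis values f_i(x)
        a, *_ = np.linalg.lstsq(F, y, rcond=None)  # weights a_i by least squares
        return lambda xs: np.vander(xs, degree + 1) @ a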
- Gaussian process regression
- A non-parametric regression model which is equivalent to basis function
regression with a Gaussian prior density on the coefficients.
The basis functions are implicitly
determined by the covariance function of a Gaussian process.
The advantage of this formulation is that the number of basis
functions is potentially infinite, and they can be adapted, making it
competitive with feed-forward neural network regression.
See "Gaussian
processes: A replacement for supervised neural networks?", Rasmussen's Web site,
and
Cressie.
- Radial Basis Function regression
- Basis function regression where each new feature is based on the
distance to a prototype, hence the basis is "radial." The resulting curve
is a superposition of "bumps," one at each prototype. Such models are
motivated by regularization theory and Gaussian process theory. The basis
functions are adapted by moving the prototypes or reshaping the bumps. See
"Introduction to
Radial Basis Function Networks",
"Regularization
Theory and Neural Networks Architectures", and
"How do MLPs compare with RBFs?".
- Generalized linear model
- Linear regression except the conditional density of y is
an arbitrary function of Ax, i.e. there is not simply additive
noise. For example, a Poisson model for y, where the Poisson
rate is equal to Ax, is a generalized linear model. Logistic
regression is also a kind of generalized linear model. The attraction
of such models is that the parameter A can be learned in a very
similar way as for linear regression. See McCullagh&Nelder.
- Logistic regression
- A conditional statistical model of binary variable y
given measurement vector x. The probability that y
is 1 is given by the logistic function applied to a linear combination of
x. That is, p(y=1) = 1/(1+exp(-a*x)).
Logistic regression is a generalized linear model (and is really
logistic linear regression).
The row vector a is the parameter of the model.
Logistic regression gives rise to a linear classifier.
See a logistic regression overview and
"Why the
logistic function?".
- Gaussian process classifier
- A non-parametric logistic regression model analogous to Gaussian process
regression. It is equivalent to using basis functions in
logistic regression.
See
"Bayesian Classification with Gaussian Processes" and Rasmussen's Web site.
- Feed-forward neural network regression
- Basis function regression with adaptive basis functions. Given a
measurement vector, each layer of the network makes a linear transformation
and then applies a nonlinearity to each vector component. For example, a
two-layer network has y = A2*f(A1*x) + e. The matrix
A2 contains the basis coefficients while the matrix
A1 adapts the basis. The nonlinearity f is fixed and
is typically chosen to be a smoothed step function. The resulting basis
contains oriented step functions with varying slope. See
Bishop and
"How many hidden layers should I use?".
- Feed-forward neural network density model
- Neural network regression except the conditional density of
y is an arbitrary function of the network output, i.e. there is
not simply additive noise. An example is logistic network
regression. Follows the same principle as a generalized linear
model.
- Additive regression
- A multivariate regression expressed as the sum of univariate
regressions. That is, y is a sum of functions applied to
the components of x, plus additive noise: y = sum_i
f_i(x_i) + e. When the conditional density of y is
an arbitrary function of sum_i f_i(x_i), such as a
logistic additive regression, then you have a generalized
additive regression. The parameters of the model are the
functions f_i, which are learned recursively by another
regression model. These models are very useful for visualization
(also see projection pursuit regression). See Hastie&Tibshirani
and Backfitting.
- Projection pursuit regression
- Additive regression on an adaptive feature space. That is, y =
sum_i f_i(a_i*x) + e where the row vector a_i takes a linear
combination of x. The parameters of the model are the axes
a_i as well as the functions f_i, which are learned
recursively. See "Projection
Pursuit Regression" and "Generalized
Projection Pursuit Regression".
- Robust regression
- A regression model which includes the possibility of measurement outliers.
It also refers to parameter estimation that can handle outliers. (But
changing your estimator without changing your model is inconsistent
behavior.)
See Venables&Ripley,
Barnett&Lewis,
code and papers at Antwerp,
and an illustrative
applet.
- Independent Feature Model ("Naive Bayes")
- Any statistical model which assumes that the components of the
measurement vector, i.e. the features, are conditionally independent
given the class. Like additive regression, this assumption allows
each component to be modeled separately. Independent Component
Analysis/Projection Pursuit can be used to extract better features for
the Independent Feature Model. See Duda&Hart,
"A
Comparison of Event Models for Naive Bayes Text Classification",
and "Beyond
independence: Conditions for the optimality of the simple Bayesian
classifier".
- Linear classifier
- Any classifier with straight-line decision boundaries in feature space.
Such classifiers can arise from many different statistical models,
especially Gaussian class-conditional densities. They can be implemented
using a single-layer neural network or Perceptron. See
Duda&Hart,
Bishop.
- Generalized linear classifier
- Any classifier implemented by subjecting data to a nonlinear
transformation before feeding it to a linear classifier.
The transformation may increase or decrease the number of features.
The new features are like basis functions in basis function regression,
and the classifier is essentially thresholding a basis function regression.
It can have arbitrarily shaped decision boundaries.
A Radial Basis Function classifier is a special case where the
transformation is based on distances to a set of prototypes (the basis
functions are "bumps").
See Duda&Hart.
- Support Vector Machine
- A generalized linear classifier with a maximum-margin fitting
criterion. This fitting criterion provides regularization which
helps the classifier generalize better. The classifier tends to ignore
many of the features. For example, a Support Vector Machine can be used as
a regularized Radial Basis Function classifier. See
- Finite mixture model
- Any probability density expressed as the weighted sum of simpler
densities. The simpler densities are the components or states of the
mixture. An equivalent definition is hierarchical sampling: to sample
from the density, first draw a state at random (using a distribution
over states) and then sample from that component. Using Gaussian
components leads to a mixture of Gaussians. The distribution
over states can itself be a mixture, leading to a deeper hierarchy,
which is a hierarchical mixture model. Any sort of stochastic
decision tree is a hierarchical mixture model. See
David Dowe's website,
"Cluster-Based
Probability Model and Its Application to Image and Texture
Processing", Bishop,
and Expectation-Maximization.
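The hierarchical sampling definition, as a sketch for a 1-D mixture of
Gaussians (interface is mine; weights, means, stds are numpy arrays):

    import numpy as np

    def sample_mixture(weights, means, stds, n, seed=0):
        rng = np.random.default_rng(seed)
        states = rng.choice(len(weights), size=n, p=weights)  # draw a state per sample
        return rng.normal(means[states], stds[states])        # sample that component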
- Mixture of subspaces
- A mixture of Gaussians where each Gaussian models only a subset of the
features. The other features have a much smaller variance, producing
a flat, pancake-like density. In other words, the covariance matrix
of the Gaussian is nearly singular, reducing the number of parameters
to estimate. Each Gaussian applies some feature extraction technique
like PCA to determine the features to use. It is thus a combination
of clustering and feature extraction. See
"The EM
Algorithm for Mixtures of Factor Analyzers" (with software),
"Mixtures of Principal Component Analyzers",
"Mixtures
of local linear subspaces for face recognition", and "Modeling the Manifolds of
Images of Handwritten Digits".
- Coupled mixture model
- A joint statistical model for a set of variables, where each variable
has a mixture distribution. The mixtures are coupled by a joint
distribution over states. For example, you could constrain two
mixtures to never be in the same state. To sample the variables, you
first sample a state for each variable (using the joint distribution
over states) and then sample each variable from its chosen component.
Coupled mixture models include Hidden Markov Models and Hidden Markov
random fields.
- Markov chain
- Any multivariate probability density whose independence diagram is a
chain. In other words, the variables are ordered, and each variable
"depends" only on its neighbors in the sense of being conditionally
independent of the others. An equivalent definition is that you sample
the variables left-to-right, conditional only on the last outcome.
- Autoregression
- A Markov chain model for a sequence of variables, where the next
variable is predicted from the previous variable via regression, typically
linear.
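A sketch of sampling a first-order linear autoregression (the coefficient and
noise scale are illustrative):

    import numpy as np

    def sample_ar1(a, noise_std, n, seed=0):
        # y[t] = a * y[t-1] + e: a Markov chain over continuous values
        rng = np.random.default_rng(seed)
        y = np.zeros(n)
        for t in range(1, n):
            y[t] = a * y[t - 1] + rng.normal(scale=noise_std)
        return y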
- Hidden Markov Model
- A joint statistical model for an ordered sequence of variables. It is
the result of stochastically perturbing the variables in a Markov
chain (the original variables are thus "hidden"). The Markov chain
has discrete variables which select the "state" of the HMM at each
step. The perturbed values can be continuous and are the "outputs" of
the HMM. A Hidden Markov Model is equivalently a coupled mixture model
where the joint distribution over states is a Markov chain.
If you made a scatterplot of the data, ignoring the ordering, it would
look like a finite mixture. Only when you consider the ordering will
you see that the progression of states has Markov structure: the next
state depends on the previous state. See
- Autoregressive Hidden Markov Model
- A Hidden Markov
Model except that each output depends not only on the state but also
the previous output, usually via a regression model. This extension
can be implemented with surprisingly little extra work. (This differs
from Rabiner's terminology. Rabiner's "autoregressive HMM" refers to
a particular choice of emission density, and would now be called a
crude Hidden Markov Segment Model.) See "Hidden
Markov models using shared and global vector linear predictors",
"ML estimation of a stochastic linear system with the EM
algorithm and its application to speech recognition"
(Digalakis, Rohlicek, & Ostendorf, IEEE Trans Speech and Audio
1(4), 1993), and "The
Double Chain Markov Model".
- Input/Output Hidden Markov Model
- A Hidden Markov Model which is controlled by an input sequence.
The input sequence can influence the choice of state at each step
or it can influence the choice of output at each step.
See
"Markovian Models for Sequential Data"
and
"Input/Output HMMs for Sequence Processing".
- Factorial Hidden Markov Model
- The result of summing the outputs of several Hidden Markov Models.
More generally, any arithmetic operation could be used, as long as it only
applies to the outputs at a particular time.
It is equivalent to one large HMM with a factored state distribution:
the state is made up of independent parts.
Factoring the state distribution is a general approach to reducing the
number of parameters in a hidden variable model.
Summing the outputs of ordinary mixture models gives a factorial
mixture model.
See "Factorial
Learning and the EM Algorithm" and "Factorial
Hidden Markov Models" (software).
- Hidden Markov Decision Tree
- A Hidden Markov Model where the distribution over states is
hierarchical, as in a Hierarchical Mixture Model.
The hierarchical path chosen for the next state depends on the
hierarchical path for the previous state.
In other words, the outcome at a particular node in the hierarchy is
governed by a Markov chain.
Thus it is both a Factorial Hidden Markov Model and a Coupled Hidden
Markov Model.
See "Hidden Markov
Decision Trees".
- Switching Hidden Markov Model
- The result of switching among the outputs of several Hidden Markov Models.
That is, only one of the outputs is measured at any particular time.
The switching is controlled by a separate Markov chain.
See "Variational Learning for Switching State-Space Models".
- Hidden Markov Model with Duration
- A Hidden Markov Model where the next state depends not just on the
previous state, but on how long you have been in that state.
Equivalently, it is an HMM with a nonstationary state transition
matrix that changes while you remain in the same state and resets
when you leave the state. See
Rabiner&Juang
(Ch.6).
- Hidden Markov Segment Model
- A Hidden Markov Model where the outputs are not simply vectors but
sequences of arbitrary length ("segments"). The length of a segment
and the variables in a segment can have an arbitrary probability
distribution conditional on the state. In particular, the variables
can be correlated. For example, the segment model can itself be an
HMM, giving a Hierarchical
Hidden Markov Model. The segment model can be a Gaussian process
or (more specifically) a Linear Dynamical System. A Hidden Markov
Model with Duration is a special case. A common refinement is that
each segment influences the next segment, e.g. to provide continuity;
this is the analogue of an autoregressive HMM. See
- Linear Dynamical System
- A time-series model closely related to the Hidden Markov Model. It
generalizes the HMM in some ways and is more restricted in other ways. The
generalization is that there is a continuum of possible hidden states; it
is an infinite mixture.
The restrictions are (1) the state transitions are linear (plus Gaussian
noise) and (2) the observations are linear in the state value (plus
Gaussian noise). The Linear Dynamical System generalizes linear
autoregression in the same way that Hidden Markov Models generalize Markov
models. Hence it might be called hidden autoregression.
All variables in the sequence are jointly Gaussian: it is a special kind of
Gaussian process. See "From Hidden Markov Models to Linear Dynamical Systems",
Gelb (Ch.4),
and Kalman filtering.
- General dynamical system
- A general time-series model based on stochastically perturbing the
variables in a Markov chain. The Markov chain may be continuous, the
state transitions may be nonlinear with non-Gaussian noise, and the
observations may be a nonlinear function of the state plus
non-Gaussian noise. In such models, exact Bayesian inference is not
feasible. Instead we use special approximation techniques which
operate sequentially in time, such as extended Kalman filtering and Monte
Carlo filtering. Monte Carlo filtering uses Monte Carlo integration to solve the
necessary integrals at each step. See
- Mixture of Experts
- A finite mixture model for random variable y
where all the components and the distribution over components are
conditional on measurement x.
The mixture components are the "experts" and are typically
linear regressions.
A piecewise linear
regression is a special case where x selects a unique expert.
The distribution over experts can be hierarchical, as in a hierarchical
mixture model, giving a Hierarchical Mixture of Experts.
The measurement x influences the hierarchical path,
as in a decision tree.
See "Learning in
modular and hierarchical systems",
"A
statistical approach to decision tree modeling",
"Hierarchical
mixtures of experts and the EM algorithm", and
Steve Waterhouse's papers.
- Decision tree
- The deterministic ancestor of the hierarchical mixture of experts.
Used widely in Machine Learning, where most of the emphasis is on
structuring the tree. See Murthy's
survey, Ripley,
and web site.
- Markov random field
- A set of random variables whose independence relationships are
represented by a neighborhood structure, i.e. an undirected graph.
The Markov property asserts conditional independence: given its
immediate neighbors in the graph, a variable is independent of all
other variables. This property is particularly useful for specifying
the conditional distribution of a single variable, making the
representation well suited to Gibbs sampling. Specifying all of the
conditional distributions completely determines the joint distribution
of the variables. This is a very general model, subsuming Markov
chains. The color values in an image or the temperature across an
object can be modeled as Markov random fields. When the variables are
binary, this model is a Boltzmann
machine. See Cressie,
Li,
Hertz (ch 7),
Kindermann,
and Relaxation
labeling.
- Simultaneous autoregression
- A Markov random field where each
variable is predicted from its neighbors via regression, typically linear.
It generalizes autoregression from Markov chains to Markov random fields.
See Cressie.
- Hidden Markov random field
- The result of stochastically perturbing the variables in a Markov
random field (the original variables are thus "hidden"). The result
is technically still a Markov random field, but some variables are
never observed. It is equivalently a coupled mixture model where the
coupling distribution is a Markov random field. This generalizes a
Hidden Markov Model by including more dependencies between the hidden
states. A noisy image or depth map can be modeled as a Hidden Markov
random field. See the references on Markov random fields.
- Constrained mixture model
- A finite mixture model with a restrictive prior distribution on the
possible parameters of the components. Most importantly, the
component parameters are statistically dependent. For example, a mixture
of Gaussians where the means must form a circle or grid is a constrained
mixture. Constrained mixtures are useful for approximating principal
curves as well as for constrained clustering. Known historically as
a self-organizing map. See
- Constrained Hidden Markov Model
- A Hidden Markov Model with a restrictive prior distribution on the
possible parameters of the output densities, which makes them statistically
dependent. Equivalently, the Markov generalization of a constrained mixture.
Has its origins in speaker adaptation for speech recognition.
See
- Coupled Hidden Markov Model
- A set of Hidden Markov Models whose Markov chains influence each
other. There are two orthogonal ways to produce coupling. One is to
couple the simultaneous states of the Markov chains, e.g. constraining
them to always be in different states at any particular time. Another
way is to allow the next state of a chain to depend on the previous
state of another chain. Unlike a Factorial Hidden Markov Model, the
outputs of the Markov chains do not interact.
A Hidden Markov Decision Tree is both coupled and factorial.
See "Coupled
hidden Markov models for modeling interacting processes".
- Stochastic context-free grammar
- A joint statistical model for an arbitrarily-long sequence of
variables which generalizes a Markov chain.
Each sequence is generated, starting from a single symbol, by stochastically
applying the productions of a context-free grammar, until no more
productions apply. Like a Markov chain, the most likely
productions are easy to find for a given sequence,
yet the sequence can have long-range correlations.
See
- Stochastic program
- A joint statistical model expressed as an explicit sampling procedure.
Stochastic grammars are a special case.
See "Effective
Bayesian Inference for Stochastic Programs",
"Bellman Equations for Stochastic Programs",
"Nondeterministic Lisp as a Substrate for Constraint Logic Programming",
SCREAMER
(software),
and the modeling language of BUGS.
- Maximum likelihood
- A parameter estimation heuristic that seeks parameter values that
maximize the likelihood function for the parameter. This ignores the
prior distribution and so is inconsistent with Bayesian probability
theory, but it works reasonably well. See "Pathologies of Orthodox Statistics".
- Maximum A Posteriori
- A parameter estimation heuristic that seeks parameter values that
maximize the posterior density of the parameter. This implies that
you have a prior distribution over the parameter, i.e. that you are
Bayesian. MAP estimation has the highest chance of getting the
parameter exactly right. But for predicting future data, it can be
worse than Maximum Likelihood; predictive estimation is a better
Bayesian method for that purpose.
MAP is also not invariant to reparameterization; see MacKay's
Bayesian methods FAQ.
- Unbiased estimation
- A parameter estimation procedure based on finding an estimator function
that minimizes average error. When the average error is zero then the
estimator is "unbiased." The error of the function is averaged over
possible data sets, including ones you never observed. The best function
is then used to get parameter values. See "Pathologies of Orthodox Statistics".
- Predictive estimation
- Parameter estimation consistent with Bayesian probability theory.
It seeks to minimize the expected "divergence"
between the estimated distribution and the true distribution. The
divergence is measured by Kullback and Leibler's formula. The distribution
which achieves minimum divergence corresponds to integrating out
the unknown parameter. Hence predictive estimation can be approximated
by averaging over several different parameter choices.
See
"Inferring
a Gaussian distribution",
"A Comparison of
Scientific and Engineering Criteria for Bayesian Model Selection",
Geisser, and
Bishop.
- Minimum Message Length
- A parameter estimation technique similar to predictive estimation but
motivated by information theory. Consider compressing the data via a
two-part code: the first part is a parameter setting, encoded with
respect to the prior, and the second part is the data, encoded with
respect to the model with that parameter. Parameters are continuous,
and so cannot be encoded exactly---they must be quantized, which
introduces error. So we can't choose the parameters which simply
compress the data most; we have to choose parameters which compress
the data well even if the parameters are slightly modified.
The parameter setting which balances this tradeoff between accuracy
and robustness is the MML estimate.
See
Some related methods:
- Bootstrapping
- A technique for simulating new data sets, to assess the robustness
of a model or to produce a set of likely models.
The new data sets are created by re-sampling with replacement from the
original training set, so each datum may occur more than once.
See "What are cross-validation and bootstrapping?" and
"The Bootstrap is Inconsistent with Probability Theory".
- Bagging
- Bootstrap averaging. Generate a bunch of models via
bootstrapping and then average their predictions.
See "Bagging
Predictors",
"Why does bagging work?", and
"Bayesian model averaging is not model combination".
- Monte Carlo integration
- A technique for approximating integrals in Bayesian inference.
To approximate the integral of a function over a domain D,
generate samples from a uniform distribution over D and
average the value of the function at those samples. More generally,
we can use a non-uniform proposal distribution, as long as we weight
samples accordingly. This is known as importance sampling
(which is an integration method, not a sampling method). For
Bayesian estimation, a popular approach is to sample from the
posterior distribution, even though it is usually not the most efficient
proposal distribution. Gibbs sampling is typically used to generate the
samples. Gibbs sampling employs a succession of univariate
samples (a Markov Chain) to generate an approximate sample from a
multivariate density.
See "Introduction
to Monte Carlo methods", "Probabilistic
Inference using Markov Chain Monte Carlo Methods", and the Markov Chain Monte Carlo home
page. Software includes BUGS and FBM.
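A sketch of importance sampling as described above (all arguments are
callables supplied by the user):

    import numpy as np

    def importance_sample(f, target_pdf, draw_proposal, proposal_pdf, n=10000):
        # approximates the integral of f(x) * target_pdf(x) dx
        x = draw_proposal(n)                 # samples from the proposal distribution
        w = target_pdf(x) / proposal_pdf(x)  # importance weights
        return np.mean(w * f(x))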
- Regularization
- Any estimation technique designed to impose a prior assumption of
"smoothness" on the fitted function.
See "Regularization
Theory and Neural Networks Architectures".
- Expectation-Maximization (EM)
- An optimization algorithm based on iteratively maximizing a lower
bound. Commonly used for maximum likelihood or maximum a posteriori
estimation, especially fitting a mixture of Gaussians.
See
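A sketch of EM for a 1-D mixture of Gaussians (the initialization and
iteration count are arbitrary choices of mine):

    import numpy as np

    def em_mixture(x, k, iters=50, seed=0):
        rng = np.random.default_rng(seed)
        w = np.full(k, 1.0 / k)               # mixing weights
        mu = rng.choice(x, k, replace=False)  # component means
        sd = np.full(k, x.std())              # component standard deviations
        for _ in range(iters):
            # E-step: responsibilities, i.e. posterior probability of each state
            r = w * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / sd
            r /= r.sum(axis=1, keepdims=True)
            # M-step: maximize the lower bound given the responsibilities
            nk = r.sum(axis=0)
            w = nk / len(x)
            mu = (r * x[:, None]).sum(axis=0) / nk
            sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        return w, mu, sd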
- Variational bound optimization
- A catch-all term for variations on the EM algorithm which use
alternative lower bounds (usually simpler ones). The particular lower
bound used by EM can lead to an intractable E-step. With a looser
bound, the iterative update is more tractable, at the cost of
increasing the number of iterations. Another approach, though less
often used, is to use a tighter bound, for faster convergence but a
more expensive update. See "An
introduction to variational methods for graphical models", "Notes on
variational learning",
"Exploiting
tractable substructures in intractable networks".
- Variational bound integration
- To approximate the integral of a function, lower bound the function
and then integrate the lower bound. Not to be confused with Jensen bound integration.
Variational Bayes applies
this technique to the likelihood function for integrating out
parameters. The EM bound can be used for this, or any of the simpler
bounds used for variational bound optimization. See
- Jensen bound integration
- To approximate the integral of a function, apply Jensen's inequality
to turn the integral into a product which lower-bounds the integral.
The bound has free parameters which are chosen to make it as tight as
possible. Unlike variational bound optimization, the integrand
itself does not need to be bounded, and very different answers can result
from the two methods.
See
- Expectation Propagation
- To approximate the integral of a function, approximate each factor by
sequential moment-matching. For dynamic systems, it generalizes
Iterative Extended Kalman filtering. For Markov nets, it generalizes
belief propagation. See A roadmap to research on EP.
- Newton-Raphson
- A method for function optimization which iteratively maximizes a local
quadratic approximation to the objective function (not necessarily a
lower bound as in Expectation-Maximization). If the local
approximation is not quadratic, we have a generalized Newton
method. See
"Beyond Newton's method".
- Iteratively Reweighted Least Squares
- A method for maximum likelihood estimation of a generalized linear
model. It is equivalent to Newton-Raphson optimization.
See McCullagh&Nelder.
- Back-propagation
- A method for maximum likelihood estimation of a feed-forward neural
network. It is equivalent to steepest-descent optimization. See
Bishop.
- Backfitting
- A method for maximum likelihood estimation of a generalized additive regression.
You iteratively optimize each f_i while holding the others fixed.
It is equivalent to the Gauss-Seidel method in numerical linear algebra.
See Hastie&Tibshirani
and "Bayesian
backfitting".
- Kalman filtering
- An algorithm for inferring the next state or next observation of a Linear Dynamical
System. By making the state a constant, it can also be used for
incrementally building up a maximum-likelihood estimate of a
parameter. See
"An Introduction to the Kalman Filter" (with links),
"Dynamic Linear
Models, Recursive Least Squares and Steepest Descent Learning",
"From Hidden Markov Models to Linear Dynamical Systems",
and Gelb
(Ch.4).
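A sketch of one predict/update step, in my notation: state transition
x' = A x + q with noise covariance Q, observation y = C x + r with noise
covariance R:

    import numpy as np

    def kalman_step(m, P, y, A, C, Q, R):
        # m, P: current state mean and covariance
        m_pred = A @ m                        # predict the next state
        P_pred = A @ P @ A.T + Q
        S = C @ P_pred @ C.T + R              # innovation covariance
        K = P_pred @ C.T @ np.linalg.inv(S)   # Kalman gain
        m_new = m_pred + K @ (y - C @ m_pred) # correct with the observation
        P_new = P_pred - K @ C @ P_pred
        return m_new, P_new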
- Extended Kalman filtering
- Kalman filtering applied to general dynamical systems with Gaussian noise.
At each step, the
dynamical system is approximated with a linear dynamical system, to
which the Kalman filter is applied. The linear approximation can be
iteratively refined to improve the accuracy of the Kalman filter
output. Despite the name, extended Kalman filtering is not really
different from Kalman filtering. See Gelb.
- Relaxation labeling
- An optimization algorithm for finding the most probable configuration of
a Markov random field.
It generalizes the Viterbi algorithm for Markov chains.
Other approaches to this problem include Iterated Conditional Modes,
simulated annealing, network flow, and variational lower bounds.
See "Foundations of Relaxation Labeling Processes"
(Hummel and Zucker; appears in Readings in Computer Vision),
"Self Annealing: Unifying deterministic annealing and relaxation labeling",
"Probabilistic relaxation",
and Li.
- Deterministic annealing
- An optimization technique where the true objective function is
morphed into a convex function by a continuous convexity parameter.
Start by solving the convex problem and gradually morph to the true objective
while iteratively recomputing the optimum.
It is called "graduated nonconvexity" in statistical physics, where
the convexity parameter often corresponds to temperature.
See
- Boosting
- A technique for combining models based on adaptive resampling: different
data is given to different models. The idea is to successively omit
the "easy" data points, which are well modeled, so that the later
models focus on the "hard" data.
See
Schapire's page,
"Additive
Logistic Regression: a Statistical View of Boosting",
"Prediction Games and Arcing Algorithms", and
"Half&Half Bagging and Hard Boundary Points".
- Empirical Risk Minimization
- A parameter estimation heuristic that seeks parameter values that
minimize the "risk" or "loss" that the model incurs on the training
data. In classification, a "loss" usually means an error, so it
corresponds to choosing the model with lowest training error. In
regression, "loss" usually means squared error, so ERM corresponds to
choosing the curve with lowest squared error on the training data.
It is thus the most basic (and naive) estimation heuristic.
This method only uses a loss function appropriate for the problem
and does not utilize a probabilistic model for the data.
See "Empirical Risk Minimization is an incomplete inductive principle".
- Cross-validation
- A method for evaluating a statistical model or algorithm that has free
parameters. Divide the training data into several parts, and in turn use
one part to test the procedure fitted to the remaining parts.
It can be used for model selection or for parameter estimation when
there are many parameters.
Jackknifing is a similar, but slightly different, technique.
See
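A sketch of k-fold cross-validation, where fit and score are user-supplied
callables (a hypothetical interface):

    import numpy as np

    def cross_validate(fit, score, X, y, folds=5):
        idx = np.arange(len(X))
        scores = []
        for part in np.array_split(idx, folds):   # each part is held out in turn
            train = np.setdiff1d(idx, part)
            model = fit(X[train], y[train])       # fit on the remaining parts
            scores.append(score(model, X[part], y[part]))
        return np.mean(scores)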
- Bayesian model selection
- Selecting the model which assigns the highest probability to the data
after all parameters have been integrated out.
See my research demo page.
Also see "Bayesian
Interpolation",
"Automatic choice of dimensionality for PCA",
"Hyperparameter Selection for Self-Organizing Maps",
and
Heckerman's papers.
- Minimum Message Length model selection
- The
Minimum Message Length
heuristic
for parameter estimation can be adapted for model selection as well.
See "MML Linear
Regression" and "Finding Overlapping Components with MML".
- Structural Risk Minimization
- Selecting the model whose expected future risk is minimal, assuming
the parameters of the model will be chosen according to Empirical Risk
Minimization.
See
Vapnik and
"An Experimental and Theoretical Comparison of Model Selection Methods"
- Other model selection criteria
- See
modelselection.org.
Nonparametric modeling
- Nearest-neighbor density estimation
- A technique for nonparametric density estimation.
The density at any point is inversely proportional to the volume of the
smallest sphere containing the k nearest data points. For the conditional density of y
given x, take the k nearest data points to
x and use them to recursively compute the density of
y (using any density estimator).
See Duda&Hart.
- Nearest-neighbor classification
- Nearest-neighbor density estimation applied to predicting the class of
an object. To classify measurement x, take the
k nearest training measurements and choose
the most popular class among them.
The quality of this method depends crucially on the distance metric.
This method is very sensitive to irrelevant features, so it is usually
combined with feature selection.
See Dasarathy,
"Towards a Better
Understanding of Memory-Based Reasoning Systems",
and "Flexible metric nearest-neighbor classification".
- Nearest-neighbor regression
- A technique for nonparametric regression.
Take the k nearest data points in x
and use them to recursively perform a regression on y (using any
regression algorithm).
See the local
learning Web site.
- Kernel density estimation
- A technique for nonparametric density estimation.
The density is given by centering a kernel function, e.g. a Gaussian bell
curve, on each data point and then adding the functions together.
The quality of the estimate depends crucially on the kernel function.
Also known as Parzen-window density estimation.
For the conditional density of y
given x, weight the data by distance to
x and use the weights to recursively compute the density of
y (using any density estimator).
If y is a class variable, then the result is a kernel
classifier.
See
Duda&Hart.
- Locally weighted regression
- A technique for nonparametric regression, similar to kernel density
estimation. To predict y from x, weight the
data by distance to x and compute a weighted linear
regression. The resulting curve can be nonlinear. This is also
called Moving Least Squares. Kernel regression is
similar but uses only a weighted average, not a linear regression, and
tends to chop off extrema of the function. See the local
learning Web site.
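A 1-D sketch: each prediction solves a small weighted least-squares problem
(the Gaussian weight function is one common choice):

    import numpy as np

    def loess_point(x0, x, y, h):
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)    # weight data by distance to x0
        F = np.column_stack([np.ones_like(x), x]) # intercept and slope
        W = np.diag(w)
        beta = np.linalg.solve(F.T @ W @ F, F.T @ W @ y)
        return beta[0] + beta[1] * x0             # the local line, evaluated at x0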
Bibliography
R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis.
New York: Wiley. 1973.
C. Bishop. Neural Networks for Pattern Recognition.
Oxford: Clarendon Press. 1995.
N.A.C. Cressie. Statistics for Spatial Data.
New York: Wiley. 1993.
S.Z. Li. Markov Random Field Modeling in Computer Vision.
Springer-Verlag. 1995.
Thomas P Minka