# A Statistical Learning/Pattern Recognition Glossary

### by Thomas Minka

Welcome to my glossary. It is inspired by Brian Ripley's glossary in "Pattern Recognition for Neural Networks" (and the need to save time explaining things).

## Fundamental concepts

Statistical Learning/Pattern Recognition
An approach to machine intelligence which is based on statistical modeling of data. With a statistical model in hand, one applies probability theory and decision theory to get an algorithm. This is opposed to using training data merely to select among different algorithms or using heuristics/"common sense" to design an algorithm.

Features
The measurements which represent the data. The statistical model one uses is crucially dependent on the choice of features. Hence it is useful to consider alternative representations of the same measurements (i.e. different features). For example, different representations of the color values in an image. General techniques for finding new representations include discriminant analysis, principal component analysis, and clustering.

Classification
Assigning a class to a measurement, or equivalently, identifying the probabilistic source of a measurement. The only statistical model that is needed is the conditional model of the class variable given the measurement. This conditional model can be obtained from a joint model or it can be learned directly. The former approach is generative since it models the measurements in each class. It is more work, but it can exploit more prior knowledge, needs less data, is more modular, and can handle missing or corrupted data. Methods include mixture models and Hidden Markov Models. The latter approach is discriminative since it focuses only on discriminating one class from another. It can be more efficient once trained and requires fewer modeling assumptions. Methods include logistic regression, generalized linear classifiers, and nearest-neighbor. See "Discriminative vs Informative Learning".

Regression
Predicting the value of random variable y from measurement x. For example, predicting engine efficiency based on oil pressure. Regression generalizes classification since y can be any quantity, including a class index. Many classification algorithms can be understood as thresholding the output of a regression. Like classification, one can obtain the conditional model of y from a joint model (which includes a model of x) or it can be learned directly. Curve fitting is the common special case where y is assumed to be a deterministic function of x, plus additive noise (usually Gaussian). Methods for curve fitting include radial basis functions, feed-forward neural networks, and mixtures of experts.

Nonparametric regression/density estimation
An approach to regression/density estimation that doesn't require much prior knowledge but only a large amount of data. For regression, it includes nearest-neighbor, weighted average, and locally weighted regression. For density estimation, it includes histograms, kernel smoothing, and nearest-neighbor.

Parameter Estimation
Density estimation when the density is assumed to be in a specific parametric family. Special cases include maximum likelihood, maximum a posteriori, unbiased estimation, and predictive estimation. See the section on Parameter estimation techniques.

Model selection
Choosing the parametric family to use for density estimation. This is harder than parameter estimation since you have to take into account every member of each family in order to choose the best family. Considering only the best member of each family is not sufficient (one would tend to choose the biggest family). See the section on Model selection techniques.

Independence Diagram
A graphical way of expressing the conditional independence relationships among a set of random variables. They cannot encode every possible form of conditional independence but they go a long way toward this end. They are also called "Bayesian networks." See "Independence Diagrams", A Brief Introduction to Graphical Models and Bayesian Networks, Course Notes on Bayesian Networks, and Pearl.

Active learning
Determining the optimal measurements to make under a cost constraint. A measurement is "optimal" when it is expected to give the most new information about the parameters of a model. Active learning is thus an application of decision theory to the process of learning. It is also known as experiment design. See "Employing EM in Pool-Based Active Learning for Text Classification", "Selective sampling using the Query by Committee algorithm", "Reinforcement Learning: A Survey", Box&Draper, and Raiffa&Schlaifer.

Reinforcement learning
Learning how to act optimally in a given environment, especially with delayed and nondeterministic rewards. It is equivalent to adaptive control. There are two interleaved tasks: modeling the environment and making optimal decisions based on the model. The first task is a statistical modeling problem and is handled using the techniques listed in this glossary. The second task is a decision theory problem: converting the expectation of delayed reward into an immediate action. Since reinforcement learning requires exploration, it is often combined with active learning, though this is not essential. Most learning problems that humans face are reinforcement learning problems, e.g. deciding which melon to buy, which coat to wear outside today, or which friends to have. See "Reinforcement Learning: A Survey" and "Reinforcement Learning: A Tutorial".

No free lunch
The point that all statistical models are necessarily biased in one way or another, and that no single bias is globally optimal. Mitchell, and later Wolpert, emphasized this point in order to stop useless comparisons between learning algorithms that were using different priors (like Euclidean nearest neighbor vs. axis-parallel decision trees). The real way to evaluate algorithms is how well they can utilize prior knowledge given to them, i.e. how well they can approximate Bayesian learning. See "The Need for Biases in Learning Generalizations" in Readings in Machine Learning, "The lack of a priori distinctions between learning algorithms", "Bayesian regression filters and the issue of priors" (the "issue of priors" part), and Cross-validation.

## Feature extraction techniques

Discriminant Analysis
Constructing new features via linear combination so that classification is easier. The notion of "easier" is quantified by Fisher's criterion, which applies exactly to Gaussian classes with equal variance and approximately to other models. Variants like Flexible discriminant analysis consider nonlinear combinations as well. See Duda&Hart, "Eigenfaces vs. Fisherfaces", and "Flexible Discriminant and Mixture Models".

Principal Component Analysis
Constructing new features which are the principal components of a data set. The principal components are random variables of maximal variance constructed from linear combinations of the input features. Equivalently, they are the projections onto the principal component axes, which are lines that minimize the average squared distance to each point in the data set. To ensure uniqueness, all of the principal component axes must be orthogonal. PCA is a maximum-likelihood technique for linear regression in the presence of Gaussian noise on both inputs and outputs. In some cases, PCA corresponds to a Fourier transform, such as the DCT used in JPEG image compression. See "Eigenfaces for recognition" (Turk&Pentland, J Cognitive Neuroscience 3(1), 1991), Bishop, "Probabilistic Principal Component Analysis", and "Automatic choice of dimensionality for PCA".

Principal curve
A nonlinear principal component axis. Principal curves are smooth curves that minimize the average squared orthogonal distance to each point in a data set. Fitting a principal curve is a maximum-likelihood technique for nonlinear regression in the presence of Gaussian noise on both inputs and outputs. Principal points are individual points that minimize the average distance to each point in a data set (they are the output of k-means). See the Principal Curve Website.

Factor analysis
A generalization of PCA which is based explicitly on maximum-likelihood. Like PCA, each data point is assumed to arise from sampling a point in a subspace and then perturbing it with full-dimensional Gaussian noise. The difference is that factor analysis allows the noise to have an arbitrary diagonal covariance matrix, while PCA assumes the noise is spherical. In addition to estimating the subspace, factor analysis estimates the noise covariance matrix. See "The EM Algorithm for Mixtures of Factor Analyzers".

Independent Component Analysis
Constructing new features which are the independent components of a data set. The independent components are random variables of minimum entropy constructed from linear combinations of the input features. The entropy is normalized by the variance of the component, so absolute scale doesn't matter. It is a fact of information theory that such variables will be as independent as possible. This feature extraction technique is closely related to exploratory projection pursuit, commonly used for visualization. See "Survey on Independent Component Analysis", "Fast and Robust Fixed-Point Algorithms for Independent Component Analysis" (with software), "Independent Component Analysis: A flexible non-linearity and decorrelating manifold approach" and the ICA Web site.

Clustering
Grouping similar objects in a multidimensional space. It is useful for constructing new features which are abstractions of the existing features. Some algorithms, like k-means, simply partition the feature space. Other algorithms, like single-link agglomeration, create nested partitionings which form a taxonomy. Another possibility is to learn a graph structure between the clusters, as in the Growing Neural Gas. The quality of the clustering depends crucially on the distance metric in the space. Most techniques are very sensitive to irrelevant features, so they should be combined with feature selection. See Duda&Hart, and, for an interesting application, "Clustering sequences with hidden Markov models".

K-means
A parametric algorithm for clustering data into exactly k clusters. First, define some initial cluster parameters. Second, assign data points to clusters. Third, recompute better cluster parameters, given the data assignment. Iterate back to step two. It is related to the Expectation-Maximization algorithm for fitting a mixture of Gaussians. See "Hard and Soft Assignment Methods for Clustering", and Duda&Hart.

Jarvis&Patrick clustering
A generalization of single-link agglomeration that can be more robust to noise. Unlike k-means, it can find clusters of arbitrary shape. See on-line description.

Feature selection
Not extracting new features but rather removing features which seem irrelevant for modeling. This is a combinatorial optimization problem. The direct approach (the "wrapper" method) retrains and re-evaluates a given model for many different feature sets. An approximation (the "filter" method) instead optimizes simple criteria which tend to improve performance. The two simplest optimization methods are forward selection (keep adding the best feature) and backward elimination (keep removing the worst feature). See

## Statistical models

(Each has parameters which can be estimated by a variety of methods.)

Linear regression
A conditional statistical model of random vector y given measurement vector x, where y is a linear transformation of x followed by additive noise, typically Gaussian. That is, y = Ax + e. The matrix A is the parameter of the model. See "Bayesian linear regression".

Basis function regression
Linear regression applied to a new feature set, i.e. a new basis. If f_i(x) is basis function i, then the conditional model is y = sum_i a_i f_i(x) + e. The parameters of the model are the weights a_i. Polynomial regression and spline regression are special cases.

Gaussian process regression
A non-parametric regression model which is equivalent to basis function regression with a Gaussian prior density on the coefficients. The basis functions are implicitly determined by the covariance function of a Gaussian process. The advantage of this formulation is that the number of basis functions is potentially infinite, and they can be adapted, making it competitive with feed-forward neural network regression. See "Gaussian processes: A replacement for supervised neural networks?", Rasmussen's Web site, and Cressie.

Basis function regression where each new feature is based on the distance to a prototype, hence the basis is "radial." The resulting curve is a superposition of "bumps," one at each prototype. Such models are motivated by regularization theory and Gaussian process theory. The basis functions are adapted by moving the prototypes or reshaping the bumps. See "Introduction to Radial Basis Function Networks", "Regularization Theory and Neural Networks Architectures", and "How do MLPs compare with RBFs?".

Generalized linear model
Linear regression except the conditional density of y is an arbitrary function of Ax, i.e. there is not simply additive noise. For example, a Poisson model for y, where the Poisson rate is equal to Ax, is a generalized linear model. Logistic regression is also a kind of generalized linear model. The attraction of such models is that the parameter A can be learned in a very similar way as for linear regression. See McCullagh&Nelder.

Logistic regression
A conditional statistical model of binary variable y given measurement vector x. The probability that y is 1 is given by the logistic function applied to a linear combination of x. That is, p(y=1) = 1/(1+exp(-a*x)). Logistic regression is a generalized linear model (and is really logistic linear regression). The row vector a is the parameter of the model. Logistic regression gives rise to a linear classifier. See a logistic regression overview and "Why the logistic function?".

Gaussian process classifier
A non-parametric logistic regression model analogous to Gaussian process regression. It is equivalent to using basis functions in logistic regression. See "Bayesian Classification with Gaussian Processes" and Rasmussen's Web site.

Feed-forward neural network regression
Basis function regression with adaptive basis functions. Given a measurement vector, each layer of the network makes a linear transformation and then applies a nonlinearity to each vector component. For example, a two-layer network has y = A2*f(A1*x) + e. The matrix A2 contains the basis coefficients while the matrix A1 adapts the basis. The nonlinearity f is fixed and is typically chosen to be a smoothed step function. The resulting basis contains oriented step functions with varying slope. See Bishop and "How many hidden layers should I use?".

Feed-forward neural network density model
Neural network regression except the conditional density of y is an arbitrary function of the network output, i.e. there is not simply additive noise. An example is logistic network regression. Follows the same principle as a generalized linear model.

A multivariate regression expressed as the sum of univariate regressions. That is, y is a sum of functions applied to the components of x, plus additive noise: y = sum_i f_i(x_i) + e. When the conditional density of y is an arbitrary function of sum_i f_i(x_i), such as a logistic additive regression, then you have a generalized additive regression. The parameters of the model are the functions f_i, which are learned recursively by another regression model. These models are very useful for visualization (also see projection pursuit regression). See Hastie&Tibshirani and Backfitting.

Projection pursuit regression
Additive regression on an adaptive feature space. That is, y = sum_i f_i(a_i*x) + e where the row vector a_i takes a linear combination of x. The parameters of the model are the axes a_i as well as the functions f_i, which are learned recursively. See "Projection Pursuit Regression" and "Generalized Projection Pursuit Regression".

Robust regression
A regression model which includes the possibility of measurement outliers. It also refers to parameter estimation that can handle outliers. (But changing your estimator without changing your model is inconsistent behavior.) See Venables&Ripley, Barnett&Lewis, code and papers at Antwerp, and an illustrative applet.

Independent Feature Model (`Naive Bayes')
Any statistical model which assumes that the components of the measurement vector, i.e. the features, are conditionally independent given the class. Like additive regression, this assumption allows each component to be modeled separately. Independent Component Analysis/Projection Pursuit can be used to extract better features for the Independent Feature Model. See Duda&Hart, "A Comparison of Event Models for Naive Bayes Text Classification", and "Beyond independence: Conditions for the optimality of the simple Bayesian classifier".

Linear classifier
Any classifier with straight-line decision boundaries in feature space. Such classifiers can arise from many different statistical models, especially Gaussian class-conditional densities. They can be implemented using a single-layer neural network or Perceptron. See Duda&Hart, Bishop.

Generalized linear classifier
Any classifier implemented by subjecting data to a nonlinear transformation before feeding it to a linear classifier. The transformation may increase or decrease the number of features. The new features are like basis functions in basis function regression, and the classifier is essentially thresholding a basis function regression. It can have arbitrarily shaped decision boundaries. A Radial Basis Function classifier is a special case where the transformation is based on distances to a set of prototypes (the basis functions are "bumps"). See Duda&Hart.

Support Vector Machine
A generalized linear classifier with a maximum-margin fitting criterion. This fitting criterion provides regularization which helps the classifier generalize better. The classifier tends to ignore many of the features. For example, a Support Vector Machine can be used as a regularized Radial Basis Function classifier. See

Finite mixture model
Any probability density expressed as the weighted sum of simpler densities. The simpler densities are the components or states of the mixture. An equivalent definition is hierarchical sampling: to sample from the density, first draw a state at random (using a distribution over states) and then sample from that component. Using Gaussian components leads to a mixture of Gaussians. The distribution over states can itself be a mixture, leading to a deeper hierarchy, which is a hierarchical mixture model. Any sort of stochastic decision tree is a hierarchical mixture model. See David Dowe's website, "Cluster-Based Probability Model and Its Application to Image and Texture Processing", Bishop, and Expectation-Maximization.

Mixture of subspaces
A mixture of Gaussians where each Gaussian models only a subset of the features. The other features have a much smaller variance, producing a flat, pancake-like density. In other words, the covariance matrix of the Gaussian is nearly singular, reducing the number of parameters to estimate. Each Gaussian applies some feature extraction technique like PCA to determine the features to use. It is thus a combination of clustering and feature extraction. See "The EM Algorithm for Mixtures of Factor Analyzers" (with software), "Mixtures of Principal Component Analyzers", "Mixtures of local linear subspaces for face recognition", and "Modeling the Manifolds of Images of Handwritten Digits".

Coupled mixture model
A joint statistical model for a set of variables, where each variable has a mixture distribution. The mixtures are coupled by a joint distribution over states. For example, you could constrain two mixtures to never be in the same state. To sample the variables, you first sample a state for each variable (using the joint distribution over states) and then sample each variable from its chosen component. Coupled mixture models include Hidden Markov Models and Hidden Markov random fields.

Markov chain
Any multivariate probability density whose independence diagram is a chain. In other words, the variables are ordered, and each variable "depends" only on its neighbors in the sense of being conditionally independent of the others. An equivalent definition is that you sample the variables left-to-right, conditional only on the last outcome.

Autoregression
A Markov chain model for a sequence of variables, where the next variable is predicted from the previous variable via regression, typically linear.

Hidden Markov Model
A joint statistical model for an ordered sequence of variables. It is the result of stochastically perturbing the variables in a Markov chain (the original variables are thus "hidden"). The Markov chain has discrete variables which select the "state" of the HMM at each step. The perturbed values can be continuous and are the "outputs" of the HMM. A Hidden Markov Model is equivalently a coupled mixture model where the joint distribution over states is a Markov chain. If you made a scatterplot of the data, ignoring the ordering, it would look like a finite mixture. Only when you consider the ordering will you see that the progression of states has Markov structure: the next state depends on the previous state. See

Autoregressive Hidden Markov Model
A Hidden Markov Model except that each output depends not only on the state but also the previous output, usually via a regression model. This extension can be implemented with surprisingly little extra work. (This differs from Rabiner's terminology. Rabiner's "autoregressive HMM" refers to a particular choice of emission density, and would now be called a crude Hidden Markov Segment Model.) See "Hidden Markov models using shared and global vector linear predictors", "ML estimation of a stochastic linear system with the EM algorithm and its application to speech recognition" (Digalakis, Rohlicek, & Ostendorf, IEEE Trans Speech and Audio 1(4), 1993), and "The Double Chain Markov Model".

Input/Output Hidden Markov Model
A Hidden Markov Model which is controlled by an input sequence. The input sequence can influence the choice of state at each step or it can influence the choice of output at each step. See "Markovian Models for Sequential Data" and "Input/Output HMMs for Sequence Processing".

Factorial Hidden Markov Model
The result of summing the outputs of several Hidden Markov Models. More generally, any arithmetic operation could be used, as long as it only applies to the outputs at a particular time. It is equivalent to one large HMM with a factored state distribution: the state is made up of independent parts. Factoring the state distribution is a general approach to reducing the number of parameters in a hidden variable model. Summing the outputs of ordinary mixture models gives a factorial mixture model. See "Factorial Learning and the EM Algorithm" and "Factorial Hidden Markov Models" (software).

Hidden Markov Decision Tree
A Hidden Markov Model where the distribution over states is hierarchical, as in a Hierarchical Mixture Model. The hierarchical path chosen for the next state depends on the hierarchical path for the previous state. In other words, the outcome at a particular node in the hierarchy is governed by a Markov chain. Thus it is both a Factorial Hidden Markov Model and a Coupled Hidden Markov Model. See "Hidden Markov Decision Trees".

Switching Hidden Markov Model
The result of switching among the outputs of several Hidden Markov Models. That is, only one of the outputs is measured at any particular time. The switching is controlled by a separate Markov chain. See "Variational Learning for Switching State-Space Models".

Hidden Markov Model with Duration
A Hidden Markov Model where the next state depends not just on the previous state, but on how long you have been in that state. Equivalently, it is an HMM with a nonstationary state transition matrix, that changes whenever you stay in the same state, and resets when you leave the state. See Rabiner&Juang (Ch.6).

Hidden Markov Segment Model
A Hidden Markov Model where the outputs are not simply vectors but sequences of arbitrary length ("segments"). The length of a segment and the variables in a segment can have an arbitrary probability distribution conditional on the state. In particular, the variables can be correlated. For example, the segment model can itself be an HMM, giving a Hierarchical Hidden Markov Model. The segment model can be a Gaussian process or (more specifically) a Linear Dynamical System. A Hidden Markov Model with Duration is a special case. A common refinement is that each segment influences the next segment, e.g. to provide continuity; this is the analogue of an autoregressive HMM. See

Linear Dynamical System
A time-series model closely related to the Hidden Markov Model. It generalizes the HMM in some ways and is more restricted in other ways. The generalization is that there is a continuum of possible hidden states; it is an infinite mixture. The restrictions are (1) the state transitions are linear (plus Gaussian noise) and (2) the observations are linear in the state value (plus Gaussian noise). The Linear Dynamical System generalizes linear autoregression in the same way that Hidden Markov Models generalize Markov models. Hence it might be called hidden autoregression. All variables in the sequence are jointly Gaussian: it is a special kind of Gaussian process. See "From Hidden Markov Models to Linear Dynamical Systems", Gelb (Ch.4), and Kalman filtering.

General dynamical system
A general time-series model based on stochastically perturbing the variables in a Markov chain. The Markov chain may be continuous, the state transitions may be nonlinear with non-Gaussian noise, and the observations may be a nonlinear function of the state plus non-Gaussian noise. In such models, exact Bayesian inference is not feasible. Instead we use special approximation techniques which operate sequentially in time, such as extended Kalman filtering and Monte Carlo filtering. Monte Carlo filtering uses Monte Carlo integration to solve the necessary integrals at each step. See

Mixture of Experts
A finite mixture model for random variable y where all the components and the distribution over components are conditional on measurement x. The mixture components are the "experts" and are typically linear regressions. A piecewise linear regression is a special case where x selects a unique expert. The distribution over experts can be hierarchical, as in a hierarchical mixture model, giving a Hierarchical Mixture of Experts. The measurement x influences the hierarchical path, as in a decision tree. See "Learning in modular and hierarchical systems", "A statistical approach to decision tree modeling", "Hierarchical mixtures of experts and the EM algorithm", and Steve Waterhouse's papers.

Decision tree
The deterministic ancestor of the hierarchical mixture of experts. Used widely in Machine Learning, where the most emphasis is on structuring the tree. See Murthy's survey, Ripley, and web site.

Markov random field
A set of random variables whose independence relationships are represented by a neighborhood structure, i.e. an undirected graph. The Markov property asserts conditional independence: given its immediate neighbors in the graph, a variable is independent of all other variables. This property is particularly useful for specifying the conditional distribution of a single variable, making the representation well suited to Gibbs sampling. Specifying all of the conditional distributions completely determines the joint distribution of the variables. This is a very general model, subsuming Markov chains. The color values in an image or the temperature across an object can be modeled as Markov random fields. When the variables are binary, this model is a Boltzmann machine. See Cressie, Li, Hertz (ch 7), Kindermann, and Relaxation labeling.

Simultaneous autoregression
A Markov random field where the each variable is predicted from its neighbors via regression, typically linear. It generalizes autoregression from Markov chains to Markov random fields. See Cressie.

Hidden Markov random field
The result of stochastically perturbing the variables in a Markov random field (the original variables are thus "hidden"). The result is technically still a Markov random field, but some variables are never observed. It is equivalently a coupled mixture model where the coupling distribution is a Markov random field. This generalizes a Hidden Markov Model by including more dependencies between the hidden states. A noisy image or depth map can be modeled as a Hidden Markov random field. See the references on Markov random fields.

Constrained mixture model
A finite mixture model with a restrictive prior distribution on the possible parameters of the components. Most importantly, the component parameters are statistically dependent. For example, a mixture of Gaussians where the means must form a circle or grid is a constrained mixture. Constrained mixtures are useful for approximating principal curves as well as for constrained clustering. Known historically as a self-organizing map. See

Constrained Hidden Markov Model
A Hidden Markov Model with a restrictive prior distribution on the possible parameters of the output densities, which makes them statistically dependent. Equivalently, the Markov generalization of a constrained mixture. Has its origins in speaker adaptation for speech recognition. See

Coupled Hidden Markov Model
A set of Hidden Markov Models whose Markov chains influence each other. There are two orthogonal ways to produce coupling. One is to couple the simultaneous states of the Markov chains, e.g. constraining them to always be in different states at any particular time. Another way is to allow the next state of a chain to depend on the previous state of another chain. Unlike a Factorial Hidden Markov Model, the outputs of the Markov chains do not interact. A Hidden Markov Decision Tree is both coupled and factorial. See "Coupled hidden Markov models for modeling interacting processes".

Stochastic context-free grammar
A joint statistical model for an arbitrarily-long sequence of variables which generalizes a Markov chain. Each sequence is generated, starting from a single symbol, by stochastically applying the productions of a context-free grammar, until no more productions apply. Like a Markov chain, the most likely productions are easy to find for a given sequence, yet the sequence can have long-range correlations. See

Stochastic program
A joint statistical model expressed as an explicit sampling procedure. Stochastic grammars are a special case. See "Effective Bayesian Inference for Stochastic Programs", "Bellman Equations for Stochastic Programs", "Nondeterministic Lisp as a Substrate for Constraint Logic Programming", SCREAMER (software), and the modeling language of BUGS.

## Parameter estimation techniques

Maximum likelihood
A parameter estimation heuristic that seeks parameter values that maximize the likelihood function for the parameter. This ignores the prior distribution and so is inconsistent with Bayesian probability theory, but it works reasonably well. See "Pathologies of Orthodox Statistics".

Maximum A Posteriori
A parameter estimation heuristic that seeks parameter values that maximize the posterior density of the parameter. This implies that you have a prior distribution over the parameter, i.e. that you are Bayesian. MAP estimation has the highest chance of getting the parameter exactly right. But for predicting future data, it can be worse than Maximum Likelihood; predictive estimation is a better Bayesian method for that purpose. MAP is also not invariant to reparameterization; see MacKay's Bayesian methods FAQ.

Unbiased estimation
A parameter estimation procedure based on finding an estimator function that minimizes average error. When the average error is zero then the estimator is "unbiased." The error of the function is averaged over possible data sets, including ones you never observed. The best function is then used to get parameter values. See "Pathologies of Orthodox Statistics".

Predictive estimation
Parameter estimation consistent with Bayesian probability theory. It seeks to minimize the expected "divergence" between the estimated distribution and the true distribution. The divergence is measured by Kullback and Leibler's formula. The distribution which achieves minimum divergence corresponds to integrating out the unknown parameter. Hence predictive estimation can be approximated by averaging over several different parameter choices. See "Inferring a Gaussian distribution", "A Comparison of Scientific and Engineering Criteria for Bayesian Model Selection", Geisser, and Bishop.

Minimum Message Length
A parameter estimation technique similar to predictive estimation but motivated by information theory. Consider compressing the data via a two-part code: the first part is a parameter setting, encoded with respect to the prior, and the second part is the data, encoded with respect to the model with that parameter. Parameters are continuous, and so cannot be encoded exactly---they must be quantized, which introduces error. So we can't choose the parameters which simply compress the data most; we have to choose parameters which compress the data well even if the parameters are slightly modified. The parameter setting which balances this tradeoff between accuracy and robustness is the MML estimate. See
Some related methods:

Bootstrapping
A technique for simulating new data sets, to assess the robustness of a model or to produce a set of likely models. The new data sets are created by re-sampling with replacement from the original training set, so each datum may occur more than once. See "What are cross-validation and bootstrapping?" and "The Bootstrap is Inconsistent with Probability Theory".

Bagging
Bootstrap averaging. Generate a bunch of models via bootstrapping and then average their predictions. See "Bagging Predictors", "Why does bagging work?", and "Bayesian model averaging is not model combination".

Monte Carlo integration
A technique for approximating integrals in Bayesian inference. To approximate the integral of a function over a domain D, generate samples from a uniform distribution over D and average the value of the function at those samples. More generally, we can use a non-uniform proposal distribution, as long as we weight samples accordingly. This is known as importance sampling (which is an integration method, not a sampling method). For Bayesian estimation, a popular approach is to sample from the posterior distribution, even though it is usually not the most efficient proposal distribution. Gibbs sampling is typically used to generate the samples. Gibbs sampling employs a succession of univariate samples (a Markov Chain) to generate an approximate sample from a multivariate density. See "Introduction to Monte Carlo methods", "Probabilistic Inference using Markov Chain Monte Carlo Methods", and the Markov Chain Monte Carlo home page. Software includes BUGS and FBM.

Regularization
Any estimation technique designed to impose a prior assumption of "smoothness" on the fitted function. See "Regularization Theory and Neural Networks Architectures".

Expectation-Maximization (EM)
An optimization algorithm based on iteratively maximizing a lower bound. Commonly used for maximum likelihood or maximum a posteriori estimation, especially fitting a mixture of Gaussians. See

Variational bound optimization
A catch-all term for variations on the EM algorithm which use alternative lower bounds (usually simpler ones). The particular lower bound used by EM can lead to an intractable E-step. With a looser bound, the iterative update is more tractable, at the cost of increasing the number of iterations. Another approach, though less often used, is to use a tighter bound, for faster convergence but a more expensive update. See "An introduction to variational methods for graphical models", "Notes on variational learning", "Exploiting tractable substructures in intractable networks".

Variational bound integration
To approximate the integral of a function, lower bound the function and then integrate the lower bound. Not to be confused with Jensen bound integration. Variational Bayes applies this technique to the likelihood function for integrating out parameters. The EM bound can be used for this, or any of the simpler bounds used for variational bound optimization. See

Jensen bound integration
To approximate the integral of a function, apply Jensen's inequality to turn the integral into a product which lower-bounds the integral. The bound has free parameters which are chosen to make it as tight as possible. Unlike variational bound optimization, the integrand itself does not need to be bounded, and very different answers can result from the two methods. See

Expectation Propagation
To approximate the integral of a function, approximate each factor by sequential moment-matching. For dynamic systems, it generalizes Iterative Extended Kalman filtering. For Markov nets, it generalizes belief propagation. See A roadmap to research on EP.

Newton-Raphson
A method for function optimization which iteratively maximizes a local quadratic approximation to the objective function (not necessarily a lower bound as in Expectation-Maximization). If the local approximation is not quadratic, we have a generalized Newton method. See "Beyond Newton's method".

Iteratively Reweighted Least Squares
A method for maximum likelihood estimation of a generalized linear model. It is equivalent to Newton-Raphson optimization. See McCullagh&Nelder.

Back-propagation
A method for maximum likelihood estimation of a feed-forward neural network. It is equivalent to steepest-descent optimization. See Bishop.

Backfitting
A method for maximum likelihood estimation of a generalized additive regression. You iteratively optimize each f_i while holding the others fixed. It is equivalent to the Gauss-Seidel method in numerical linear algebra. See Hastie&Tibshirani and "Bayesian backfitting".

Kalman filtering
An algorithm for inferring the next state or next observation of a Linear Dynamical System. By making the state a constant, it can also be used for incrementally building up a maximum-likelihood estimate of a parameter. See "An Introduction to the Kalman Filter" (with links), "Dynamic Linear Models, Recursive Least Squares and Steepest Descent Learning", "From Hidden Markov Models to Linear Dynamical Systems", and Gelb (Ch.4).

Extended Kalman filtering
Kalman filtering applied to general dynamical systems with Gaussian noise. At each step, the dynamical system is approximated with a linear dynamical system, to which the Kalman filter is applied. The linear approximation can be iteratively refined to improve the accuracy of the Kalman filter output. Despite the name, extended Kalman filtering is not really different from Kalman filtering. See Gelb.

Relaxation labeling
An optimization algorithm for finding the most probable configuration of a Markov random field. It generalizes the Viterbi algorithm for Markov chains. Other approaches to this problem include Iterated Complete Modes, simulated annealing, network flow, and variational lower bounds. See "Foundations of Relaxation Labeling Processes" (Hummel and Zucker; appears in Readings in Computer Vision), "Self Annealing: Unifying deterministic annealing and relaxation labeling", "Probabilistic relaxation", and Li.

Deterministic annealing
An optimization technique where the true objective function is morphed into a convex function by a continuous convexity parameter. Start by solving the convex problem and gradually morph to the true objective while iteratively recomputing the optimum. It is called "graduated nonconvexity" in statistical physics, where the convexity parameter often corresponds to temperature. See

Boosting
A technique for combining models based on adaptive resampling: different data is given to different models. The idea is to successively omit the "easy" data points, which are well modeled, so that the later models focus on the "hard" data. See Schapire's page, "Additive Logistic Regression: a Statistical View of Boosting", "Prediction Games and Arcing Algorithms", and "Half&Half Bagging and Hard Boundary Points".

Empirical Risk Minimization
A parameter estimation heuristic that seeks parameter values that minimize the "risk" or "loss" that the model incurs on the training data. In classification, a "loss" usually means an error, so it corresponds to choosing the model with lowest training error. In regression, "loss" usually means squared error, so ERM corresponds to choosing the curve with lowest squared error on the training data. It is thus the most basic (and naive) estimation heuristic. This method only uses a loss function appropriate for the problem and does not utilize a probabilistic model for the data. See "Empirical Risk Minimization is an incomplete inductive principle".

## Model selection techniques

Cross-validation
A method for evaluating a statistical model or algorithm that has free parameters. Divide the training data into several parts, and in turn use one part to test the procedure fitted to the remaining parts. It can be used for model selection or for parameter estimation when there are many parameters. Jackknifing is a similar, but slightly different, technique. See

Bayesian model selection
Selecting the model which assigns the highest probability to the data after all parameters have been integrated out. See my research demo page. Also see "Bayesian Interpolation", "Automatic choice of dimensionality for PCA", "Hyperparameter Selection for Self-Organizing Maps", and Heckerman's papers.

Minimum Message Length model selection
The Minimum Message Length heuristic for parameter estimation can be adapted for model selection as well. See "MML Linear Regression" and "Finding Overlapping Components with MML".

Structural Risk Minimization
Selecting the model whose expected future risk is minimal, assuming the parameters of the model will be chosen according to Empirical Risk Minimization. See Vapnik and "An Experimental and Theoretical Comparison of Model Selection Methods"

Other model selection criteria
See modelselection.org.

## Nonparametric modeling

Nearest-neighbor density estimation
A technique for nonparametric density estimation. The density at any point is inversely proportional to the distance to the kth nearest datum. For the conditional density of y given x, take the k nearest data points to x and use them to recursively compute the density of y (using any density estimator). See Duda&Hart.

Nearest-neighbor classification
Nearest-neighbor density estimation applied to predicting the class of an object. To classify measurement x, take the k nearest training measurements and choose the most popular class among them. The quality of this method depends crucially on the distance metric. This method is very sensitive to irrelevant features, so it is usually combined with feature selection. See Dasarathy, "Towards a Better Understanding of Memory-Based Reasoning Systems", and "Flexible metric nearest-neighbor classification".

Nearest-neighbor regression
A technique for nonparametric regression. Take the k nearest data points in x and use them to recursively perform a regression on y (using any regression algorithm). See the local learning Web site.

Kernel density estimation
A technique for nonparametric density estimation. The density is given by centering a kernel function, e.g. a Gaussian bell curve, on each data point and then adding the functions together. The quality of the estimate depends crucially on the kernel function. Also known as Parzen-window density estimation. For the conditional density of y given x, weight the data by distance to x and use the weights to recursively compute the density of y (using any density estimator). If y is a class variable, then the result is a kernel classifier. See Duda&Hart.

Locally weighted regression
A technique for nonparametric regression, similar to kernel density estimation. To predict y from x, weight the data by distance to x and compute a weighted linear regression. The resulting curve can be nonlinear. This is also called Moving Least Squares. Kernel regression is similar but uses only a weighted average, not a linear regression, and tends to chop off extrema of the function. See the local learning Web site.

## Bibliography

R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. New York: Wiley. 1973.

C. Bishop. Neural Networks for Pattern Recognition. Oxford: Clarendon Press. 1995.

N.A.C. Cressie. Statistics for spatial data. New York: Wiley. 1993.

S.Z. Li. Markov Random Field Modeling in Computer Vision. Springer-Verlag. 1995.

Thomas P Minka