Tom Minka

Bayesian model selection uses the rules of probability theory to select among different hypotheses. It is completely analogous to Bayesian classification. It automatically encodes a preference for simpler, more constrained models. Simple models, e.g. linear regression, fit only a small fraction of data sets, but they assign correspondingly higher probability to those data sets. Flexible models spread themselves out more thinly.

The probability of the data given the model is computed by integrating over the unknown parameter values in that model:

    p(D | M) = ∫ p(D | θ, M) p(θ | M) dθ

which reduces to a problem of calculus. This quantity is called the
**evidence** for model `M`.
Specific examples of computing the evidence follow.

A useful property of Bayesian model selection is that it is guaranteed to select the right model, if there is one, as the size of the dataset grows to infinity.
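The Occam effect can be seen in the simplest setting where the integral is analytic. The following sketch uses illustrative models of my own choosing (a fixed fair coin versus a coin with a uniformly distributed unknown bias), not an example from the text:

```python
from math import comb

def evidence_fair(n, h):
    # Model 0: a fair coin -- no free parameters, the likelihood is fixed.
    return comb(n, h) * 0.5 ** n

def evidence_biased(n, h):
    # Model 1: unknown bias p with a uniform prior. The evidence is the
    # Beta integral  ∫ C(n,h) p^h (1-p)^(n-h) dp = 1/(n+1),
    # the same for every outcome h -- the model spreads itself thinly.
    return 1.0 / (n + 1)

print(evidence_fair(20, 10), evidence_biased(20, 10))  # fair coin wins
print(evidence_fair(20, 18), evidence_biased(20, 18))  # biased coin wins
```

For an unremarkable outcome (10 heads in 20 flips) the constrained model has higher evidence; for a skewed outcome (18 heads) the flexible model wins, exactly the trade-off described above.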

[Figures: the most probable polynomial of each degree, 1 through 6, fit to the same data]

The likelihood only considers the single most probable curve, as shown for each degree. Under a Gaussian noise model, the log-likelihood is the negative squared error of the fit, so it always increases with increasing degree. But the evidence reaches a maximum at degree 3, which happens to be the true degree in this case. The data is much more likely to have arisen from a third-degree polynomial than from a high-degree polynomial which just happened to look third-degree.

The same procedure works for regression with splines, radial basis functions, and feedforward neural networks. See my paper "Bayesian linear regression".

Changepoint detection is actually quite simple mathematically, since the integral over changepoint positions is a discrete sum. For example, consider a linear regression changepoint model. The integral over possible lines was solved in the curve fitting problem. Now we sum that quantity over possible changepoints. Here are two examples:

[Figures: two example data sets, with the evidence for each changepoint position plotted underneath]

On the left, there really is a changepoint at t = 9. Plotted underneath is the evidence for each changepoint position, which is an integral over the parameters of the two linear sub-models. Summing over all changepoint positions gives the probability of the data given that there is one changepoint. Comparing this to the evidence for zero changepoints, i.e. the evidence for a single line, gives an overwhelming probability of there being a changepoint.

On the right, there really is no changepoint. Summing over all changepoint positions gives a probability lower than that of a single line. So we correctly infer that there is no changepoint.
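A sketch of this computation, simplified from lines to piecewise-constant segments so that the sub-model evidence is a small Gaussian integral (the prior, noise level, and data below are my own illustrative choices):

```python
import numpy as np

def log_ev_segment(y, sigma2=0.1):
    # Evidence of a constant-mean segment: mean m ~ N(0, 1), noise N(0, sigma2).
    # Integrating out m gives y ~ N(0, C) with C = sigma2 I + 11^T.
    n = len(y)
    C = sigma2 * np.eye(n) + np.ones((n, n))
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

def log_ev_changepoint(y):
    # The "integral" over changepoint positions is a discrete sum:
    # average the two-segment evidence over every possible split.
    n = len(y)
    per_pos = [log_ev_segment(y[:t]) + log_ev_segment(y[t:])
               for t in range(1, n)]
    return np.logaddexp.reduce(per_pos) - np.log(n - 1)

rng = np.random.default_rng(1)
jump = np.concatenate([rng.normal(0.0, 0.3, 15), rng.normal(1.5, 0.3, 15)])
flat = rng.normal(0.0, 0.3, 30)
print(log_ev_changepoint(jump) - log_ev_segment(jump))  # large positive: changepoint
print(log_ev_changepoint(flat) - log_ev_segment(flat))  # the single-segment model should win
```

Comparing the summed evidence against the single-segment evidence gives the log Bayes factor for "one changepoint" versus "none", as in the two figures above.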

A central issue in PCA is choosing the dimensionality of the subspace component so that all of the noise, but none of the signal, is removed. To do this, we compute the probability of the data for each possible subspace dimensionality. For a given dimensionality, this requires integrating over all possible PCA decompositions, i.e. all possible subspaces. This can be done analytically, but it is simpler to approximate it using Laplace's method. Laplace's method essentially just measures the sensitivity of the likelihood to each parameter (the parameters here are the basis vectors for the subspace). Even simpler is the BIC method, which assumes the same sensitivity to each parameter. Both are correct asymptotically.

In the first example, a synthetic data set of 60 points in 100 dimensions has been sampled. Below is plotted the eigenvalues of the true covariance matrix and the observed covariance matrix. These measure the variance in each dimension. The dimensions after 5 are good candidates for noise, since they all have the same variance. The presence of a signal always increases the variance above the noise floor, which we assume is constant.

[Figures: true and observed eigenvalue spectra; likelihood, Laplace evidence, and BIC evidence versus dimensionality]

Both methods peak at dimensionality 5; however, BIC becomes increasingly unreliable at large dimensionalities. In some cases, its second peak may exceed the first.

Cross-validation, a non-Bayesian model selection technique, also picks 5. However, it is far more costly to compute. The evidence approximations can be computed directly from the eigenvalue spectrum, and they are very fast. See the paper "Automatic choice of dimensionality for PCA".
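A sketch of the BIC computation from the eigenvalue spectrum. The parameter count follows the usual recipe for a k-dimensional subspace in d dimensions; the synthetic data and constants are my own illustration, not the 60-point example above:

```python
import numpy as np

def bic_pca(evals, n):
    # BIC approximation to the evidence for each PCA dimensionality k,
    # computed directly from the (descending) eigenvalue spectrum.
    d = len(evals)
    scores = []
    for k in range(1, d):
        sigma2 = evals[k:].mean()            # noise variance estimate
        m = d * k - k * (k + 1) / 2          # free parameters in the subspace
        loglik = -0.5 * n * (np.sum(np.log(evals[:k]))
                             + (d - k) * np.log(sigma2))
        scores.append(loglik - 0.5 * (m + k) * np.log(n))
    return np.array(scores)

rng = np.random.default_rng(2)
d, n, k_true = 20, 200, 5
A = rng.standard_normal((d, k_true))        # signal in 5 directions
X = rng.standard_normal((n, k_true)) @ A.T + 0.5 * rng.standard_normal((n, d))
evals = np.linalg.eigvalsh(np.cov(X.T))[::-1]   # descending eigenvalues
print(np.argmax(bic_pca(evals, n)) + 1)     # most probable dimensionality
```

Everything needed is the eigenvalue spectrum and the sample size, which is why this is so much cheaper than cross-validation.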

If the classifier function is linear in the parameters, then the task becomes similar to curve fitting as above. We still have the choice of which basis functions to use. Consider three different bases: linear, quadratic, and Gaussian kernel. Linear means our classifier function is simply a weighted sum of the measurements, so the boundary is a line. Quadratic means the function is a quadratic function of the measurements, so the boundary is a parabola, hyperbola, or ellipse. Gaussian kernel means our classifier function is a weighted sum of Gaussians centered on the data points. Example classifiers are shown below.

[Figures: linear, quadratic, and Gaussian kernel classifiers on three example data sets]

Intuitively, we see that the "best" classifier in each case balances simplicity against the need to fit the data. For Bayesian model selection, we integrate over the possible classifier functions for each model. Performing this integral is complicated by the fact that the likelihood of a classifier function is not a smooth function of its parameters. The likelihood is relatively constant until a data point becomes mislabeled, at which point it drops abruptly. Thus the integral has a combinatorial flavor.
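The combinatorial flavor shows up clearly in a brute-force Monte Carlo estimate. In this sketch I use a hard 0/1 likelihood and a linear classifier through the origin (no bias term), both simplifications of my own: the evidence then reduces to the fraction of prior weight vectors that label every point correctly.

```python
import numpy as np

def mc_evidence(X, y, n_samples=200000, seed=0):
    # Monte Carlo estimate of the evidence for a linear classifier
    # through the origin. With a hard 0/1 likelihood, the evidence is
    # the N(0, I) prior mass on weight vectors that label every training
    # point correctly -- a piecewise-constant integrand, so the estimate
    # is an average of 0/1 indicator values.
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_samples, X.shape[1]))
    correct = (np.sign(W @ X.T) == y).all(axis=1)
    return correct.mean()

rng = np.random.default_rng(1)
y = np.array([1] * 10 + [-1] * 10)
wide = np.vstack([rng.normal([3, 3], 0.5, (10, 2)),
                  rng.normal([-3, -3], 0.5, (10, 2))])
tight = np.vstack([rng.normal([0.5, 0.5], 0.5, (10, 2)),
                   rng.normal([-0.5, -0.5], 0.5, (10, 2))])
print(mc_evidence(wide, y), mc_evidence(tight, y))  # wider separation, larger evidence
```

Sampling like this is expensive precisely because the integrand jumps at each labeling boundary, which is what motivates a deterministic approximation.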

In my PhD work, I developed a deterministic method for approximating this integral which is faster than Monte Carlo methods. See "Expectation Propagation for approximate Bayesian inference".

- The section on model selection techniques in my statistical learning glossary.
- "Bayesian Model Selection and Model Averaging" by Wasserman.
- Papers by Raftery et al., especially "Bayes factors and model uncertainty".

Thomas P Minka Last modified: Fri Sep 23 08:48:26 GMT 2005