## 2/18/97

Paper: ``Learning from incomplete data'' Z. Ghahramani and M. I. Jordan. AI Memo 1509, CBCL Paper 108, January 1995, 11 pages.

This paper presents a general framework for coping with missing data. Interestingly, classification and regression are special cases where the output is the missing attribute. In these cases, Ghahramani and Jordan's algorithm reduces to a piecewise linear form similar, but not identical to, a mixture of experts. This is a good demonstration of how algorithms for density estimation can be converted into algorithms for other learning problems. Of course, missing data is itself an important problem, especially in meta-learning.

Tom Minka led the discussion. Two well-known approaches to dealing with missing data are mean imputation, where the value is replaced with its unconditional mean, and regression imputation, where the value is predicted from the rest of its feature vector. After the missing value is imputed, learning proceeds as normal. This works fine for many algorithms, e.g. nearest neighbor, Bernoulli, or separable Gaussian, but for others, e.g. fitting a full-covariance Gaussian, it can lead to biased estimates.

A completely general, but expensive, approach to dealing with missing data is multiple imputation: generate several predictions, sampled from the distribution of the missing value conditioned on the rest of its feature vector. Then run the normal algorithm on this enlarged data set. Ideally, the algorithm should "trust" a sample according to its probability under the model it was generated from (e.g. weighted least squares).

The authors essentially present a simpler way to do multiple imputation. If the conditional distribution of a missing value has a simple form, e.g. Gaussian, then the distribution itself can be used as the value, rather than a sample from this distribution. This is in fact the E step of an EM algorithm. The learning algorithm then must be changed to deal with samples which are not points but distributions. This is easy when fitting a simple model like a Gaussian, as shown in the paper. You might call this "distribution imputation" or "sufficient statistic imputation": impute not only the mean but also the parameters of the missing value's distribution. In the particular cases chosen by the authors, the joint model being learned happens to have the same form as the conditional imputed distributions, but this is quite rare and I think need not be true for the method to work.

Kris Popat identified a problem with joint modeling as a lead-in for regression. Given limited modeling resources, they are best spent modeling changes in the output, not the input. Of course, some modeling effort should go to the input, in order to complete missing data, but comparatively little. It seems better to use a cheap joint model in the input space and a complex conditional model for the output space.

## 2/24/97

Paper #1: ``Nonlinear Image Interpolation using Manifold Learning.'' C. Bregler and S. M. Omohundro. Advances in Neural Information Processing Systems 7, 1995, pps. 973-980.

Paper #2: ``A variational principle for model-based interpolation.'' L. K. Saul, and M. I. Jordan. MIT Computational Cognitive Science Technical Report 9703, January, 1997.

Tony Jebara led the discussion and presented some of his recent work on model-based interpolation.

The main difference between the two papers is that Bregler uses a hard manifold while Saul uses a soft density. Saul's interpolant can go anywhere while Bregler's must be on the manifold. Bregler's three algorithms (free-fall, manifold-walk, manifold-snake) are based on repeatedly projecting to the manifold. Saul's variational method explicitly minimizes an objective function based on the length of the path and the probability of each point on the path. However, Saul's method is exact only for unimodal densities and requires significant work and approximations on mixtures. Bregler's algorithms are insensitive to the complexity of the manifold.

Saul has derived an interpolant for Dirichlet densities, which can be used to "average" densities together, e.g. the output density of multiple classifiers.

A problem with the free-fall approach is that it can lead to discontinuous interpolants. Consider the interpolant across a C-shaped one-dimensional manifold.

Bill Freeman suggested modifying manifold-walk to project the desired velocity onto the tangent of the manifold. This would make sure that the path is continuous on the manifold and not be as sensitive to the step size.

A problem with the manifold-snake algorithm is that the smoothness term competes with the underlying manifold, which may be rough. The snake will either take long paths to get around rough spots or it will smooth them over.

Tony Jebara pointed out that the "line tension" term in Saul's objective function can be omitted, since the variance of the density has the same effect as line tension.

Kris Popat noted that maximizing the probability of each point on the path is not the same as maximizing the probability of the path, which is more stringent. An alternative criterion would be to minimize the description length of the path, which could simultaneously force the path to be short and each point to be probable, while being a more general and practical objective.

Kris also pointed out an interesting variant on interpolation, which is to smooth a time-series. This problem can be solved in a similar way, e.g. the mainfold-snake could have an extra term tethering each knot to one of the noisy observations.

## 3/3/97

Paper #1: "Family Discovery" by Stephen Omohundro.
Paper #2: "Surface Learning with Applications to Lipreading".

This paper describes the meta-learning perspective and gives three algorithms based on manifolds in model parameter space. The idea is to restrict the parameter values that the learner can use.

Tom Minka led the discussion. Several difficulties with the manifold approach were identified:

1. The algorithm does not take the number of examples per episode into account. Episodes with many examples should have a large weight in determining the manifold.
2. It is a batch algorithm, requiring all examples from all previous problems to be present. This is impractical in time and space.
3. The criteria for a good manifold fit are quite different than the criteria for good meta-learning. In particular, Euclidean distance on classifier parameters may be meaningless. Even when the parameters are the means of Gaussians, changing these means has a nonlinear impact on classification accuracy.
4. The quality of the manifold model is highly dependent on the parameterization of the classifier. For example, if the linear decision boundary was parameterized by slope and offset instead of the class means, the test problem would be much easier. Of course, representation is always important in learning, but in this case we've coupled the inner workings of the classifier with the meta-learning algorithm.
5. The manifold is not a general-purpose representation of the domain that can be immediately applied to new tasks. This is partially because the algorithm does not work on-line, but also because it is difficult to use the learned manifold with another classifier (with different parameters) or with a different kind of problem, e.g. clustering or noise removal.
6. The manifold projection operator does not appear to be idempotent, as it should be. Consider two defining planes which meet at a right angle. A point inside of the angle will first be projected onto the two planes, then a convex combination of these projections is taken. The output must be on the line connecting the two projections. However, if we re-project any point on this line, it must end up on another line closer to the corner.
(The Canonical Distortion Measure approach, discussed next time, avoids most of these difficulties.)

It is not clear exactly how the meta-learning algorithm was tested. Here are two possibilities:

1. The learner was tested on extra data held out from the original episodes.
2. The learner was trained on a new episode and tested on extra data for that episode.

The end of the paper suggested automatic episode discovery, for problems like changing speakers or changing font. It is interesting that both of these assume episodes are changing in a Markov fashion. In other words, the generative model is to pick an episode (based on the previously chosen episode) and then pick an IID sample from that episode. Note that if episodes were also chosen IID, there would be no way to do segmentation. However, suppose samples were not chosen IID, but based on the previously chosen samples from the same episode. In this case, episodes could be IID. This would correspond to simultaneously hearing two independent conversations. We can still segment the conversations because of the word order. A fourth possibility is for samples and episodes to be Markov; this corresponds to two people repeatedly interrupting each other in conversation.

Click here for a plot of the episodes Omohundro used. The mean of each class is shown as gamma_1 and gamma_2 vary from 0 to 10. The braided pattern arises since the angle of the discriminant varies with the position of the means.

## 3/10/97

Paper #1: "The Canonical Distortion Measure for Vector Quantization and Approximation" by Jonathan Baxter.

This is the original paper on the CDM. It shows how nearest-neighbor can be interpreted as a Bayesian technique where the prior is encoded by the distance metric. Baxter derives it for arbitrary loss functions, but the zero-one loss and quadratic loss are interesting special cases. He learns the metric directly from episodes.

Paper #2: "Is Learning the n-th Thing Any Easier Than Learning The First?" by Sebastian Thrun.

This paper also shows how distance metrics can encode priors obtained from episodes. Thrun's classification metric is the same as the CDM with a zero-one loss function. Besides nearest neighbor, Thrun uses it for kernel regression and the MLP. However, he does not prove Bayes optimality.

My research suggests that many conventional learning algorithms can be suitably generalized so that (1) they are Bayesian and (2) the CDM encodes their prior. A lingering issue is how to learn the CDM. Baxter and Thrun both use MLPs. My Master's thesis used an ultrametric technique called "hierarchy improvement". Lately, I am investigating kernel regression (in the metric space).

Tom Minka led the discussion. Baxter's CDM can be derived from a simple nearest-neighbor argument:

1. You want to choose the neighbor which minimizes your expected loss.
2. Nearest-neighbor chooses the neighbor which minimizes distance.
3. Therefore, set the distance metric to be the expected loss.
The CDM makes clustering in the input space equivalent to clustering in the output space. Hence Baxter's Theorem 1: the reconstruction error of the quantization is exactly the expected approximation error, thus it suffices to minimize the reconstruction error.

The CDM has a particularly nice form when the loss function is squared error. In fact, it is the "variogram" used in geostatistics, given a zero mean process. Optimal linear estimation can be done directly from the variogram, instead of the traditional mean/covariance decomposition. With zero-one loss, the CDM reduces to Pr(f(x) != f(y)), which is Thrun's metric.

The CDM is Euclidean distance when the environment is a uniform distribution of linear functions. This makes sense since the difference in output is always proportional to the difference in input, under a linear model. However, we can also reverse the argument: using Euclidean distance implicitly assumes a linear environment. We can recover the bias or prior behind any nearest-neighbor algorithm, because of the CDM's optimality.

Baxter's approach to learning the CDM requires that each episode has the same sample locations. He explicitly averages the loss between every pair of sample locations. Thrun's approach is more subtle: he trains the network on the actual loss between every pair of samples, for every episode. If a pair appears twice, the network will automatically average the outputs. (I don't think that either method is the best way to train a CDM.)

Thrun's idea of biasing a neural network via slopes, rather than using nearest-neighbor with a different metric, has the advantage of robustness. If the metric is poorly estimated, nearest-neighbor will suffer. If the slopes are wrong, the neural network can still learn as before. You can even use cross-validation to determine how much the slopes should count. However, there are some drawbacks:

• When the metric is well-estimated, nearest-neighbor is the better choice since it gives the domain knowledge more impact.
• Assuming that the current task has the same derivative as previous tasks is not mathematically well-motivated (yet).
• Modeling the slope of the CDM is not necessarily easier than modeling the CDM itself. The slopes you get while the CDM is poorly estimated might also be garbage; it depends on the modeling technique.

Thrun's object recognition task is intriguing since it also can be treated as a base-level learning task. No image was simultaneously labeled as two different classes. Therefore, we could throw all of the images into a common pool of labeled examples and apply a traditional classifier. Interestingly, this classifier does worse than the one that results from meta-learning across the object classes. This suggests that if we take our normal classification tasks and treat them as meta-learning problems, we can improve classification performance. Paradox? Not really. What the meta-learning algorithm is doing is forcing each class to have a similar structure. This is a well-known regularizer, e.g. in covariance estimation (Hoffbeck and Landgrebe, PAMI Sept'96). Thus we see how meta-learning algorithms can help us with hierarchical structure, even in our base-level tasks.

### 3/17/97

Paper: "Separating Style and Content" by Josh Tenenbaum and Bill Freeman. Appears in NIPS 9.

This paper explores a two-component linear factorial model, e.g. how independent choice of font and letter combine to form an image. When the style and content of examples are known, this reduces to SVD. When content is not known, an EM algorithm effectively recovers correspondence between samples of different styles. This constrained representation permits a number of interesting applications.

Josh and Bill led the discussion. They emphasize that the bilinear model is simple to fit and yet general enough to be used for regression ("extrapolation") or clustering ("classification" and "translation") tasks. Indeed, general-purpose models of bias are an important topic in meta-learning.

It is useful for intuition to consider the case with the largest possible dimensionality. In this case, filling the W, A, or B matrix with the training images is a valid model. If we use a B matrix, then a new row (all of the images in a style) must be a linear combination of existing rows. Using an A matrix is the same except that the Nth pixel of the images in a style is considered a seperate row in the combination. If we use a W matrix, then not only must each new row of images be a linear combination of existing rows, but also each new column of images must be a linear combination of existing columns. Thus all of the new content in a style is forced to be a linear combination of the observed content in that same style.

Solving the extrapolation problem with this kind of bilinear model is formally identical to doing regression across styles using a linear model. The input is a tuple of images across styles and the output is the image in the new style, in the same content. The choice between an A matrix or a B matrix is a choice in the form of the regression model, i.e. whether it uses individual pixels or entire images as its unit.

I like to interpret the extrapolation, classification, and translation problems in meta-learning terminology. In each case, we are given a context set of N functions, either represented in their entirety or by samples. In extrapolation, we want to complete the (N+1)st function from samples, using this background knowledge. The solution used here is to force the new function to be a linear combination of previous functions. Note that this works for any functional form which is linear in its parameters. For example, any polynomial of degree k can be represented as a sum of given polynomials of degree k, as long as we have enough of them.

The next two problems are novel in the meta-learning literature. In classification, we want to find a function which produces the given output values somewhere in the input space of content vectors. This requires that the context functions be easily invertible, which the linear model handily provides for us. The solution used here is to cluster the images under a constraint on where the cluster means can be. The cluster mean corresponds to the position in the input space. The translation problem is identical to classification except that the context functions are represented by samples instead of in their entirety, i.e. there is missing data. The solution used here is to use regression imputation to complete the context functions and then to perform classification. With the W matrix model, the regression simply takes linear combinations of the observed samples. The observed samples must have a fairly high dimensionality for this to make sense.

Classification in a new style is a rather unusual learning task, since it violates the widespread assumption that the test data comes from the same distribution as the training data. In other words, we can't build a function which maps individual images into content classes based on the training data alone. Tenenbaum and Freeman's trick is to not act on individual images, but on the test set as a whole. They cluster the test data while assigning content classes. Thus the size of the test set influences performance on any particular member of the test set! That this is counterintuitive is indicative of how widespread the assumption is.

Here is a fourth task: content classification in the existing styles. The style of the test point is not given and the actual content need not be an existing content class. This does not violate the same distribution assumption and can be solved by a conventional regression algorithm. In particular, if we use a mixture of experts with linear combination-type experts (one for each style) we get a solution very similar to the SMM.

### 3/24/97

Paper: "Factorial Learning and the EM Algorithm" by Zoubin Ghahramani. Abstract and postscript

This paper gives EM algorithms for factorial learning, in the context of an additive model called Cooperative Vector Quanization. It is interesting to see why these models are hard to fit by traditional methods, giving some motivation for the Helmholtz machine. Analyzing the corresponding Bayesian network gives some clues to CVQ's intractability.

Optional: "Autoencoders, Minimum Description Length, and Helmholtz Free Energy" by Hinton and Zemel. Appears in NIPS 6.

Tom Minka led the discussion. Hinton and Zemel introduced the FVQ model as an attempt to blend factorial PCA and local VQ. Using the bits-back argument, they say that autoencoders should be stochastic in order to truly minimize description length. The resulting autoencoder is a special type of Boltzmann machine (a Markov random field with pairwise cliques, sampled by the Gibbs technique, which can be likened to a stochastic perceptron network). Hinton and Zemel give a gradient algorithm for training this Boltzmann autoencoder, which would later evolve into the Helmholtz machine (a further restricted class of Boltzmann machines). Note that a Boltzmann machine is itself a special type of undirected Bayesian network.

Ghahramani introduced an alternative approach to FVQ, called CVQ, and asked if it could be fit with EM. It can, but the exact E-step has exponential cost. He suggests Gibbs sampling and mean-field approximation as two alternatives for the E-step. The M-step is easy.

Hinton and Zemel modeled the FVQ as an undirected graph, as indicated by their appeal to the Boltzmann machine. Ghahramani modeled the CVQ as a directed graph, as indicated by these statements:

• When the output is unclamped (unobserved), the causes are independent.
• When the output is clamped (observed), the causes are dependent.
His CVQ model is the simplest kind of "conditional Gaussian" Bayesian network, where a continuous Gaussian node has multiple discrete parents. The first algorithm is Pearl's exact inference algorithm, which we know to take exponential time. This can also be seen from the minimal consistent undirected graph, which is fully connected. Gibbs sampling has been used by Radford Neal. Mean-field, however, is novel and is a crossover idea from neural networks to Bayesian networks. Gradient descent is another crossover; see the paper by Binder et al. Perhaps the Helmholtz machine contributes to solving undirected graphs.

The mean-field approach can be described formally, but intuition about when it applies is lacking. Gibbs sampling is also an approximate method, but at least we know that it always converges at infinity. Mean-field does not converge so how else can we motivate it?

CVQ assumes that the conditional probability table (CPT) is additive, but in fact all three of Ghahramani's approaches, plus Binder's gradient descent, work with any CPT. For example, we can use a multiplicative model as in "Separating Style and Content." The CVQ also assumes uniform priors on the causes, but this need not be true, in fact the priors can evolve in time, e.g. according to a Markov process, giving us a "cooperative HMM." By changing the CPT we can make a "bilinear HMM." It should also be possible to make the causes continuous-valued and still remain in the reach of Bayesian network algorithms.

It is heartening to see that these different schools of unsupervised learning have converged on the same algorithms. It gives us confidence in equating unsupervised learning with unconditional density estimation and in Bayesian networks as a unifying formalism.

### 4/1/97

No paper this week. Tom Minka presented an overview of statistical learning and suggested some fundamental issues for research, which can be found here.

Nuria mentioned two simple ways that a prior can be incorporated in the learning process. One way is generate-and-test: sample different solutions from a conventional learning algorithm and take the one with most probability under the prior. (Could vote them instead.) Another way is to project the maximum likelihood solution onto a manifold of highly likely solutions, used by Omohundro (see above). Iterative algorithms like EM can project onto the manifold at every iteration, simulating Bregler's "manifold-walk."

Meta-learning algorithms have a direct correspondence with "universal" coding algorithms, such as LZW, used in UNIX compress and gzip. You can find a nice LZW tutorial here. Communication theory studies complex encoding and simple decoding. Learning theory studies simple encoding and complex decoding.

Kris Popat pointed out that many algorithms learn structure, as opposed to learning parameters. How are we to interpret this? One way is to call the structure more parameters, noting that not all combinations of the augmented parameters are valid. This corresponds to a prior on the augmented parameters. What form does it have? Is there a way to recover it automatically (e.g. by sampling from the learner)? Under this theory, a modification of EM which creates new clusters should be equivalent to a clustering algorithm with a fixed number of constrained parameters. What are they?

Bayesian networks emphasize conditional independence as a way of abstracting a probability distribution. The abstraction is useful because the structure and complexity of an algorithm for probability evaluation is usually determined by independence relationships. The same seems to hold for learning algorithms and models, i.e. many algorithms, such as Bayesian learning, EM, and meta-learning, as well as many models, like mixtures, factorial models, and tree-structured models, are essentially exploiting an assumption of independence. We saw last time how the algorithms for the CVQ architecture came from the structure of the underlying network and not from the particular CPT. It might be useful to generalize other learning algorithms in this way by ignoring the particular CPTs they use and focusing on the independence relationships. This would be a way to identify classes of learning algorithms, like the class of "mixtures of experts" algorithms.

Tom presented an overview of feature selection research, which seems to use three main approaches:

Wrapper
Feature selection uses the learning algorithm as a subroutine. Search through feature sets either heuristically or by brute force, trying to maximize the cross-validation performance of the learner trained with that feature set. This offers the best performance but at the highest price. Sequential Forward Selection and Sequential Backward Selection are wrapper methods.

Joint
Feature selection happens concurrently with training. If we think of an optimization surface where one dimension is the learning parameters and another is the feature set, then this method does a search in the joint space, while wrappers iteratively fix one dimension (the feature set) and search through the other dimension. Neural networks are well suited to joint methods since feature selection can be subsumed by the network weights. Optimal Brain Damage and Optimal Brain Surgeon are joint methods.

Filter
Feature selection happens prior to training. The idea is to avoid search by optimizing performance on a simple learning model, for which feature salience is known in closed form. Using the Fisher information matrix is an example, where a linear classifier is the assumed learning model. This offers low-quality but fast feature selection. When there are a lot of features, it makes sense to apply a conservative filter method first and then a wrapper method on the rest.

Feature construction and feature modification (like discretization) seem to follow a similar trend. For example, discriminant analysis is a filter-based feature construction method while learning a covariance matrix during clustering is a joint feature construction method.

### 4/7/97

Paper #1: "Optimal Brain Damage" by Le Cun, Denker, and Solla. Appears in NIPS 2.

Paper #2: "Optimal Brain Surgeon" by Hassibi and Stork. Appears in NIPS 5.

Paper #3: "Optimal Brain Surgeon: Extensions and Performance Comparisons" by Hassibi, Stork, Wolff, and Watanabe. Appears in NIPS 6.

Paper #4: "Fast Pruning Using Principal Components" by Levin, Leen, and Moody. Appears in NIPS 6.

These are short and simple papers about feature selection with neural networks. They are actually more general in that they remove parameters as well as features. The fourth paper is most general in that it removes effective parameters as opposed to physical weights, i.e. it causes some of the weights to be linearly dependent.

Tom Minka led the discussion. Network pruning algorithms are joint-type feature selection algorithms, since they do selection while tuning the learning parameters. A wrapper-type algorithm would prune the network first and then train it. It is interesting to see why the joint approach works better in practice. The idea of pruning a grossly overfitted structure has been used in several places:

• Decision trees, e.g. CART
• Best-first model merging, e.g. for learning HMM structure
• Dropping low-valued eigenvalues from a covariance matrix
Some possible reasons for its superiority are:
• Small structures do not have enough slack to model noise. Large structure will explicitly fit the noise, which we can then remove from the model. It is easier to model both signal and noise, then throw away the noise, rather than model the signal directly.

• Learning naturally requires more space than the final model. This point was made about human learning in Minsky's Society of Mind. At the start of the learning process, we build overly complex models, full of exceptions. Then we gradually simplify our model and recover the underlying theory behind the examples. By training a large network, you give it enough slack to build the inefficient concepts on the way to the right concept.

Optimal Brain Surgeon can be derived from the Bayesian approach to neural networks. In the Bayesian approach, we have a posterior distribution over weights, not just one set of weights. With an equal prior over weights, the distribution is given by the likelihood of the data, equivalently the exponential of the negative error. Hassibi and Stork approximate this posterior with a Gaussian, centered at the maximum likelihood solution, whose variance is the inverse of the error Hessian. Our goal is to set one of the weights to zero, while maximizing the data likelihood. This is equivalent to finding the most probable point on the posterior, conditioned on weight k being zero. Since the posterior is Gaussian, this is the same as finding the conditional mean of the posterior. Hence we get the OBS formula.

Various other pruning techniques can be derived from other approximations to the posterior, e.g. a reduced-rank Gaussian or a mixture of Gaussians. Each component center might be the weights resulting from a training run from different initial conditions. It isn't clear whether Principal Components Pruning can be interpreted in this way.

Chris Bishop's book, Neural Networks for Pattern Recognition, discusses these network pruning algorithms, methods for computing the Hessian and inverse Hessian, and the Bayesian approach to neural networks.

### 4/14/97

Paper: "Distributional Clustering of English Words" by Fernando Pereira and Naftali Tishby and Lillian Lee

This is a warm-up paper for my practice presentation on clustering algorithms. This paper is interesting because:

1. It performs clustering in an unusual space: the space of one-dimensional probability densities.
2. It uses the deterministic annealing approach to optimization.
3. It formulates clustering as minimizing an objective function (not all clustering algorithms do this).
4. It formulates clustering as fitting a statistical model (not all that do (3) also do this).
5. It suggests the application of thesaurus construction in document retrieval, where clustering is heavily used.
6. More generally, clustering is being used here to do "local learning", a common application.

### 4/22/97

Paper: "Toward Optimal Feature Selection" by Daphne Koller and Mehran Sahami

This paper presents a filter-type feature selection method which is capable of eliminating redundant features, using a novel Markov blanket technique. It can also be viewed as an algorithm for removing links from Bayesian networks.

Tom Minka led the discussion. Feature selection is essentially a joint search in the space of feature sets and classifier parameters. Some algorithms, like Optimal Brain Surgeon, search directly in this space. Wrapper algorithms are a special case where we fix the feature set while optimizing the parameters. Filter algorithms are a further special case where the algorithm we are wrapping is oversimplified so that we can do the search efficiently. For example, if we assume a linear classifier then feature salience can be computed directly.

The authors argue that a filter method should not simply minimize classification error, since the classification model is simplified. Instead, it should minimize distortion in the class posterior. Unfortunately, this, too, corresponds to a simplified classification model and may preserve irrelevant information at the expense of relevant information. However, given that we don't know what the relevant information is yet, this equal-opportunity approach seems reasonable.

Directly minimizing this criterion is infeasible since the class posterior has too many dimensions. With m binary features, the probability table for P(C | F) has 2^m independent parameters to be estimated. Fortunately, it is sufficient to eliminate features which possess Markov blankets and these features can be eliminated in any order (the algorithm is "confluent"). Unfortunately, testing a Markov blanket requires evaluating P(F-M, C | M), which has essentially the same number of independent parameters.

An approximation to finding a Markov blanket is to find a separator set which renders C and Fi independent. This approximation gives up confluence and may eliminate the wrong features. But it seems to do well in practice. The final algorithm is very simple and doesn't involve Markov blankets. Just pick an arbitrary set of K+1 features and see if the class posterior doesn't change if one of them is dropped. This is a wrapper algorithm over a simple (but not naive) Bayesian classifier which makes inferences from histograms. It works because the classifier only uses K+1 features at a time.

This algorithm can be interpreted as a heuristic for eliminating links from a Bayesian network. You do feature selection by starting with a fully connected net and removing links to C. However, there are other algorithms for learning Bayesian networks, of which removing links is a special case; see Heckerman's tutorial. Koller's heuristic is nice since it only requires conditional independence assessments over small sets of variables. It seems like it should be possible to recover the complete structure of a Bayesian network just from joint distributions over triples of variables.

More generally, it is interesting to note how algorithms dealing with features have "duals" which deal with samples. For example, Duda and Hart point out that clustering can be applied to features as well as samples. A feature might be represented as a histogram while a point as a vector. Clustering can be used for constructing "prototypical features" just as it is used for constructing prototypical samples. Clustering can also be used in a divide and conquer setting where experts compete for features just as they can compete for samples. Similarly, features can be selected and samples can be selected. The dual of forward feature selection might be "active learning" or learning with queries, e.g. the Query by Committee algorithm. The dual of backward feature selection is dataset pruning, often used in conjunction with nearest neighbor classification. A joint search method for sample selection is like robust statistics, where we fit a model while discarding data. A wrapper method would search through small datasets for the one which maximizes performance. A filter method would try to efficiently compute the "salience" of a data point or try to find redundant data points.

This paper contains an additional argument for model pruning (cf 4/7). Pruning algorithms try to take small steps away from the "correct" answer, while growing algorithms try to take large steps away from the "wrong" answer. There is no guarantee that these big steps will pan out in the end. However, growing algorithms tend to be cheaper since they are working with smaller models.

### 4/28/97

Paper #1: "Self-Organization as an Iterative Kernel Smoothing Process" by Filip Mulier and Vladimir Cherkassky. Appears in Neural Computation 7(6).

Paper #2: "A Growing Neural Gas Network Learns Topologies" by Bernd Fritzke.

These are modern interpretations and variants of the Kohonen self-organizing map for unsupervised learning. The self-organizing map combines the two most popular unsupervised techniques: principal component analysis and vector quantization. It can be thought of as clustering with an ordinal cluster index, which means you can use the resulting cluster values in more powerful ways. The self-organizing map is superior in most of the applications of clustering: nonlinear feature extraction, creating a new distance metric, and visualization (it competes with multidimensional scaling). It can also be used instead of PCA for (nonlinear) dimensionality reduction, model-based interpolation, and regularization (because it learns a parameterized manifold; cf 2/24).

Tom Minka led the discussion. Mulier and Cherkassky show how an iteration of batch SOM is the same as an iteration of LBG followed by a "smoothing" of the center locations. The smoothing is not determined by a soft assignment of the samples, as if we were doing deterministic annealing, but rather by a neighborhood relationship between the centers that is set up beforehand. The smoothing can be interpreted in several ways:

1. Kohonen interpreted it as "crosstalk" between neighboring neurons, where activity in one neuron would cause activity in other, possibly unrelated neurons. The neighborhood relationship is literally the adjacency of neurons in the brain.
2. Luttrell interpreted it as a noise model on the center numbers, where a sample might accidentally be encoded as a prototype which isn't the nearest, hence influencing the optimal position of other prototypes. The neighborhood relationship is given by the noise model.
3. You can also think of the combined set of center locations as a random vector upon which we are imposing a restrictive prior distribution. The neighborhood relationship is given by the prior.

The smoothing provides the advantage of a continuous mapping, i.e. we can guess where a new center might go, simply by knowing its neighbors. Thus we have a continuous map from center number to input space. This property also makes it easy to add new units dynamically, i.e. have a "growing" SOM.

Traditionally, the smoothing has always been kernel smoothing, where the kernel has two degrees of freedom: the weight of the nearest center and the weight of a neighboring center. The authors suggests expanding our repetiore, e.g. to locally linear smoothing. Interestingly, a SOM with linear smoothing looks a lot like Hastie and Tibshirani's "principal curve" algorithms. It also bears similarity to Omohundro's locally linear manifolds.

It would be interesting to see what the corresponding priors over center locations are for different smoothers. Kernel smoothing implies a Gaussian prior. What about locally linear smoothing?

Fritzke's main contribution (building on Martinetz) is a way to learn the neighborhood relationship. The learning procedure is incredibly simple: if two centers are frequently the two closest centers to a sample, make them neighbors. If two centers are not the two closest frequently enough, don't make them neighbors. In this case, the input space is driving the neighborhoods, but in a long-term manner. This simple learning rule is quite effective; see the Java demo. I'd like to see a statistical interpretation of this rule.

### 5/5/97

Paper #1: "GTM: A Principled Alternative to the Self-Organizing Map" by Christopher M. Bishop, Markus Svensen and Christopher K. I. Williams. Appears in NIPS 9. GTM home page

Paper #2: "A Bayesian Analysis of Self-Organizing Maps" by Stephen Luttrell. Appears in Neural Computation 6(5).

Tom Minka led the discussion, with help from Josh Tenenbaum. It is helpful to remember that the GTM is an alternative to the SOM, not a Bayesian model of a SOM (as attempted by Luttrell). There are particular parameter settings of the GTM which make it similar to a SOM, but these are mostly accidental similarities.

The GTM is a generative statistical model. It starts with a latent space, usually one or two dimensional, which is mapped into a higher-dimensional activation space and then subjected to a linear transformation W to get a manifold in the input space. Data is generated by sampling from a spherical Gaussian convolved with this manifold. By changing W, we can move the manifold around to make the observed data more likely. As the authors point out, if the activation space is the same as the latent space, then the manifold is always a hyperplane, and we are just doing factor analysis. Otherwise, the manifold can be nonlinear, generalizing factor analysis.

As a computational device, the latent space is evenly sampled, resulting in a sampled manifold in the input space. This gives a finite Gaussian mixture model in the input space whose means are constrained by the allowable transformations W. It is now straightforward to derive an EM algorithm for adjusting W. Interestingly, this is the same algorithm used by the asymmetric style/content model (see 3/17) where W is the style matrix and the activations are the content vectors.

Taking low-dimensional inputs through a nonlinear mapping into a high-dimensional space and then back down through a linear transformation should sound familiar. This is the same model for a multilayer perceptron. The difference is that here an MLP is being used for unsupervised learning. How does this work? We are given a set of input values (samples of the latent space) and a set of output values (data), but we don't know which inputs go with which outputs. So we guess the correspondence, apply the normal training algorithm (least squares), and iterate. This is a handy trick for when you have a black box for supervised learning and want to apply it to unsupervised learning.

Note that the GTM does not modify the activations during fitting. Using the MLP analogy above, this should be easy to do, though in practice we may not want to. The reason is that if the mapping is too powerful, it can satisfy any correspondence, making it useless for recovering structure in the data. The activations might be learned in a separate supervised stage, as in the style/content paper.

The authors note that the M-step of the GTM algorithm can be decomposed into a centroid computation followed by a matrix multiplication. However, they incorrectly claim that this is a smoothing matrix, as in the SOM (cf 4/28). The matrix is idempotent, hence it is a projection matrix, not a smoothing matrix. Sure, the rows of the matrix look like bumps, but this is because of the particular activation functions chosen.

The activation functions chosen by the authors are actually very important to get SOM-like behavior. A point in the latent space is mapped into a vector containing effectively the distance between it and a set of canonical points. If the latent space is one-dimensional, then as we move in the latent space our activation will trace out a parabola (actually, it is a Gaussian, but no matter). For different canonical points, the activation may shift or stretch, but it will still be a parabola. Thus the basis functions for representing the manifold are translates and dilates of a parabola. Linear combinations of these give a quadratic spline. This is why the GTM always produces smooth manifolds in a snake-like way. Other choices of activation functions need not be this way.

The initialization of the GTM is also important for good performance. The authors propose computing the best-fit D-dimensional PCA hyperplane in the input space if the latent space has D dimensions. This hyperplane is then regularly sampled and the samples are placed in correspondence with the sampling of the latent space. Thus the GTM "snake" is first unraveled and spread out across the data and then allowed to curl up to fit the deviations from the PCA hyperplane. The sinewave example is a particularly good demonstration of this. Unfortunately, if the snake is initialized in a slightly different way, e.g. if it is not stretched completely across the data, it will fall into a tangled-up local minimum.

The authors claim that by estimating the variance of the Gaussians during convergence, they achieve a kind of automatic annealing. However, annealing is typically used to avoid local minima, as in the SOM, while in the GTM this automatic annealing seems to make it much more susceptible. The reason is that the variance parameter seems to decay very fast, gluing the centers to any data points they happen to be near to at initialization.

#### Paper #2

Luttrell's paper attempts to formalize the SOM using a noisy encoding argument, mentioned last time. The idea is to minimize the squared error between x and its reconstruction. Our only degree of freedom is the probabilistic mapping from x to y; the map from y to x is constrained to be the Bayes' rule inverse. Such an encoder/decoder constraint is what Luttrell calls a folded Markov chain. Furthermore, the noise model from y to y' is also fixed.

If there is no noise, Luttrell shows that the optimal mapping from x to y is deterministic: always pick the y whose centroid is closest to x. This is the familiar nearest-neighbor encoding rule in vector quantization. If there is noise, the optimal mapping is still deterministic, however the optimal y is not the nearest neighbor anymore. Instead, we want to minimize the distance between x and all of the centroids that it might be reconstructed from (due to noise). This is called the "minimum distortion" encoding rule.

There is a useful identity which says that the squared distance between a point and a set of points is the same as the squared distance between the point and the mean plus the variance of the points about the mean. For each cluster y we have a (fuzzy) set z of clusters it might be corrupted to. The minimum distortion rule is therefore trying to pick a y whose z has the smallest variance and has a centroid close to x. Thus a y might be very close to x but have such a large noise variance that it is not the best choice.

That takes care of the encoding rule. What about decoding, i.e. where are the centers? It turns out that the rule for recomputing a center is not simply to take the average of all points assigned to that center. Instead, some of the points assigned to other centers enter into the average. This makes sense because points assigned to another center could be mistakenly encoded as being in my center, because of the noise. The center update rule ends up having exactly the same form as the batch SOM update rule given in Mulier and Cherkassky, where the "kernel" is now a row of the noise transition matrix.

This gives us confidence in saying that we have found a Bayesian interpretation of the SOM. Unfortunately, there is a problem. While the centroid update rule is the same, the assignment rule is different. Should the batch SOM be using a minimum distortion rule? Seems like not, since then nodes with lots of connections, i.e. high noise, would be disfavored. Someone should try and find out.

One may ask why it is useful to put the SOM in a Bayesian framework. One reason is meta-learning: by expressing the algorithm's prior knowledge with a density, we have a chance of learning that density. For example, Fritzke's growing neural gas might be interpretable as "learning the noise model." Another reason is that once we have an objective function, we can apply other well-known algorithms for optimization, such as gradient descent or deterministic annealing. Finally, statistical models are modular in the sense that they can easily be embedded into mixtures or hierarchies.

Luttrell exploits this last benefit by giving an analysis of a "partitioned" folded Markov chain, where different dimensions of the input have different quantizers. This is similar to the "modular eigenspaces" used in face coding. However, Luttrell also has a noise model which corrupts the encoding as a whole, not in pieces. For example, the encoding of two partitions might be swapped. Thus the partitions become dependent because of the noise, and must be able to encode the others' data to some extent.

The folded Markov chain bears some similarity to the Helmholtz machine. However, as pointed out by Josh, while the Helmholtz machine is trained so that the encoder and decoder become Bayes' rule inverses, they are not constrained to be so. Thus the Helmholtz machine lies between a folded Markov chain and an autoassociator network with separate encoding and decoding stages.

Possible future topics:
• Temporal models like recurrent neural networks.
• Stuart Russell's gradient algorithm for training Bayesian networks.
• Bayesian methods for training multilayer perceptrons.
• Boosting and weighted majority (techniques derived from computational learning theory).

Thomas Minka