
# Lecture 22, March 31st, 2016: Approximate Inference

In this lecture we will continue our discussion of probabilistic modelling and turn our attention to approximate inference.

Please study the following material in preparation for the class:

• Chapter 19 of the Deep Learning Textbook on approximate inference.

In preparation for the following lecture, please study this paper as well, mentioned already in class:


## 15 thoughts on “Lecture 22, March 31st, 2016: Approximate Inference”

1. Has anyone thought about ways to optimize deep generative models so that they prefer matching the CDF of the true distribution over matching the PDF (as in maximum likelihood)? I think that estimating the correct CDF is important in domains where we care about risk or where we’re estimating quantities.

For example, suppose I model data drawn from a Poisson distribution, but my estimate moves all of the mass on even numbers to the nearest odd numbers. This estimate should have terrible likelihood (zero as soon as an even number is observed), but it might still do a fine job of matching the CDF. However, if I instead estimate the data with a Poisson of smaller variance, my likelihood could be fine, but the CDF at the tails will be completely wrong. This is because the CDF is an accumulation of the PDF, so a PDF that is consistently wrong in one direction (always over-estimating or always under-estimating) adds up to a larger CDF error than a PDF that makes a mix of over-estimation and under-estimation errors.
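The Poisson example can be checked numerically. The following is a minimal sketch, not a definitive experiment: the rate 4.0, the narrower rate 3.0, the truncation at 60, the choice to move each even `k` to `k + 1`, and the tail threshold 10 are all assumptions made for illustration.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability mass at integer k, computed in log space."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

LAM = 4.0    # rate of the "true" distribution (illustrative choice)
K_MAX = 60   # truncation point; the mass beyond it is negligible here

# One extra slot so that mass moved from k = K_MAX has somewhere to go.
true_pmf = [poisson_pmf(k, LAM) for k in range(K_MAX + 1)] + [0.0]

# Estimate A: move the mass on every even k to the odd number k + 1.
shifted_pmf = [0.0] * (K_MAX + 2)
for k, p in enumerate(true_pmf):
    shifted_pmf[k + 1 if k % 2 == 0 else k] += p

# Estimate B: a Poisson with a smaller rate, hence a smaller variance.
narrow_pmf = [poisson_pmf(k, 3.0) for k in range(K_MAX + 2)]

def expected_log_lik(p_true, q):
    """E_{k ~ p_true}[log q(k)]; -inf if q gives zero mass to a possible k."""
    total = 0.0
    for k, p in enumerate(p_true):
        if p == 0.0:
            continue
        if q[k] == 0.0:
            return -math.inf
        total += p * math.log(q[k])
    return total

def tail_prob(pmf, k):
    """P(X >= k), i.e. one minus the CDF just below k."""
    return sum(pmf[k:])

ll_shifted = expected_log_lik(true_pmf, shifted_pmf)
ll_narrow = expected_log_lik(true_pmf, narrow_pmf)
tail_true = tail_prob(true_pmf, 10)
tail_shifted = tail_prob(shifted_pmf, 10)
tail_narrow = tail_prob(narrow_pmf, 10)

print("expected log-likelihood:", ll_shifted, ll_narrow)
print("P(X >= 10):", tail_true, tail_shifted, tail_narrow)
```

In this setup the shifted estimate has an expected log-likelihood of minus infinity yet reproduces the tail probability, while the narrower Poisson has a finite log-likelihood and underestimates the tail several times over.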

I am curious about what can be done to remedy this situation. An obvious solution is to optimize for quantile loss or do an ordinal regression, but this doesn’t scale well with the dimensionality of the space we’re modeling.

For denoising autoencoders, I wonder if there’s a way to look at how large the gap is between input images and their reconstructions to quantify how much mass the autoencoder moves from low-density regions, and then learn to reweight the samples so that the CDF is more likely to be calibrated.


• Interesting idea. Do you know of any algorithm that aims to match the CDF defined by the model with the true CDF?
To me it looks like a very hard task for high-dimensional data. I don’t see any obvious and efficient way to compute or estimate the CDF.


• The only approaches that I know of are ordinal regression and quantile regression. Ordinal regression involves picking a set of cutoffs $1, 2, 3, 4, \ldots, n$ and then learning a classifier for each of $p(x \le 1), p(x \le 2), \ldots$, which implicitly defines a CDF. With neural networks these classifiers could all share parameters.

In quantile regression you instead run a regression for each quantile, using a specialized loss (the pinball loss) that encourages the output to correspond to the inverse CDF at that particular quantile. This learns an inverse CDF, which is equivalent to learning a CDF.
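To make the quantile loss concrete, here is a small sketch of the pinball loss with a brute-force fit of a single constant. The helper names `pinball_loss` and `fit_quantile`, the toy data, and the grid search are illustrative assumptions; a real model would minimise the same loss by gradient descent.

```python
def pinball_loss(prediction, target, tau):
    """Quantile (pinball) loss for a single example and quantile level tau."""
    diff = target - prediction
    # Under-prediction is weighted by tau, over-prediction by (1 - tau).
    return tau * diff if diff >= 0 else (tau - 1.0) * diff

def fit_quantile(samples, tau, candidates):
    """Return the candidate constant with the smallest total pinball loss."""
    return min(
        candidates,
        key=lambda q: sum(pinball_loss(q, y, tau) for y in samples),
    )

# Toy data: the integers 0..100, so the empirical quantiles are easy to read.
samples = list(range(101))
q90 = fit_quantile(samples, 0.9, samples)
q50 = fit_quantile(samples, 0.5, samples)
print(q90, q50)  # 90 50
```

Minimising this loss for a grid of `tau` values gives points on the inverse CDF, which is the sense in which quantile regression learns a CDF.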

I think that it’s worth thinking about the multivariate case with a real-world example. Suppose that you want to build a weather forecasting system that models rainfall one week in the future over all of North America. One reasonable question is: “What is the probability that Montreal will have an average rainfall >= 3 inches?” Likewise, we might want to ask the system: “What is the probability that South Quebec will have an average rainfall >= 3 inches?” These are both questions that we can ask about different subsets of the overall multivariate CDF.

I think that we can try to come up with a model whose CDF will tend to be calibrated, even if actually computing the CDF in higher dimensions is intractable.


2. Martin Lavoie says:

Not a very specific question, but could you explain a bit how approximate inference plugs into the training of a model?


• Vincent says:

Here is what was written on the board:

$\frac{\partial \log P(X)}{\partial \theta} = \frac{\sum_H \frac{\partial P(X,H)}{\partial \theta}}{\sum_H P(X,H)}$

$= \sum_H \frac{P(X,H)}{P(X)} \frac{\partial \log P(X,H)}{\partial \theta}$

$= \sum_H P(H | X) \frac{\partial \log P(X,H)}{\partial \theta}$

$P(X) = \sum_H P(X,H)$

$P(Y|X) = \sum_H P(Y,H|X) = \sum_H P(Y|H,X) P(H|X)$
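The gradient identity on the board can be sanity-checked numerically. The sketch below assumes a tiny made-up model with one binary $X$, one binary $H$ and a single scalar parameter $\theta$; the score function and the use of finite differences are illustrative choices, not something from the lecture.

```python
import math
from itertools import product

def joint(theta):
    """P(X, H) for binary X and H under a hypothetical log-linear model."""
    scores = {
        (x, h): theta * x * h + 0.5 * x - 0.3 * h
        for x, h in product((0, 1), repeat=2)
    }
    z = sum(math.exp(s) for s in scores.values())
    return {xh: math.exp(s) / z for xh, s in scores.items()}

def log_marginal(theta, x):
    """log P(X = x) = log sum_H P(X = x, H)."""
    p = joint(theta)
    return math.log(sum(p[(x, h)] for h in (0, 1)))

def central_diff(f, theta, eps=1e-5):
    """Central finite-difference approximation of df/dtheta."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

def lhs(theta, x):
    """d log P(X) / d theta, differentiated directly."""
    return central_diff(lambda t: log_marginal(t, x), theta)

def rhs(theta, x):
    """sum_H P(H | X) * d log P(X, H) / d theta."""
    p = joint(theta)
    p_x = sum(p[(x, h)] for h in (0, 1))
    total = 0.0
    for h in (0, 1):
        posterior = p[(x, h)] / p_x
        dlog_joint = central_diff(
            lambda t: math.log(joint(t)[(x, h)]), theta
        )
        total += posterior * dlog_joint
    return total

print(lhs(0.7, 1), rhs(0.7, 1))
```

The right-hand side is where inference enters training: it needs the posterior $P(H|X)$, and when that posterior is intractable an approximation to it is substituted.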


• Hello Vincent,
I’m reviewing for the course, and I cannot quite follow the step from the first line of this derivation to the second. Could you please explain where the $\frac{P(X,H)}{P(X)}$ comes from? Thanks a lot.


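• Since $\frac{\partial \log P}{\partial \theta} = \frac{1}{P} \frac{\partial P}{\partial \theta}$, we have $\frac{\partial P}{\partial \theta} = P \frac{\partial \log P}{\partial \theta}$.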


• Vincent says:

To Yifan: you use the following property:

$\frac{\partial P(X,H)}{\partial \theta} = P(X,H) = \frac{\partial \log P(X,H)}{\partial \theta}$


• Vincent says:

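Small typo in my answer: the second “=” is supposed to be a product, i.e. $\frac{\partial P(X,H)}{\partial \theta} = P(X,H) \frac{\partial \log P(X,H)}{\partial \theta}$.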


3. I have two questions:

Question 3 of the 2014 exam asks whether early stopping can be applied when the loss cannot be computed cheaply, as in the case of RBMs. For RBMs (at least the simple ones we’ve looked at), we can easily compute the likelihood $p(x)$, since its computation is linear in the number of hidden units rather than exponential, which is what you would expect from writing out the marginalisation naively, e.g. $p(x) = \sum_{h} p(x,h)$. Am I missing something obvious here? Is there a case in which computing $-\log p(x)$ would be more expensive?

In the paper “What Regularized Auto-Encoders Learn from the Data-Generating Distribution” (http://arxiv.org/pdf/1211.4246v5.pdf), I am not able to follow the proof on page 22, which shows the relationship between the contractive penalty and the denoising criterion. The Taylor expansion makes sense to me, but I am not sure how you go from the first equation (plugging the Taylor expansion into the loss function) to the second equation (expanding out the terms).


• My mistake was thinking that the loss was intractable because of the number of different configurations of the hidden states; in this case it is simply the partition function $Z$ that makes it expensive. You can’t use the unnormalised loss because $Z$ is a function of the parameters of the RBM.
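A small sketch of where the cost sits, assuming a binary RBM with energy $E(v,h) = -b^\top v - c^\top h - v^\top W h$. The sizes and parameter values are made up for illustration: the sum over hidden states has a closed form that is linear in the number of hidden units, while $Z$ is computed here by enumerating every visible configuration, which is only feasible because the model is tiny.

```python
import math
from itertools import product

# Hypothetical parameters: 3 visible units, 2 hidden units.
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
b = [0.1, -0.1, 0.2]   # visible biases
c = [0.0, 0.3]         # hidden biases

def log_unnormalized(v):
    """log sum_h exp(-E(v, h)) in closed form, linear in the hidden units."""
    total = sum(bi * vi for bi, vi in zip(b, v))
    for j in range(len(c)):
        pre = c[j] + sum(v[i] * W[i][j] for i in range(len(v)))
        total += math.log1p(math.exp(pre))  # log(1 + exp(pre))
    return total

def log_unnormalized_brute(v):
    """Same quantity by enumerating all 2^n_hidden hidden configurations."""
    total = 0.0
    for h in product((0, 1), repeat=len(c)):
        neg_energy = (
            sum(b[i] * v[i] for i in range(len(b)))
            + sum(c[j] * h[j] for j in range(len(c)))
            + sum(v[i] * W[i][j] * h[j]
                  for i in range(len(b)) for j in range(len(c)))
        )
        total += math.exp(neg_energy)
    return math.log(total)

def log_partition():
    """log Z by enumerating all 2^n_visible visible configurations."""
    return math.log(sum(
        math.exp(log_unnormalized(v))
        for v in product((0, 1), repeat=len(b))
    ))

def log_likelihood(v):
    return log_unnormalized(v) - log_partition()

print(log_likelihood((1, 0, 1)))
```

Because `log_partition` changes whenever `W`, `b` or `c` change, comparing unnormalised values across training steps does not tell you whether the likelihood went up.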


4. Dzmitry Bahdanau says:

In cases where exact inference can be done, e.g. Gaussian mixture models, people generally use the EM algorithm. On the other hand, as far as I understand, general optimization methods (e.g. gradient descent) can often be applied in such situations as well. Is there a good justification for using EM whenever it can be used? In particular, do you know what will happen if I try to train a Gaussian mixture model with gradient descent?
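As a small experiment on this question, the sketch below trains only the means of a 1-D mixture of two Gaussians by gradient ascent on the average log-likelihood. Everything here is an assumption made for illustration: the toy data, equal mixing weights, unit variances, the initial means and the learning rate. It is not a general answer about EM versus gradient methods.

```python
import math

DATA = [-2.5, -2.0, -1.5, 1.5, 2.0, 2.5]  # toy 1-D data, two clear clusters

def responsibilities(x, means):
    """Posterior P(component | x) for equal weights and unit variances."""
    weights = [math.exp(-0.5 * (x - m) ** 2) for m in means]
    total = sum(weights)
    return [w / total for w in weights]

def mean_log_lik(means):
    """Average log-likelihood of DATA under the mixture."""
    norm = 1.0 / (len(means) * math.sqrt(2.0 * math.pi))
    return sum(
        math.log(norm * sum(math.exp(-0.5 * (x - m) ** 2) for m in means))
        for x in DATA
    ) / len(DATA)

def gradient_step(means, lr):
    """One gradient-ascent step on the means.

    The gradient for mean k is the responsibility-weighted average of
    (x - mean_k), so exact posterior inference appears here as well.
    """
    grads = [0.0] * len(means)
    for x in DATA:
        for k, r in enumerate(responsibilities(x, means)):
            grads[k] += r * (x - means[k]) / len(DATA)
    return [m + lr * g for m, g in zip(means, grads)]

means = [-0.5, 0.5]
history = [mean_log_lik(means)]
for _ in range(200):
    means = gradient_step(means, lr=1.0)
    history.append(mean_log_lik(means))

print(means, history[0], history[-1])
```

In this restricted setting each gradient step moves a mean part of the way toward the value the EM M-step would assign, so both procedures head for the same stationary points; differences show up mainly in step-size tuning and in handling constrained parameters such as variances and mixing weights.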


5. Bonjour/Hi,
I was reviewing this chapter of the course, and I cannot quite understand Section 19.3 (MAP inference), equation 19.11: $q(h|v)=\delta (h-v)$. The book says: “We can derive MAP inference as a form of approximate inference by requiring q to take on a Dirac distribution.” Why is this specific distribution (Dirac) chosen here? Thanks a lot.


• I think that the Dirac distribution is just a spike at a single point. In this case it’s saying that $q(h|v)$ has all of its mass at $h = v$ and none anywhere else.
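A short worked version of why a spike leads to MAP inference, as a sketch: it assumes the evidence lower bound $\mathcal{L}$ from Chapter 19 and writes the location of the spike as $\mu$, a symbol chosen here for illustration.

$\mathcal{L}(v, \theta, q) = \mathbb{E}_{h \sim q}[\log p(h, v)] + H(q)$

$q(h|v) = \delta(h - \mu) \Rightarrow \mathbb{E}_{h \sim q}[\log p(h, v)] = \log p(h = \mu, v)$

$\mu^* = \arg\max_\mu \log p(h = \mu, v) = \arg\max_\mu p(h = \mu | v)$

The entropy term does not vary with $\mu$, so it drops out of the maximisation. Restricting $q$ to a single spike therefore turns maximising the bound into finding the single most probable $h$, which is exactly MAP inference.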
