Lecture 24, April 7th, 2016: Variational Auto-Encoders (VAEs) and Generative Adversarial Networks (GANs)

In this lecture we will discuss two modern generative models: variational auto-encoders (VAEs) and generative adversarial networks (GANs).

Please study the following material in preparation for the class:

  • Part of Chapter 20 (sec. 20.9 to 20.10) of the Deep Learning Textbook (deep generative models).
  • Slides on deep generative modeling (26 to 41)



37 thoughts on “Lecture 24, April 7th, 2016: Variational Auto-Encoders (VAEs) and Generative Adversarial Networks (GANs)”

      • The result will depend on your architecture in the two cases, 1×1 or 2×2. If you are simply changing the size of the final kernel to yield a bigger output map, then yes, it will result in fewer operations, but not by much.


      • As said in class today, the architecture should depend on what you want to do. If the final map of the convolutional part of your network is of size 1×1, then the resulting features have been influenced by the entire input image (i.e. they are “global” features). For a task such as classification, I think it is more usual to use global features. But for another task, it might be useful to have a bigger map after the last convolutional layer (detection maybe?).


      • Hi Danlan,
        Actually, if you are doing classification, you will have to use the feature maps of the last conv layer as features and feed them into the final MLP layer to classify, so it’s better to set the size to 1×1. Those features are highly abstract due to the depth of the CNN, which might help performance. If you are doing other tasks, such as recognizing parts of a picture, you may want to set the last layer’s feature maps to a larger size.


  1. A VAE uses a latent variable z which has some prior p(z) (often a normalized Gaussian), but I don’t see why this prior should hold when we infer z from x. In fact, I think those two distributions (the true prior and the distribution we get when inferring from x) should be different, because no term in the cost imposes that constraint. The divergence term seems to do it, but only to some extent. Finally, when generating, we sample from the prior of z, which ends up not being the real distribution of z and might cause some problems. Am I right?
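
For reference, here is a minimal sketch of the divergence term mentioned above. It assumes a diagonal Gaussian posterior q(z|x) and a standard normal prior, which is a common choice but is not stated in the comment. The function name and the example values are hypothetical. The term is zero only when the per-example posterior equals the prior, and grows as the two differ; the sketch says nothing about how strongly training actually enforces the match.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var, axis=-1)

# Hypothetical encoder outputs for a 2-dimensional latent variable.
kl_match = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])   # posterior equals the prior
kl_shift = kl_to_standard_normal([1.0, -2.0], [0.0, 0.0])  # shifted mean, unit variance
```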

  2. Could you train a better discriminator in a GAN by giving the discriminator noised/blurred xs and pure noise (i.e. amount of noise could be the target value), in addition to generated xs, that it has to distinguish from real xs? You wouldn’t backprop through the noisy xs to the generator, but I still think it could force the generator to be better, because if it generated an example that looked like a blurry x, the discriminator wouldn’t be fooled, and you’d maybe get a gradient that would encourage sharper/less noisy images. Has this been tried?


    • If I understand correctly, you’re saying that, instead of training a discriminator on two classes: real vs. fake, you train a discriminator on three classes:

      “real vs. fake vs. blurry versions of real images”.

      The generator will still be optimized to make the discriminator think that its fake images are real, but the discriminator is trained to classify between the three classes?


    • To summarize what Yoshua said in class: this could work, but it’s possible that the discriminator finds some hack to determine the amount of noise, and that is not very useful to train the generator.

      My intuition is along similar lines: I don’t see how this setup would allow the generator to capture any global structure of the image. For example, if you add Gaussian pixel noise, the discriminator can simply look at small local correlations between pixels to predict the amount of noise. So the generator just has to come up with something that’s ‘fairly smooth’ – it doesn’t actually have to produce things that look like the training data.


      • Also, it was pointed out that, depending on the type of noise, the discriminator could use some very simple hacks (e.g. the mean or variance of the pixels) to figure out whether it’s a real or fake image.


      • Actually, this gets me thinking – is there a form of parametric noise that could be added that could lead to the discriminator having to account for global structure to properly calculate it? For example, some kind of rotation of a random circular crop somewhere in the image?


      • Okay, thinking about it even more (apologies for the multi-post), I’m really skeptical that any of these possibilities would do better than the original GAN.

        The best possibility that I can think of is to augment the usual GAN with an extra output on the discriminator – so, you are predicting both whether the input image comes from the model or training data, and also how much noise is in the image (you could also add noise to the model generated images). This could maybe act as some form of regularizer or something (but that’s a big maybe)


      • This could probably be done in a DCGAN. Most DCGANs produce images from a random vector which is generated on the fly. Our situation is slightly different: we are given a grainy image, and would like the generator to produce an image that not only fools our discriminator but also looks like the original grainy image, to then be fed to another discriminator. We basically replace the first generator with an autoencoder-like structure. The grainy image goes in one end, passes through a series of convolutional features, and is then reconstructed with the upscaling structure used in DCGAN. We then downsample the generated image and compare it to the grainy version. Discriminator 2 would then compare that last image with the clean picture.


    • You start with an equation for an individual hidden unit (eq. 20.14), and multiply over all the hidden units to get the equation for the full hidden layer, hence the product over j. The term 2h – 1 is 1 for a hidden unit equal to 1 and -1 for a hidden unit equal to 0. Using the sigmoid identity sigmoid(-x) = 1 – sigmoid(x), this is essentially syntactic sugar to write the probability for h = 1 or h = 0 as a single expression. Finally, the tensor multiplication is due to the fact that all states are being considered, not a single h_j.
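
The 2h – 1 trick can be checked numerically. This is a small sketch under assumptions: the pre-activations `a` are hypothetical numbers standing in for whatever the argument of the sigmoid is in eq. 20.14, and the function names are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_hidden_given_visible(h, a):
    """Product over j of sigmoid((2*h_j - 1) * a_j), where a_j is the pre-activation of unit j."""
    h = np.asarray(h, dtype=float)
    a = np.asarray(a, dtype=float)
    return np.prod(sigmoid((2.0 * h - 1.0) * a))

a = np.array([0.3, -1.2, 2.0])   # hypothetical pre-activations
h = np.array([1, 0, 1])          # one binary hidden configuration

compact = p_hidden_given_visible(h, a)
# Same probability written case by case: sigmoid(a_j) if h_j = 1, else 1 - sigmoid(a_j).
explicit = np.prod(np.where(h == 1, sigmoid(a), 1.0 - sigmoid(a)))
```

Both `compact` and `explicit` compute the same number, which is the point of the single-expression form.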


  3. I’ve seen in some article that it is possible to force a VAE to learn specific features in its latent variables (e.g. the angle of an object). I think it would be interesting to go over how this is done in class. Also, could other methods use the same trick to learn specific latent variables?

    • Slide 26 explains it well. VAE is a sort of generative black box. Before we were trying to approximate data generation using p(x) that we tried to parametrize. VAE parametrizes a black box (aka “machine”) which approximates data generation. Since the former is not really dependent on p(x), VAE makes data generation applicable to a larger range of probabilistic model families (even in the presence of latent variables that are intractable in the posterior distributions)


  4. The textbook describes in words that in VAEs, “Learning then consists solely of maximizing \mathcal{L} with respect to the parameters of the encoder and decoder. All of the expectations in \mathcal{L} may be approximated by Monte Carlo sampling.”

    Could you please write down the mathematical expression for the loss we’re trying to minimize through gradient descent, including in terms of \theta? I expect it will look something like

    \displaystyle \textrm{Loss} \approx \frac{1}{M} \sum\limits_{i=1}^{M} \ldots

    given that we’re using Monte Carlo sampling.
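
One common way to write such a loss is sketched below. It is not taken from the textbook passage quoted above. It assumes an encoder q_\phi(z \mid x), a decoder p_\theta(x \mid z), a prior p(z), a minibatch of M examples and K samples of z per example. The symbols \phi, K and z^{(i,k)} are notation introduced here for illustration.

```latex
\mathrm{Loss}(\theta, \phi) \approx -\frac{1}{M} \sum_{i=1}^{M}
  \left[ \frac{1}{K} \sum_{k=1}^{K} \log p_\theta\!\left(x^{(i)} \mid z^{(i,k)}\right)
  - D_{\mathrm{KL}}\!\left(q_\phi\!\left(z \mid x^{(i)}\right) \,\middle\|\, p(z)\right) \right],
\qquad z^{(i,k)} \sim q_\phi\!\left(z \mid x^{(i)}\right)
```

Minimizing this quantity corresponds to maximizing \mathcal{L}. When q_\phi and p(z) are both Gaussian the KL term has a closed form; otherwise it can also be estimated from the same samples.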


  5. assyatrofimov says:

    Could you please explain how GSNs / DAEs add noise?
    I’m not sure how this affects performance in the end. Isn’t that taken care of higher up?


      • I found a pretty good explanation from Kyunghyun Cho, January 2013, “Boltzmann Machines and Denoising Autoencoders for Image Denoising”:

        “Unlike an ordinary autoencoder, a DAE explicitly sets some of the components of an input vector randomly to zero during learning […]. [This] adds noise to an input vector. It is usual to combine two different types of noise, which are additive isotropic Gaussian noise and masking noise [Vincent et al., 2010]. The first type adds a zero-mean Gaussian noise to each input component, while the masking noise sets a set of randomly chosen input components to zeros. Then, the DAE is trained to denoise the corrupted input.”
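
The two noise types in the quote can be sketched as follows. This is an illustration only: the function name `corrupt`, the noise level `gaussian_std` and the masking probability `mask_prob` are hypothetical choices, not values from the cited paper.

```python
import numpy as np

def corrupt(x, rng, gaussian_std=0.1, mask_prob=0.25):
    """Return x with additive zero-mean Gaussian noise, then masking noise applied."""
    x = np.asarray(x, dtype=float)
    noisy = x + rng.normal(0.0, gaussian_std, size=x.shape)
    keep = rng.random(x.shape) >= mask_prob   # True where the component is kept
    return noisy * keep                       # masked components become zero

rng = np.random.default_rng(0)
x = np.ones((1000,))                 # stand-in for a clean input vector
x_tilde = corrupt(x, rng)            # corrupted input fed to the DAE
masked_fraction = np.mean(x_tilde == 0.0)
```

The DAE would then be trained to reconstruct `x` from `x_tilde`.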

      • Noise is good in the sense that values that are close together will be treated as being in the same range (due to the noise), thereby not carrying different information (i.e. they will likely not reconstruct very different inputs).

      • Is your question specifically about GSN/DAE?
        More generally, yes, there is such a thing as “good noise”! And that’s the cool/wonderful thing 🙂
        A lot of models can benefit from adding some “good noise”. (not just GSN/DAEs)
        One very impressive example (I think) is the following paper
        which shows that adding gradient noise in deep networks improves both optimization and regularization.

        Quote of Yoshua about this paper:
        “It’s really an optimization trick but it regularizes by preferring large-basin areas (otherwise the noise would kick you out of there).”

      • Vincent says:

        These are my notes from class, specifically on Benjamin’s comment:

        The noise in this paper has more to do with optimization procedures (like simulated annealing). Adding noise to the gradient has the effect of regularizing and also helps optimization to get out of local minima.

        Noise in GSNs/DAEs has more to do with carving a good objective function. It helps the model learn what is important (that is, the uncorrupted data point).
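
To make the gradient-noise idea from these notes concrete, here is a toy sketch of gradient descent with Gaussian noise added to each gradient. The annealing schedule for the noise scale, the hyperparameters `eta` and `gamma`, and the function name are all assumptions made for illustration and are not taken from the paper discussed above. It only shows the mechanics on a simple quadratic; it does not demonstrate the regularization or local-minimum claims.

```python
import numpy as np

def noisy_gradient_descent(grad_fn, w0, steps=500, lr=0.05, eta=0.3, gamma=0.55, seed=0):
    """Gradient descent where Gaussian noise with a decaying scale is added to each gradient."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    for t in range(steps):
        sigma = np.sqrt(eta / (1.0 + t) ** gamma)   # assumed annealing schedule
        g = grad_fn(w) + rng.normal(0.0, sigma, size=w.shape)
        w -= lr * g
    return w

# Toy objective f(w) = ||w||^2, whose gradient is 2w and whose minimum is at 0.
w_final = noisy_gradient_descent(lambda w: 2.0 * w, w0=[3.0, -4.0])
```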

      • I remember where I saw the explanation of why noise is good. Check out Karol Gregor’s lecture on VAE (explains it at t=10min)

        Basically, without noise, we may reconstruct two different outputs x1_r and x2_r from two different inputs x1 and x2. With noise, we create overlap in the latent variables (before reconstruction), since z1 may equal z2 if z1 = u1 + noise1 and z2 = u2 + noise2 are within each other’s noise range. If the latent variables overlap, a single x_r will be reconstructed. This therefore compresses the data.

      • Thanks for the link Jonathan!

        Looking at that part in the video, another question comes to me: is the reparametrization trick (setting z = mu + sigma * epsilon) exactly equivalent to the regular sampling of z~Normal(mu,sigma)? I always thought it was, but if epsilon is sampled uniformly then it would look more like a cylindrical shape, no?
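
A quick numerical check of this question, as a sketch with hypothetical values for mu and sigma: when epsilon is drawn from a standard normal, z = mu + sigma * epsilon is distributed as Normal(mu, sigma), so its samples should have the same mean and standard deviation as direct samples. If epsilon were drawn uniformly instead, z would be bounded and no longer Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.5, 0.5, 200_000      # hypothetical values

eps = rng.standard_normal(n)          # epsilon ~ Normal(0, 1)
z_reparam = mu + sigma * eps          # reparametrized sample
z_direct = rng.normal(mu, sigma, n)   # direct sample from Normal(mu, sigma)

# With a uniform epsilon, z is uniform on [mu - sigma, mu + sigma], not Gaussian.
eps_uniform = rng.uniform(-1.0, 1.0, n)
z_uniform = mu + sigma * eps_uniform
```

So the equivalence holds when epsilon is sampled from a standard normal, which is the usual setup for a Gaussian q(z|x).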
