Lectures

Lecture 24, April 7th, 2016: Variational Auto-Encoders (VAEs) and Generative Adversarial Networks (GANs)

In this lecture we will discuss two modern generative models, the variational auto-encoders (VAEs) and the generative adversarial networks (GANs).

Please study the following material in preparation for the class:

Part of Chapter 20 (sec. 20.9 to 20.10) of the Deep Learning Textbook (deep generative models).
Slides on deep generative modeling (26 to 41)

37 thoughts on “Lecture 24, April 7th, 2016: Variational Auto-Encoders (VAEs) and Generative Adversarial Networks (GANs)”

Danlan Chen says:

I have a question about convolutional neural network training, most cnn train with the output of 11 output map, but there are also 22 output map, does it influence the training performance? (Like described in overfeatnet paper figure 5 http://arxiv.org/pdf/1312.6229.pdf)

LikeLike

April 6, 2016 at 12:25 pm Reply
- Danlan Chen says:
  
  11 i mean 1 by 1, 22 i mean 2 by 2
  
  LikeLike
  
  April 6, 2016 at 3:06 pm Reply
  - tisu32 says:
    
    The result will depend on your architecture in the two cases 11 or 22. If simply you are changing the size of the final kernel to yield a bigger output map, then yes it will result in less operations, but not by much.
    
    LikeLike
    
    April 11, 2016 at 1:04 pm
  - Guillaume Berger says:
    
    As said in class today, the architecture should depend on what you want to do. If the final map of the convolutional part of your network is of size 1by1, then it means that the resulting features have been influenced by the entire input image (ie “global” features). For a task such as classification, I think it is more usual to use global features. But for another task, it might be useful to have a bigger map after the last convolutional layer (detection maybe?).
    
    LikeLike
    
    April 11, 2016 at 9:12 pm
  - Yifan Nie says:
    
    Hi Danlan,
    Actually if you are doing classifications, you will have to use the last conv layer feature maps as features, and feed them into the last layer MLP to do classifications. so it’s better to set to $1*1$ . And those features are highly abstract due to the depth of the CNN , which might help the performance. If you are doing other tasks such as picture part recognition, you may want to set the last layer feature maps to a larger size.
    
    LikeLike
    
    April 12, 2016 at 1:41 pm
Thomas George says:

This is a question to the class : Has anyone tried using a DCGAN for the vocal synthesis project ?

LikeLike

April 6, 2016 at 2:01 pm Reply
- Anirudh goyal says:
  
  Alex Lamb and I are working on it. Our generator comprises of LSTM layers, and discriminator contains a mix of convolutional and GRU layers.
  
  LikeLiked by 1 person
  
  April 7, 2016 at 12:28 am Reply
  - Anirudh goyal says:
    
    *LSTM layers.
    
    LikeLike
    
    April 7, 2016 at 12:33 am
  - ryanlowe says:
    
    Cool! Do you guys have any code up for it?
    
    LikeLiked by 1 person
    
    April 11, 2016 at 10:19 pm
  - Jonathan says:
    
    Very interesting
    
    LikeLike
    
    April 14, 2016 at 5:00 am
Mathieu Duchesneau says:

VAE uses a latent variable z which as some prior p(z) (often a normalized Gaussian) but I don’t see why this prior should hold when we infer z from x. In fact, I think those two distributions (the true prior and the distribution we get when inferring from x) should be different because no term in the cost is used to impose that constrain. The divergence term seems to do it but only to some extends. Finally when generating we sample from the prior of z which end up being not the real distribution of z and might cause some problems. Am I right?

LikeLiked by 1 person

April 6, 2016 at 7:59 pm Reply
tegan says:

Could you train a better discriminator in a GAN by giving the discriminator noised/blurred xs and pure noise (i.e. amount of noise could be the target value), in addition to generated xs, that it has to distinguish from real xs? You wouldn’t backprop through the noisy xs to the generator, but I still think it could force the generator to be better, because if it generated an example that looked like a blurry x, the discriminator wouldn’t be fooled, and you’d maybe get a gradient that would encourage sharper/less noisy images. Has this been tried?

LikeLike

April 6, 2016 at 9:28 pm Reply
- thenuttynetter says:
  
  If I understand correctly, you’re saying that, instead of training a discriminator on two classes: real vs. fake, you train a discriminator on three classes:
  
  “real vs. fake vs. blurry versions of real images”.
  
  The generator will still be optimized to make the discriminator think that its fake images are real, but the discriminator is trained to classify between the 3-classes?
  
  LikeLike
  
  April 7, 2016 at 8:39 pm Reply
- ryanlowe says:
  
  To summarize what Yoshua said in class: this could work, but it’s possible that the discriminator finds some hack to determine the amount of noise, and that is not very useful to train the generator.
  
  My intuition is along similar lines: I don’t see how this setup would allow the generator to capture any global structure of the image. For example, if you add Gaussian pixel noise, the discriminator can simply look at small local correlations between pixels to predict the amount of noise. So the generator just has to come up with something that’s ‘fairly smooth’ – it doesn’t actually have to produce things that look like the training data.
  
  LikeLike
  
  April 11, 2016 at 2:26 pm Reply
  - ryanlowe says:
    
    Also, it was pointed out that depending on the type of noise, the discriminator could use some very simple hacks (eg. the mean or variance of the pixels) to figure out if it’s a real or fake image
    
    LikeLike
    
    April 11, 2016 at 10:21 pm
  - ryanlowe says:
    
    Actually, this gets me thinking – is there a form of parametric noise that could be added that could lead to the discriminator having to account for global structure to properly calculate it? For example, some kind of rotation of a random circular crop somewhere in the image?
    
    LikeLike
    
    April 11, 2016 at 10:23 pm
  - ryanlowe says:
    
    Okay, thinking about it even more (apologies for the multi-post), I’m really skeptical that any of these possibilities would do better than the original GAN.
    
    The best possibility that I can think of is to augment the usual GAN with an extra output on the discriminator – so, you are predicting both whether the input image comes from the model or training data, and also how much noise is in the image (you could also add noise to the model generated images). This could maybe act as some form of regularizer or something (but that’s a big maybe)
    
    LikeLike
    
    April 11, 2016 at 10:28 pm
  - Jonathan says:
    
    This could probably in a DCGAN. Most DCGANs produce images from a random vector which is generated on the fly. Our situation is slightly different – we are given a grainy image, and would like the generator to produce an image that not only fools our discriminator, but also looks like the original grainy image to then be fed to another discriminator. We replace the first generator basically with an autoencoder-like structure. The grainy image goes in one end, goes through a series of convolutional features, then is reconstructed with the upscaling structure used in DCGAN. We then downsample the generated image and compare it to the grainy version.. Discriminator 2 would then compare that last image with the clean picture.
    
    LikeLike
    
    April 14, 2016 at 5:15 am
Danlan Chen says:

I do not understand in the book chapter 20.2.1, how equation 20.14 comes to equation 20.15?

LikeLike

April 6, 2016 at 11:28 pm Reply
- tisu32 says:
  
  You start with an equation for an individual hidden unit (eq. 20.14), and multiply over all the hidden states to get the equation for the full hidden layer, hence the product over j. The 2h – 1 for a hidden of 1 will be 1, and a hidden of 0 will be -1. Using the sigmoid identity: sigmoid (-x) = 1 – sigmoid (x) this is essentially syntactic sugar to write the probability for h = 1 or 0 as a single expression. Finally the tensor multiplication is due to the fact that all states are being considered, not a single h_j.
  
  LikeLike
  
  April 7, 2016 at 9:16 am Reply
etienneift6266 says:

I’ve seen in some article that it is possible to force a VAE to learn specific features in its latent variables (e.g. the angle of an object). I think it would be interesting to go over how this is done in class. Also, could other methods use the same trick to learn specific latent variables?

LikeLiked by 1 person

April 7, 2016 at 6:11 am Reply
- etienneift6266 says:
  
  Here is an example of what I was talking about :
  
  Click to access 5851-deep-convolutional-inverse-graphics-network.pdf
  
  LikeLike
  
  April 7, 2016 at 6:14 am Reply
  - tisu32 says:
    
    Its an interesting trick. Maybe another way to get a similar result would be to truncate the network, learn a vector of targets for the latents, corresponding the features we want to learn, and the reconnect the upper layers and resume training. Not sure how I would actually make that work in practice.
    
    LikeLike
    
    April 11, 2016 at 1:15 pm
Jonathan says:

Why is the VAE defined for arbitrary computational graphs? Is this still true in the presence of latent variables with intractable posterior distributions?

LikeLike

April 7, 2016 at 7:28 am Reply
- Jonathan says:
  
  Slide 26 explains it well. VAE is a sort of generative black box. Before we were trying to approximate data generation using p(x) that we tried to parametrize. VAE parametrizes a black box (aka “machine”) which approximates data generation. Since the former is not really dependent on p(x), VAE makes data generation applicable to a larger range of probabilistic model families (even in the presence of latent variables that are intractable in the posterior distributions)
  
  LikeLike
  
  April 10, 2016 at 9:19 am Reply
Olexa Bilaniuk says:

The textbook describes in words that in VAEs, “Learning then consists solely of maximizing $\mathcal{L}$ with respect to the parameters of the encoder and decoder. All of the expectations in $\mathcal{L}$ may be approximated by Monte Carlo sampling.”

Could you please write down the mathematical expression for the loss we’re trying to minimize through gradient descent, including in terms of $\theta$ ? I expect it will look something like

$\displaystyle \textrm{Loss} \approx \frac{1}{M} \sum\limits_{i=1}^{m} \ldots$

given that we’re using Monte Carlo sampling.

LikeLike

April 7, 2016 at 8:06 am Reply
assyatrofimov says:

Help!
Could you please explain how GSNs / DAEs add noise?
I’m not sure how this affects performance in the end. Isn’t that taken care of higher up?

LikeLike

April 7, 2016 at 8:07 am Reply
- assyatrofimov says:
  
  Correction to my question:
  Is there such a thing as “good noise”? Is noise really helping?
  
  LikeLike
  
  April 7, 2016 at 10:05 am Reply
  - Jonathan says:
    
    I found a pretty good explanation from Kyunghyun Cho, January 2013, “Boltzmann Machines and Denoising Autoencoders for Image Denoising”:
    
    “Unlike an ordinary autoencoder, a DAE explicitly sets some of the components of an input vector randomly to zero during learning […]. [This] adds adds noise to an input vector. It is usual to combine two different types of noise, which are additive isotropic Gaussian noise and masking noise [Vincent et al., 2010]. The first type adds a zero-mean Gaussian noise to each input component, while the masking noise sets a set of randomly chosen input components to zeros. Then, the DAE is trained to denoise the corrupted input.”
    
    LikeLiked by 2 people
    
    April 10, 2016 at 9:32 am
  - Jonathan says:
    
    Noise is good in the sense that values that are close will be considered in the same range (due to noise) thereby not carry different information (i.e. will likely not reconstruct very different inputs)
    
    LikeLiked by 1 person
    
    April 10, 2016 at 10:08 am
  - ift6266benjamin says:
    
    Is your question specifically about GSN/DAE?
    More generally, yes, there is such a thing as “good noise”! And that’s the cool/wonderful thing 🙂
    A lot of models can benefit from adding some “good noise”. (not just GSN/DAEs)
    One very impressive example (I think) is the following paper
    http://arxiv.org/abs/1511.06807
    which shows that adding gradient noise in deep networks improves both optimization and regularization.
    
    Quote of Yoshua about this paper:
    “It’s really an optimization trick but it regularizes by preferring large-basin areas (otherwise the noise would kick you out of there).”
    
    LikeLiked by 2 people
    
    April 10, 2016 at 4:56 pm
  - Vincent says:
    
    This is my notes from class specifically on Benjamin’s comment :
    
    The noise in this paper has more to do with optimization procedures (like simulated annealing). Adding noise to the gradient has the effect of regularizing and also helps optimization to get out of local minima.
    
    Noise in GSN DAE has more to do with carving a good objectif function. It help to learn what is important (that is uncorrupted data point).
    
    LikeLiked by 1 person
    
    April 11, 2016 at 2:50 pm
  - Jonathan says:
    
    I remember where I saw the explanation of why noise is good. Check out Karol Gregor’s lecture on VAE (explains it at t=10min)
    
    Basically without noise, we may reconstruct two different x1_r and x2_r from two different inputs x1 and x2. With noise, we create overlap since in the latent variables (before reconstruction) since z1 may equal z2 if z1 = u1 + noise1 and z2= u2 + noise2 are within each other’s noise range. If the latent variables overlap, a single variable x_r will be reconstructed. This therefore compressed data.
    
    LikeLiked by 1 person
    
    April 11, 2016 at 8:31 pm
  - ryanlowe says:
    
    Thanks for the link Jonathan!
    
    Looking at that part in the video, another question comes to me: is the reparametrization trick (setting z = mu + sigma * epsilon) exactly equivalent to the regular sampling of z~Normal(mu,sigma)? I always thought it was, but if epsilon is sampled uniformly then it would look more like a cylindrical shape, no?
    
    LikeLiked by 1 person
    
    April 11, 2016 at 10:38 pm
  - ryanlowe says:
    
    Wow, never mind, epsilon is sampled from a Normal(0,I) as well. Not sure why I thought it was uniform… they should be exactly equivalent then.
    
    LikeLiked by 1 person
    
    April 11, 2016 at 10:40 pm
  - Jonathan says:
    
    That’s correct. Epsilon is sampled from a Normal(0,I).
    
    LikeLiked by 1 person
    
    April 12, 2016 at 10:46 am
assyatrofimov says:

Thanks for all the replies!
I have a much better understanding now

LikeLike

April 12, 2016 at 12:25 pm Reply

	X. Willhem on Lecture 2, Jan. 11, 2016
	Thomas George on Lecture 15, Feb. 25th, 2016: O…
	Vincent on Lecture 22, March 31st, 2016:…
	Vincent on Lecture 22, March 31st, 2016:…
	Jonathan on Lecture 24, April 7th, 2016: V…

IFT6266 H-2016 Deep Learning

Deep Learning, graduate class at U. Montreal

Lecture 24, April 7th, 2016: Variational Auto-Encoders (VAEs) and Generative Adversarial Networks (GANs)

37 thoughts on “Lecture 24, April 7th, 2016: Variational Auto-Encoders (VAEs) and Generative Adversarial Networks (GANs)”

Leave a reply to etienneift6266 Cancel reply

Share this:

37 thoughts on “Lecture 24, April 7th, 2016: Variational Auto-Encoders (VAEs) and Generative Adversarial Networks (GANs)”

Leave a reply to etienneift6266 Cancel reply