In this lecture we will discuss two modern generative models: variational autoencoders (VAEs) and generative adversarial networks (GANs).
Please study the following material in preparation for the class:
- Part of Chapter 20 (sec. 20.9 to 20.10) of the Deep Learning Textbook (deep generative models).
- Slides on deep generative modeling (26 to 41)
I have a question about convolutional neural network training. Most CNNs are trained with a 1x1 output map, but some use a 2x2 output map. Does that influence training performance? (As described in Figure 5 of the OverFeat paper: http://arxiv.org/pdf/1312.6229.pdf)
To clarify, by 1x1 I mean 1 by 1, and by 2x2 I mean 2 by 2.
The result will depend on your architecture in the two cases, 1x1 or 2x2. If you are simply changing the size of the final kernel to yield a bigger output map, then yes, it will result in fewer operations, but not by many.
As said in class today, the architecture should depend on what you want to do. If the final map of the convolutional part of your network is of size 1x1, then the resulting features have been influenced by the entire input image (i.e. they are “global” features). For a task such as classification, I think it is more usual to use global features. But for another task, it might be useful to have a bigger map after the last convolutional layer (detection, maybe?).
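The point about 1x1 versus larger final maps can be checked with simple arithmetic. The sketch below assumes a hypothetical stack of unpadded convolution and pooling layers (the kernel sizes and strides in `LAYERS` are made up for illustration and are not taken from OverFeat) and computes the spatial size of the final feature map for two input sizes.

```python
def conv_out(size, kernel, stride=1):
    """Spatial output size of an unpadded ('valid') convolution or pooling layer."""
    return (size - kernel) // stride + 1

# Hypothetical layer stack as (kernel, stride) pairs; not from any specific paper.
LAYERS = [(5, 1), (2, 2), (5, 1), (2, 2), (5, 1)]

def final_map_size(input_size, layers=LAYERS):
    """Push an input size through every layer and return the final map size."""
    size = input_size
    for kernel, stride in layers:
        size = conv_out(size, kernel, stride)
    return size

print(final_map_size(32))  # side length of the final map for a 32-pixel input
print(final_map_size(36))  # side length of the final map for a 36-pixel input
```

With these assumed layers, a 32-pixel input collapses to a 1x1 map, so each output feature has seen the whole image, while a 36-pixel input leaves a 2x2 map of more local features.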
Hi Danlan,
Actually, if you are doing classification, you will have to use the feature maps of the last conv layer as features and feed them into the final MLP layers to classify, so it’s better to set the final map to 1x1. Those features are also highly abstract due to the depth of the CNN, which might help performance. If you are doing other tasks, such as recognizing parts of a picture, you may want to set the last layer’s feature maps to a larger size.
This is a question to the class: has anyone tried using a DCGAN for the vocal synthesis project?
Alex Lamb and I are working on it. Our generator consists of LSTM layers, and the discriminator contains a mix of convolutional and GRU layers.
*LSTM layers.
Cool! Do you guys have any code up for it?
Very interesting
A VAE uses a latent variable z which has some prior p(z) (often a standard Gaussian), but I don’t see why this prior should hold when we infer z from x. In fact, I think those two distributions (the true prior and the distribution we get when inferring from x) should be different, because no term in the cost is used to impose that constraint. The divergence term seems to do it, but only to some extent. Finally, when generating we sample from the prior of z, which ends up not being the real distribution of z and might cause some problems. Am I right?
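One way to make the question above precise is to average the KL term over the data. This is a standard identity rather than something taken from the lecture material, and the notation is assumed here: p_data(x) is the data distribution, q(z | x) the encoder, q(z) the aggregate distribution of inferred codes, and I_q(x; z) the mutual information between x and z under p_data(x) q(z | x).

```latex
q(z) = \mathbb{E}_{p_{\text{data}}(x)}\left[ q(z \mid x) \right],
\qquad
\mathbb{E}_{p_{\text{data}}(x)}\left[ D_{\mathrm{KL}}\big(q(z \mid x) \,\|\, p(z)\big) \right]
= I_q(x; z) + D_{\mathrm{KL}}\big(q(z) \,\|\, p(z)\big)
```

Read this way, the divergence term does pull the aggregate q(z) toward p(z), but it is traded off against reconstruction, so nothing guarantees that the two match exactly after training.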
Could you train a better discriminator in a GAN by giving the discriminator noised/blurred xs and pure noise (i.e. amount of noise could be the target value), in addition to generated xs, that it has to distinguish from real xs? You wouldn’t backprop through the noisy xs to the generator, but I still think it could force the generator to be better, because if it generated an example that looked like a blurry x, the discriminator wouldn’t be fooled, and you’d maybe get a gradient that would encourage sharper/less noisy images. Has this been tried?
If I understand correctly, you’re saying that, instead of training a discriminator on two classes (real vs. fake), you train it on three classes: real vs. fake vs. blurry versions of real images. The generator will still be optimized to make the discriminator think that its fake images are real, but the discriminator is trained to classify among the three classes?
To summarize what Yoshua said in class: this could work, but it’s possible that the discriminator finds some hack to determine the amount of noise, and that is not very useful to train the generator.
My intuition is along similar lines: I don’t see how this setup would allow the generator to capture any global structure of the image. For example, if you add Gaussian pixel noise, the discriminator can simply look at small local correlations between pixels to predict the amount of noise. So the generator just has to come up with something that’s ‘fairly smooth’ – it doesn’t actually have to produce things that look like the training data.
Also, it was pointed out that, depending on the type of noise, the discriminator could use some very simple hacks (e.g. the mean or variance of the pixels) to figure out whether it’s a real or fake image.
Actually, this gets me thinking: is there a form of parametric noise that would force the discriminator to account for global structure in order to estimate the noise parameters properly? For example, some kind of rotation of a random circular crop somewhere in the image?
Okay, thinking about it even more (apologies for the multi-post), I’m really skeptical that any of these possibilities would do better than the original GAN.
The best possibility that I can think of is to augment the usual GAN with an extra output on the discriminator – so, you are predicting both whether the input image comes from the model or training data, and also how much noise is in the image (you could also add noise to the model generated images). This could maybe act as some form of regularizer or something (but that’s a big maybe)
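The two-output discriminator idea above could be written as a combined loss. This is only a sketch under assumptions: the function name, the use of mean squared error for the noise output, and the weight `lam` are all hypothetical choices, not something proposed in class.

```python
import numpy as np

def discriminator_loss(p_real, is_real, noise_pred, noise_true, lam=1.0):
    """Real/fake cross-entropy plus a noise-level regression penalty.

    p_real     : predicted probability that each input is a training image
    is_real    : 1 for training images, 0 for generated images
    noise_pred : predicted noise level (e.g. std of the added Gaussian noise)
    noise_true : noise level actually added to each input
    lam        : hypothetical weight on the auxiliary noise term
    """
    p = np.clip(np.asarray(p_real, dtype=float), 1e-7, 1 - 1e-7)
    y = np.asarray(is_real, dtype=float)
    # Standard binary cross-entropy for the real-vs-generated output.
    bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    # Squared error between predicted and true noise levels.
    diff = np.asarray(noise_pred, dtype=float) - np.asarray(noise_true, dtype=float)
    mse = np.mean(diff ** 2)
    return bce + lam * mse
```

In this reading, the generator would still be trained only through the real/fake output, and the noise output would act purely as an extra training signal for the discriminator.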
This could probably be done in a DCGAN. Most DCGANs produce images from a random vector which is generated on the fly. Our situation is slightly different: we are given a grainy image, and would like the generator to produce an image that not only fools our discriminator, but also looks like the original grainy image, to then be fed to another discriminator. We basically replace the first generator with an autoencoder-like structure. The grainy image goes in one end, goes through a series of convolutional layers, then is reconstructed with the upscaling structure used in DCGAN. We then downsample the generated image and compare it to the grainy version. Discriminator 2 would then compare that last image with the clean picture.
In section 20.2.1 of the book, how do we get from equation 20.14 to equation 20.15? I do not understand this step.
You start with an equation for an individual hidden unit (eq. 20.14) and multiply over all the hidden units to get the equation for the full hidden layer, hence the product over j. The factor 2h - 1 is 1 for a hidden unit equal to 1 and -1 for a hidden unit equal to 0. Together with the sigmoid identity sigmoid(-x) = 1 - sigmoid(x), this is essentially syntactic sugar for writing the probability of h = 1 or h = 0 as a single expression. Finally, the element-wise (tensor) multiplication appears because all hidden units are being considered at once, not a single h_j.
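Written out, the steps described above look as follows. The notation (v for the visible units, h for the hidden units, c for the hidden biases, W for the weights, \sigma for the logistic sigmoid, \odot for the element-wise product) is assumed to match the book's, and the equation numbers are deliberately left out. The product in the last line relies on the hidden units of an RBM being conditionally independent given v.

```latex
\begin{align}
P(h_j = 1 \mid v) &= \sigma\big(c_j + v^\top W_{:,j}\big) \\
P(h_j = 0 \mid v) &= 1 - \sigma\big(c_j + v^\top W_{:,j}\big) = \sigma\big(-(c_j + v^\top W_{:,j})\big) \\
P(h_j \mid v) &= \sigma\big((2h_j - 1)(c_j + v^\top W_{:,j})\big) \\
P(h \mid v) &= \prod_{j} P(h_j \mid v) = \prod_{j} \sigma\big((2h - 1) \odot (c + W^\top v)\big)_j
\end{align}
```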
I’ve seen in an article that it is possible to force a VAE to learn specific features in its latent variables (e.g. the angle of an object). I think it would be interesting to go over how this is done in class. Also, could other methods use the same trick to learn specific latent variables?
Here is an example of what I was talking about: the Deep Convolutional Inverse Graphics Network paper (5851-deep-convolutional-inverse-graphics-network.pdf).
It’s an interesting trick. Maybe another way to get a similar result would be to truncate the network, learn a vector of targets for the latents corresponding to the features we want to learn, and then reconnect the upper layers and resume training. Not sure how I would actually make that work in practice.
Why is the VAE defined for arbitrary computational graphs? Is this still true in the presence of latent variables with intractable posterior distributions?
Slide 26 explains it well. The VAE is a sort of generative black box. Before, we were trying to approximate data generation using a p(x) that we tried to parametrize. The VAE instead parametrizes a black box (aka “machine”) which approximates data generation. Since this black box is not really dependent on p(x), the VAE makes data generation applicable to a larger range of probabilistic model families (even in the presence of latent variables with intractable posterior distributions).
The textbook describes in words that in VAEs, “Learning then consists solely of maximizing L with respect to the parameters of the encoder and decoder. All of the expectations in L may be approximated by Monte Carlo sampling.”
Could you please write down the mathematical expression for the loss we’re trying to minimize through gradient descent, including in terms of ? I expect it will look something like
given that we’re using Monte Carlo sampling.
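For reference, one common way to write the objective asked about is sketched below. It is not a reconstruction of the expression missing from the comment, and the symbols are assumptions: q_\phi is the encoder, p_\theta the decoder, K the number of Monte Carlo samples, and the encoder is taken to be a diagonal Gaussian with mean \mu_\phi(x) and standard deviation \sigma_\phi(x).

```latex
\begin{align}
\mathcal{L}(\theta, \phi; x)
&= \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
 - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big) \\
&\approx \frac{1}{K} \sum_{k=1}^{K} \log p_\theta\big(x \mid z^{(k)}\big)
 - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big),
\qquad z^{(k)} = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon^{(k)},
\quad \epsilon^{(k)} \sim \mathcal{N}(0, I)
\end{align}
```

Gradient descent would then minimize the negative of this quantity averaged over a minibatch. With a Gaussian encoder and a standard Gaussian prior the KL term has a closed form, so only the reconstruction term needs sampling.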
Help!
Could you please explain how GSNs / DAEs add noise?
I’m not sure how this affects performance in the end. Isn’t that taken care of higher up?
Correction to my question:
Is there such a thing as “good noise”? Is noise really helping?
I found a pretty good explanation from Kyunghyun Cho, January 2013, “Boltzmann Machines and Denoising Autoencoders for Image Denoising”:
“Unlike an ordinary autoencoder, a DAE explicitly sets some of the components of an input vector randomly to zero during learning […]. [This] adds noise to an input vector. It is usual to combine two different types of noise, which are additive isotropic Gaussian noise and masking noise [Vincent et al., 2010]. The first type adds a zero-mean Gaussian noise to each input component, while the masking noise sets a set of randomly chosen input components to zeros. Then, the DAE is trained to denoise the corrupted input.”
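The two corruption types in the quote can be sketched in a few lines of NumPy. The function name, the default values of `sigma` and `mask_prob`, and the order in which the two corruptions are applied are assumptions made for illustration.

```python
import numpy as np

def corrupt(x, sigma=0.1, mask_prob=0.2, rng=None):
    """Apply additive Gaussian noise, then masking noise, to an input array."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    # Additive isotropic Gaussian noise: zero mean, same std for every component.
    noisy = x + rng.normal(0.0, sigma, size=x.shape)
    # Masking noise: each component is zeroed independently with probability mask_prob.
    keep = rng.random(x.shape) >= mask_prob
    return noisy * keep
```

A DAE would receive `corrupt(x)` as input and be trained to reconstruct the clean `x`.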
Noise is good in the sense that values that are close will be considered to be in the same range (due to the noise) and therefore will not carry different information (i.e. they will likely not reconstruct very different inputs).
Is your question specifically about GSN/DAE?
More generally, yes, there is such a thing as “good noise”! And that’s the cool/wonderful thing 🙂
A lot of models can benefit from adding some “good noise” (not just GSNs/DAEs).
One very impressive example (I think) is the following paper
http://arxiv.org/abs/1511.06807
which shows that adding gradient noise in deep networks improves both optimization and regularization.
A quote from Yoshua about this paper:
“It’s really an optimization trick but it regularizes by preferring large-basin areas (otherwise the noise would kick you out of there).”
These are my notes from class, specifically on Benjamin’s comment:
The noise in this paper has more to do with optimization procedures (like simulated annealing). Adding noise to the gradient has a regularizing effect and also helps the optimization get out of local minima.
Noise in GSNs/DAEs has more to do with carving out a good objective function. It helps the model learn what is important (that is, the uncorrupted data point).
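The gradient-noise idea discussed above can be sketched as a small change to a plain SGD step. The decaying variance of the form eta / (1 + t)^gamma and the constants used here are assumptions chosen for illustration, and the quadratic objective is a toy problem, not an experiment from the paper.

```python
import numpy as np

def noisy_sgd_step(params, grad, step, lr=0.1, eta=0.01, gamma=0.55, rng=None):
    """One SGD step with annealed Gaussian noise added to the gradient."""
    rng = np.random.default_rng() if rng is None else rng
    # Noise variance shrinks as training proceeds (assumed schedule).
    variance = eta / (1.0 + step) ** gamma
    noise = rng.normal(0.0, np.sqrt(variance), size=np.shape(grad))
    return params - lr * (grad + noise)

# Toy usage: minimize f(w) = ||w||^2, whose gradient is 2w.
rng = np.random.default_rng(0)
w = np.array([3.0, -2.0])
for t in range(2000):
    w = noisy_sgd_step(w, 2.0 * w, t, rng=rng)
```

Because the noise is annealed, it perturbs the early steps most and fades out later, which is what makes the simulated annealing comparison natural.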
I remember where I saw the explanation of why noise is good. Check out Karol Gregor’s lecture on VAE (explains it at t=10min)
Basically, without noise, we may reconstruct two different outputs x1_r and x2_r from two different inputs x1 and x2. With noise, we create overlap in the latent variables (before reconstruction), since z1 may equal z2 if z1 = u1 + noise1 and z2 = u2 + noise2 are within each other’s noise range. If the latent variables overlap, a single x_r will be reconstructed. This therefore compresses the data.
Thanks for the link Jonathan!
Looking at that part in the video, another question comes to me: is the reparametrization trick (setting z = mu + sigma * epsilon) exactly equivalent to the regular sampling of z~Normal(mu,sigma)? I always thought it was, but if epsilon is sampled uniformly then it would look more like a cylindrical shape, no?
Wow, never mind, epsilon is sampled from a Normal(0,I) as well. Not sure why I thought it was uniform… they should be exactly equivalent then.
That’s correct. Epsilon is sampled from a Normal(0,I).
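The equivalence discussed above is easy to check numerically. In this sketch the values of `mu` and `sigma` are hypothetical stand-ins for encoder outputs on a single latent dimension, and the sample size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.5      # hypothetical encoder outputs for one latent dimension
n = 200_000

z_direct = rng.normal(mu, sigma, size=n)   # regular sampling: z ~ Normal(mu, sigma)
eps = rng.standard_normal(n)               # epsilon ~ Normal(0, 1)
z_reparam = mu + sigma * eps               # reparametrized sample

# Both pairs of (mean, std) should be close to (mu, sigma).
print(z_direct.mean(), z_direct.std())
print(z_reparam.mean(), z_reparam.std())
```

Both sets of samples follow the same distribution. The difference is that in the second form the randomness sits in `eps`, so z is a deterministic, differentiable function of `mu` and `sigma`.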
Thanks for all the replies!
I have a much better understanding now.