Lecture 9, Feb. 4th, 2016: Convolutional Neural Networks II

In this lecture, we will conclude our discussion of the convolutional neural network.

In addition to the material listed for the previous class (Lecture 8), please study the following material in preparation for the class:

We will also be following up on our discussion from the last lecture, Convolutional Neural Networks I.

28 thoughts on “Lecture 9, Feb. 4th, 2016: Convolutional Neural Networks II”

  1. Thomas George says:

    In the AlexNet paper [1] they found that using weight decay (a very small value of 0.0005) not only reduces overfitting but also improves the learning ability of the model, namely that the training error decreases faster. Is this a general result? It seems strange that reducing the capacity of a model could increase its learning efficiency on the training set.

    My guess is that it acts as a regularizer not only between the training set and a test set, but also between different batches of the training set.

    [1] http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
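
    For concreteness, here is a minimal sketch (plain NumPy; names and sizes are made up) of a momentum + weight decay update of the kind used in the paper; the decay term just shrinks every weight a little at each step, on top of the data gradient:

    ```python
    import numpy as np

    # Sketch of an AlexNet-style SGD update with momentum and weight decay.
    # All names and sizes here are illustrative, not from the paper's code.
    rng = np.random.RandomState(0)
    w = 0.01 * rng.randn(256, 128)   # weights of some layer
    v = np.zeros_like(w)             # momentum buffer

    lr, momentum, weight_decay = 0.01, 0.9, 0.0005

    def sgd_step(w, v, grad):
        # Weight decay adds weight_decay * w to the data gradient, so every
        # step shrinks the weights slightly in addition to following the data.
        v = momentum * v - lr * (grad + weight_decay * w)
        return w + v, v

    grad = rng.randn(*w.shape)       # stand-in for a backprop gradient
    w, v = sgd_step(w, v, grad)
    ```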

    • I think this refers to the notion that CNNs make strong prior assumptions about the (spatial) locality of features, and reap the savings by requiring far fewer parameters than would have otherwise been needed. The reuse of parameters from location to location and the subsampling massively reduce the dimensionality of your vector of parameters.
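
      To put rough numbers on that (sizes made up, not from the lecture), compare the parameter counts of a small convolutional layer and a fully connected layer producing an output of the same size:

      ```python
      # Back-of-the-envelope parameter counts for the same input and output sizes,
      # with and without weight sharing (all sizes made up for illustration).
      in_h, in_w, in_c = 32, 32, 3            # e.g. a small RGB image
      out_c, k = 64, 5                        # 64 feature maps, 5x5 kernels

      conv_params = out_c * (in_c * k * k + 1)                       # shared kernels + biases
      fc_params = (in_h * in_w * in_c + 1) * (in_h * in_w * out_c)   # dense layer to a same-sized output

      print(conv_params)   # 4864
      print(fc_params)     # 201392128, i.e. roughly 200 million
      ```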

    • As Yoshua explained in class, this observation actually goes beyond ConvNets to all deep networks.

      There are (at least) 2 main priors that we are implicitly assuming when we build and train deep networks:
      1) distributed representations;
      and 2) depth.

      Point 1) is best understood through the factors of variation in observed data. That is, given some input data points, there are several “high-level dimensions” on which these data points can vary. The example given in class was glasses vs. no glasses, child vs. not a child, and shoes vs. no shoes in images. The key assumption in deep networks is that each of these factors is more or less independent, and each example constitutes a combination of values of these factors (in this case, I think the ‘values of the factors of variation’ can be interpreted as the activations of the neurons in your network). Thus, you can learn about these combinations without having an example for every single possible combination (with n binary factors, that would otherwise require on the order of 2^n examples). This is where the ‘curse of dimensionality’ part comes into play.

      2) refers to the additional fact that higher-level factors can be combined hierarchically, by composing several lower-level factors. One image-related example I can think of is gender: this could be considered a factor of variation, that could depend on a number of ‘lower-level’ factors, such as length of hair, facial structure, etc.

      This should be contrasted with algorithms such as kernel machines, which make smoothness assumptions but don’t make any assumptions about the distributed nature or compositionality of representations.

      It just so happens that these assumptions are very good ones when it comes to real-life data that we work with. And this can be seen as one of the main reasons why deep learning has achieved so much recently.

  2. On the “getting started” page, near the end, it says: “During training, you probably want to split up the sequence in smaller subsequences e.g. of length N, to avoid running out of memory. To do this you will need to implement a Fuel transformer.” But shouldn’t we implement a Fuel IterationScheme instead?

    • bartvanmerrienboer says:

      An iteration scheme alone isn’t enough. An iteration scheme determines which examples are returned; e.g. if I have 10 examples, the iteration scheme can choose examples 2, 5 and 7 to return a batch of size 3.

      In the case of sequence data the situation is slightly different; we technically have only 1 example (1 audio track), which has many time steps. When chopping this up into smaller subsequences we don’t just need to choose an example index, we also need to choose the length of the subsequence, and choose what the target is; you could try to predict T + 1 from [1, T] (n-gram approach) or you could predict [2, T + 1] from [1, T] (like an RNN).

      I will try to write some code to do this in Fuel soon; it’s functionality that has been missing apparently.
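
      In the meantime, here is a rough NumPy sketch (not Fuel) of the two chopping schemes described above:

      ```python
      import numpy as np

      # One long "audio track" stood in for by a toy integer sequence.
      sequence = np.arange(1000)
      N = 10

      # RNN-style: predict [2, T + 1] from [1, T]
      rnn_pairs = [(sequence[i:i + N], sequence[i + 1:i + N + 1])
                   for i in range(0, len(sequence) - N, N)]

      # n-gram-style: predict the single step T + 1 from [1, T]
      ngram_pairs = [(sequence[i:i + N], sequence[i + N])
                     for i in range(len(sequence) - N)]
      ```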

  3. Could we use the feedforward/backprop method to generate content? I.e. to learn the input which minimises the cost on a previously trained network where the parameters (and hyperparameters) are assumed to be constant. This would be something like finding out the most “dog” (or “cat”) image in the dogs vs cats challenge.

    Also, could doing so help diagnose the network? Say, if the most “dog” image is very similar to a training example, we probably overfitted, and if it doesn’t look anything like a dog, we are maybe underfitting (this one isn’t as clear). Also, starting from various inputs (e.g. random noise), do we get about the same result? If so, our model has few local minima, which might not be good in a generative model.

    I guess this would be very computationally expensive compared to a normal generative model. Am I wrong?

    • Yes, this is doable by backprop with respect to the input rather than the parameters. In fact, you can even make the network predict a class with high confidence when the input is actually complete garbage. See this paper for instance: http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Nguyen_Deep_Neural_Networks_2015_CVPR_paper.pdf

      See also DeepDream for psychedelic image generation. Google’s blog post on this is here: http://googleresearch.blogspot.ch/2015/06/inceptionism-going-deeper-into-neural.html
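
      As a toy illustration of backprop with respect to the input: for a logistic-regression “network” the input gradient can be written by hand, and gradient ascent on it pushes an input towards whatever the model calls a cat (the weights here are random, purely for illustration):

      ```python
      import numpy as np

      # Toy "trained model": p(cat | x) = sigmoid(w.x + b), with random weights.
      rng = np.random.RandomState(0)
      w, b = rng.randn(784), 0.0

      def p_cat(x):
          return 1.0 / (1.0 + np.exp(-(w.dot(x) + b)))

      x = rng.rand(784)                  # start from noise (or from a real image)
      lr = 0.1
      for _ in range(100):
          grad_x = (1.0 - p_cat(x)) * w  # gradient of log p(cat | x) w.r.t. the input
          x = np.clip(x + lr * grad_x, 0.0, 1.0)   # ascend, keep pixels in [0, 1]
      ```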

    • -If you start with random noise and optimize the input to match the values of hidden layers (perhaps computed for a specific image), then you’ll basically reproduce that image if you use the lower-level layers, and you’ll get nonsense if you use only the higher-level layers (too many degrees of freedom, since too much of the lower layers’ information has been discarded).

      -If you start with an existing image and try to match higher level hidden layers, you get “DeepDream”.

      -An exciting alternative is to match a stationary summary statistic of the hidden layers rather than matching the exact values. This is what “DeepStyle” and “Texture Synthesis” (from last year’s NIPS) use; a sketch of that statistic follows below.

      -This is much slower than sampling from a normal generative model because you have to run an iterative optimization procedure every single time that you want to “generate”/”predict”.
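
      Concretely, the stationary summary statistic used by those papers is the Gram matrix of a layer’s feature maps: channel correlations averaged over spatial positions, which keeps texture information but discards the spatial arrangement (rough sketch, made-up sizes):

      ```python
      import numpy as np

      # A stand-in hidden-layer activation: (channels, height, width).
      features = np.random.randn(64, 28, 28)

      # Gram matrix: correlations between feature channels, averaged over positions.
      flat = features.reshape(features.shape[0], -1)    # (channels, positions)
      gram = flat.dot(flat.T) / flat.shape[1]           # (channels, channels)
      # Matching `gram` (instead of the raw feature values) is what texture
      # synthesis and style transfer optimize for.
      ```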

    • Yes, it can be used to diagnose a NN. It can even be used to improve the generalization of a network. The idea is to generate adversarial examples. If they look very much like the input, the network isn’t very robust. We can take these examples to further train the network to make it more robust.
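
      As a rough sketch of what generating an adversarial example can look like (fast-gradient-sign style, on a toy linear model so the input gradient is easy to write down; for a real network you would backprop to the input):

      ```python
      import numpy as np

      # Toy model: p(y=1 | x) = sigmoid(w.x + b); weights are random, for illustration.
      rng = np.random.RandomState(0)
      w, b = rng.randn(784), 0.0
      x = rng.rand(784)                          # an input we treat as class y=1

      p = 1.0 / (1.0 + np.exp(-(w.dot(x) + b)))
      grad_x = (p - 1.0) * w                     # gradient of -log p(y=1|x) w.r.t. x

      eps = 0.1
      x_adv = np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)
      # Adversarial training: also train on (x_adv, original label) to make the
      # network more robust.
      ```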

    • “I guess this would be very computationally expensive compared to a normal generative model. Am I wrong?”

      It would be more expensive on a per-example basis, because you’d need multiple iterations for each generation.

      “This would be something like finding out the most “dog” (or “cat”) image in the dogs vs cats challenge.”

      I think that this doesn’t work, because there are so many x values that generate high values for p(y = cat | x). Too many degrees of freedom. What does work is maximizing the L2-norm of the hidden units on a certain layer, which ends up introducing objects from certain classes. This is called DeepDream.

      Maybe you could do something that maximizes p(y = cat | x) while simultaneously maximizing the probability of the generated x being an actual image? Has anyone done this?

  4. Vincent says:

    It is suggested that we apply small distortions to the images while training a CNN classifier to get better generalisation and a more robust network. Considering the invariance of CNNs, isn’t that distortion just a way to have the neurons on the same layer extract similar features? (a bit like maxpool/dropout does model averaging)
    Is dropout effective in RNNs? If so, is there a special way to apply it, or is the regular procedure enough?

    • henry says:

      About dropout for RNNs, rnnDrop has been proposed. The idea is simple: apply the same dropout mask to the whole sequence rather than a different dropout mask at each time frame. I guess that rnnDrop shares some properties with spatial dropout (dropping entire feature maps in CNNs rather than individual activations) in the sense that both methods keep the structure of the input (temporal or spatial) after dropout.
      Here is the link for rnnDrop (in the reference, you can find other dropout methods for RNNs): http://www.stat.berkeley.edu/~tsmoon/files/Conference/asru2015.pdf
      And the link for spatial dropout is here: http://arxiv.org/pdf/1411.4280.pdf
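      The core difference, in a rough NumPy sketch (toy recurrent update; sizes made up):

      ```python
      import numpy as np

      rng = np.random.RandomState(0)
      n_steps, n_hidden, p_keep = 20, 100, 0.8

      # Regular dropout: a fresh mask at every time step
      # (shown only for contrast; the loop below uses the per-sequence mask).
      masks_per_step = rng.binomial(1, p_keep, size=(n_steps, n_hidden))

      # rnnDrop: one mask per sequence, reused at every time step.
      seq_mask = rng.binomial(1, p_keep, size=(n_hidden,))

      h = np.zeros(n_hidden)
      for t in range(n_steps):
          h = np.tanh(h + rng.randn(n_hidden))   # stand-in for the real recurrent update
          h = h * seq_mask / p_keep              # same units dropped for the whole sequence
      ```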
      -henry

    • Vincent says:

      Here is a quick recap of Prof. Bengio’s answers:

      For the first question: no. The distortions aren’t random distortions; they capture something that should be invariant (even rotated or translated, a cat is a cat). You can also add noise to an image and train on it to reinforce this prior.
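
      A small sketch of what such label-preserving distortions could look like (using scipy.ndimage on a stand-in image; the exact ranges are made up):

      ```python
      import numpy as np
      from scipy import ndimage

      rng = np.random.RandomState(0)
      image = rng.rand(32, 32)                       # stand-in for a training image

      angle = rng.uniform(-10, 10)                   # small rotation, in degrees
      dy, dx = rng.uniform(-2, 2, size=2)            # small translation, in pixels

      augmented = ndimage.rotate(image, angle, reshape=False, mode='nearest')
      augmented = ndimage.shift(augmented, (dy, dx), mode='nearest')
      augmented += rng.normal(0, 0.01, size=augmented.shape)   # a bit of pixel noise

      # `augmented` is still trained with the original label: a rotated cat is a cat.
      ```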

      For the second question: a recent approach is to use the same dropout mask at every time step. Someone should be posting a link to an article with that approach.

  5. Hello,
    I have a question about the slides of the last lecture. On page 22 of the CNN slides there is a big figure of an example CNN architecture, where the last layer after pooling and subsampling has units of size 1×1 only. When we design a CNN architecture, should we always arrive at size 1×1 after the last pooling and subsampling, or does that choice depend on the application and domain? Thanks

      • Vincent says:

        I believe that WordPress comments are interpreted with Markdown syntax. This would explain why the star symbol is interpreted as italics.

        Just for fun, I’ll try to do a table in Markdown and see if it gets interpreted:

        | A    | B        | C     |
        |:-----|:--------:|------:|
        | left | centered | right |

    • Answer to this question:
      Yes, it depends on the application. If we want to do classification, where each input instance corresponds to one class, then the last layer after pooling/subsampling should be 1×1. However, if we want to do recognition or detection (where we don’t know in advance where the part to be detected is located…), we should increase the size of the last layer after pooling.
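
      A quick sketch of how the spatial size shrinks to 1×1 through a (made-up) stack of valid convolutions and non-overlapping poolings:

      ```python
      # Tracking the feature-map size through conv / pool layers (made-up architecture,
      # 'valid' convolutions, non-overlapping 2x2 pooling).
      def conv_out(size, kernel, stride=1):
          return (size - kernel) // stride + 1

      def pool_out(size, pool):
          return size // pool

      size = 32                                 # 32x32 input
      size = pool_out(conv_out(size, 5), 2)     # conv 5x5 -> 28, pool 2x2 -> 14
      size = pool_out(conv_out(size, 5), 2)     # conv 5x5 -> 10, pool 2x2 -> 5
      size = conv_out(size, 5)                  # conv 5x5 -> 1x1 for classification
      print(size)                               # 1; for detection you would keep it larger
      ```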

    • This is an issue that I understand well in practice, but that I’d really like to understand more formally through statistics.

      Suppose you’re training an RNN to generate a sequence (like the words in a sentence). At each time step, your RNN takes the word x[t – 1] as an input and produces p(x[t]) as an output. At train time, a standard method is to use “teacher forcing”, which means that the input to the RNN is always the observed x[t – 1]. In practice, the issue with instead using the model’s generated x[t – 1] as the input is that it can lead to training the model to condition on very unnatural input sequences, which makes training harder. Samy Bengio recently proposed “Scheduled Sampling”, which starts by doing teacher forcing (i.e. feeding in observed inputs) and slowly switches over to using the model’s samples as inputs.

      I think that there’s a simple proof by induction which establishes that BPTT with teacher forcing is statistically consistent but there are also simple examples where it has high finite sample bias.
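
      A rough sketch of that training-loop difference (toy stand-ins for the recurrent update and the sampler; teacher_forcing_prob would be annealed from 1 towards 0 during training):

      ```python
      import numpy as np

      rng = np.random.RandomState(0)

      def rnn_step(h, x):                # stand-in for the real recurrent update
          return np.tanh(h + x)

      def sample_from_model(h):          # stand-in for sampling x[t] from p(x[t] | h)
          return rng.randn(*h.shape)

      sequence = [rng.randn(10) for _ in range(50)]   # observed x[1..T]
      teacher_forcing_prob = 0.75

      h, x_in = np.zeros(10), sequence[0]
      for t in range(1, len(sequence)):
          h = rnn_step(h, x_in)
          # ... the loss at step t compares the model's prediction with sequence[t] ...
          if rng.rand() < teacher_forcing_prob:
              x_in = sequence[t]              # teacher forcing: feed the observed value
          else:
              x_in = sample_from_model(h)     # scheduled sampling: feed the model's own sample
      ```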

  6. Vincent says:

    If we have time in class today, could we see a quick example of an LSTM? I am not sure I understand how/where the mask is computed, nor the “peephole connection” variant.

  7. Hello,

    My question concerns the video and image recognition presented at the NIPS conference. I was wondering whether the architecture of a network that analyzes video has to be very different in order to take the dynamic nature of the inputs into account, or whether we simply need more computing power to handle the dynamics.

    Ex: in the NIPS video, the real-time recognition of pedestrians.

  8. Geoff Hinton, in a Reddit AMA (https://www.reddit.com/r/machinelearning/comments/2lmo0l/ama_geoffrey_hinton), said:

    “The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.

    If the pools do not overlap, pooling loses valuable information about where things are. We need this information to detect precise relationships between the parts of an object. It’s true that if the pools overlap enough, the positions of the features will be accurately preserved by ‘coarse coding’. But I no longer believe that coarse coding is the best way to represent the poses of objects relative to the viewer.

    I think it makes much more sense to represent pose as a small matrix that converts a vector of positional coordinates relative to the viewer into positional coordinates relative to the shape itself. This is what they do in computer graphics. It explains why you can’t see a shape without imposing a rectangular coordinate frame on it, and if you impose a different frame, you can’t recognize it as the same shape. Convnets have no explanation for that.”

    Do you agree with this? Most of these criticisms seem to be derived from Geoff’s desire for biological plausibility. Are there any other theories for a more biologically plausible pooling operation in ConvNets?

    • Melvin Wong says:

      I agree with Hinton on this. My theory is that overlapping is a very crude way of managing spatial information.

      Humans recognize images through something called the gestalt effect, where objects are perceived as a whole, independently of their parts. In conv nets, the grid-like search may not be able to capture the object’s structural information, and the network assumes that the image is a collection of its parts, which is not how humans recognize images.

      For example, say you have an image of a cat and an image of a cat’s shadow. A NN may classify both images as cats, although with lower confidence on the cat’s shadow, but we humans will know immediately that the shadow image is a shadow, not a cat. It may have the features and shape of a cat (ears, tail, etc.), but we ignore its parts and perceive the image as a whole.

      • Can you clarify how the notion of gestalt is inconsistent with Convnets? It’s true that convnets go from low level features to high level features, whereas humans only consciously think of high level attributes.

        However it’s possible that the brain also does this low-level feature processing but that the mind isn’t consciously aware of it.

        But this does raise an interesting question. If a person sees an ambiguous object (like the duck-rabbit or a necker cube), what is happening in the brain and the visual processing of the image? Are low-level features still interpreted the same way or are they re-interpreted? If the latter, then this is quite different from how a convnet works. One explanation is that the brain is doing some kind of MCMC over a distribution over theories for what the object is (i.e. duck vs. rabbit), but Yann LeCun has argued that the consistent timing in theory-switching rules out this explanation.

      • Melvin Wong says:

        @thenuttynetter
        Our mind can fill in gaps in images where nothing is actually present. This is our ‘low-level’ feature processing power.

        Yes, convnets do look at both low- and high-level features, but mimicking human perception of objects through gestalt is still not possible (as far as I know).
