Lecture 13, Feb. 18th, 2016: Regularization I

In this lecture, we will have a rather detailed discussion of regularization methods and their interpretation.

Please study the following material in preparation for the class:

Slides from class


24 thoughts on “Lecture 13, Feb. 18th, 2016: Regularization I”

  1. The first statement in Hinton’s course on countering overfitting is to get more data, because given more data a network generalises better. This might be nice for machine learning, but it does not seem to bring us any closer to how humans operate. For example, most humans (and even most babies) are capable of properly classifying an object after seeing it only once. I believe that being more intelligent means either being able to draw better conclusions from a set of information or being able to arrive at the right conclusion with less information.

    Are there any types of algorithm that are well equipped to replicate this aptitude? If there are, why aren’t we piling all our efforts into those types of algorithms instead of finding clever hacks? Could we somehow introduce some sort of deductive reasoning to help in that regard?

    • Olivier Mastropietro says:

      To add to this topic, is there a suspected relation or trend between the amount of data you have and generalization? For example, do people suspect it is linear, or is there some insight suggesting it is more like a tipping point, where generalization improves gradually with more data and then, past a certain point, improves significantly?


    • I think the comparison is a bit unfair. Even at birth, the human brain already encodes vast amounts of knowledge about the structure of the world, and a baby’s brain benefits from a whole childhood’s worth of video, every waking second of the day.

      For instance, when I see a new object, and then see it again from a different perspective, I’m still able to recognize it because the human brain comes built-in with at least some notion of geometry and of how objects’ appearance changes as their relative pose changes. I also benefit from a lifetime of practice in this skill, and can mentally rotate objects, guess their appearance from that new view, and confirm or refute my guesses as I walk around said object.

      Consider now a generic MLP. It’s a tabula rasa: At initialization time it literally knows Nothing with a capital N. It has no notion of the geometry of our world and we don’t provide it (indeed, how would one even codify such abstract ideas in a neural net?). Is it any surprise neural nets underperform given the same amount of data as a human?

      And that was just geometry. There are so many other laws that govern our world, and nature or nurture has baked knowledge of them into our brains.

    • Joshua Tenenbaum and his team have published work that I believe could be of interest: a NIPS 2013 paper and, last December, a paper in Science. The title of their NIPS paper is “One-shot learning by inverting a compositional causal process”. I didn’t go through it in detail, so I’ll just copy the abstract here:

      “People can learn a new visual class from just one example, yet machine learning algorithms typically require hundreds or thousands of examples to tackle the same problems. Here we present a Hierarchical Bayesian model based on compositionality and causality that can learn a wide range of natural (although simple) visual concepts, generalizing in human-like ways from just one image. We evaluated performance on a challenging one-shot classification task, where our model achieved a human-level error rate while substantially outperforming two deep learning models. We also tested the model on another conceptual task, generating new examples, by using a “visual Turing test” to show that our model produces human-like performance.”

      Here’s the NIPS paper: http://www.cs.toronto.edu/~rsalakhu/papers/lake_nips2013.pdf

      Here’s the Science paper: http://science.sciencemag.org/content/350/6266/1332.full-text.pdf+html


    • assyatrofimov says:

      I contest what you are saying! 🙂
      I don’t think humans learn via only seeing something once.
      We do something that I believe is called “transfer learning”. Briefly, we use data we have learned about previously and in other contexts to extrapolate a most plausible answer. One key element for humans is long term memory.

  2. If I understand correctly, a regularizer is a technique which has the effect of worsening performance on the training set and simultaneously improving performance on the validation/test set.

    However, I could see a method working as a regularizer on a small dataset improving both train and test performance when applied to a larger dataset.


    • I don’t think regularization is precisely defined. In one paragraph of the deep learning textbook it says regularization is “strategies to reduce test error, possibly at the expense of increased training error”, and a few paragraphs later says it is defined as “any modification to a learning algorithm intended to reduce generalization error but not training error”.

    • Related question: when we speak of regularization, do we speak of regularizing the parameters, or the model? For instance, an L2 penalty is placed on the parameters, while we could do something like impose some cost on some function of the activations of a neural network, which would be tied to the training data as well as the parameters. Are there examples of such data-dependent regularizers?

      • The example given in class was contractive autoencoders, from this 2011 ICML paper: http://www.icml-2011.org/papers/455_icmlpaper.pdf

        Pretty much just as described by @Faruk, they add a term to the cost function which penalizes the derivatives of the activation functions of the hidden layer wrt inputs. They show that computing their penalty is similar to computing the reconstruction error of a denoising autoencoder.

        Another example is this paper http://arxiv.org/pdf/1511.08400.pdf which penalizes the squared difference of the norms between successive hidden states. They call it norm-stabilization, and it gets SOTA on TIMIT … It makes some kind of intuitive sense to me that regularizing the differences between activations is better than regularizing the activations, but I can’t really come up with a good explanation why.

        Would it make sense / has it been tried to apply the norm-stabilization reasoning to stacked autoencoders (penalize the squared difference of norms between layers instead of ‘within’ the layer)?

        Does anyone know of other work in this area, besides these two regularizers, published from 2011 to now?


  3. In Hinton’s lecture 9a, he said “When the weights are very small, every hidden unit is in its linear range” and “As the weights grow, the hidden units start using their non-linear ranges so the capacity grows.” I do not understand what it means for a hidden unit to be “in its linear range” or in its non-linear range.


    • In that video he assumes that the hidden units are sigmoids, and from a plot of the sigmoid you can see that from about -1.5 to +1.5 on the x-axis the curve is close to a straight diagonal line. He says that if the weights are very small then input*weight will also be very small, so the hidden layer behaves as if it had no nonlinearity at all, which defeats its purpose.

    • If you are using a sigmoid \frac{1}{1+\exp(-x)}, the neighbourhood very close to x=0 looks a lot like \frac{1}{4}x + \frac{1}{2}. If very low weights choke the pre-activations too much, they’ll make the sigmoid operate only within this close-to-x = 0 neighbourhood, and so the neuron will behave like a \frac{1}{4}x + \frac{1}{2} linear neuron even though it’s a sigmoid. Now consider if all sigmoids in a layer k behaved linearly. Then the pre-activations of layer k+1 would just be a matrix-multiply by W_{k+1} of a matrix-multiply by W_{k} of the inputs of layer k, and that’s the same thing as replacing both layers with just a single layer whose weight matrix is the matrix multiply of the weights of the old layers: W_\textrm{fused} = (W_{k+1}W_{k}). That means you’re paying for 2 linear + 1 nonlinear layers worth of training time and parameters, but you’re getting only 1 linear + 0 nonlinear layers – a rip-off.

      This collapse will always happen whenever you have consecutive linear or linear-acting layers. This is why it’s important to 1) Always alternate linear with nonlinear and 2) Make sure non-linear actually acts like non-linear.

      If the weights are made bigger, the pre-activations will be bigger, and the non-linear “squash” effect of the sigmoid becomes more obvious. A good sigmoid neuron should have weights that allow the output to ride on its non-linear bends reasonably often; Then the rip-off doesn’t happen.
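      As a rough numerical check of the two claims above, here is a minimal sketch in plain Python. The weight values and the helper names (`linear_approx`, `matvec`, `matmul`) are made up for illustration, and biases are left out for brevity; with the offset of 1/2 included, the fused map would be affine rather than strictly linear, but it would still be a single layer.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear_approx(x):
    # First-order Taylor expansion of the sigmoid around x = 0.
    return 0.25 * x + 0.5

def matvec(W, v):
    # Multiply matrix W (list of rows) by vector v.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def matmul(A, B):
    # Multiply matrix A by matrix B (both lists of rows).
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

# Near zero the sigmoid and its linear approximation almost coincide;
# far from zero they clearly do not.
small_gap = abs(sigmoid(0.1) - linear_approx(0.1))
large_gap = abs(sigmoid(4.0) - linear_approx(4.0))

# Two purely linear layers collapse into one layer with fused weights.
W_k = [[0.1, -0.2], [0.3, 0.05]]    # hypothetical weights of layer k
W_k1 = [[0.2, 0.1], [-0.1, 0.4]]    # hypothetical weights of layer k+1
x = [1.0, 2.0]
two_layers = matvec(W_k1, matvec(W_k, x))
fused = matvec(matmul(W_k1, W_k), x)
```

      Here `two_layers` and `fused` agree up to floating-point rounding, which is the collapse described above.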


  4. Florian Bordes says:

    About regularization with sparse representations: we place the penalty on the activations of the units instead of on the model parameters. As a consequence, we encourage the network to have sparse activations (with the L1 norm). How will this affect the model parameters? Does it make sense to combine this kind of regularization with weight decay? Are there any cases where placing the penalty on the activations is more effective than placing it on the model parameters?


    • Florian Bordes says:

      To add some comments from the discussion in class: with sigmoid units, the weights will become large or the bias very small. Penalties on the activations with the L1 norm have been used in unsupervised learning, but less so in supervised learning.


  5. In the second video for lecture 9 from Hinton’s course, he compares weight penalties and weight constraints (see last slide of the video). He left some of the details out so I am trying to fill them in here.


    First, he said “we usually penalize each weight separately”. Does this mean instead of minimizing the penalized cost function

    Cost + p || w||^2

    where w are all the parameters of the model, we minimize

    Cost + p1 || w1||^2 +p2||w2||^2 + … + pk||wk||^2

    where the w’s are k different sets of parameters and we want to penalize each set (potentially) differently? For example, w1 can be all the weights in the first layer, w2 the weights in the second layer and so on.


    Second, he introduced a way of constraining weights that is different from the constrained optimization via Lagrange multipliers that I am familiar with, which is equivalent to adding one or more penalties to the cost function as we did in part 1. He suggests to “put a constraint on the maximum squared length of the incoming weight vector of each unit. If an update violates this constraint we scale down the vector of incoming weights to the allowed length”.

    Is this in the ball park of what he meant?

    Here, we have a hyperparameter M > 0 and we constrain ||w|| <= M, but instead of doing constrained optimization we do

    Step 1. One step of gradient descent computed using the original cost function to update w

    Step 2. Check whether ||w|| <= M. If the constraint is satisfied, no action is taken. Otherwise, rescale w so that its norm is at most M again.

    Repeat until convergence.

    In his video, no mention is made of how much we should rescale w relative to its maximum allowed length. Does this introduce another hyperparameter we need to learn? Or do we just set it to something arbitrary like 0.8?

    Also, is there any advantage to this way of doing constrained optimization over the conventional way of converting the constrained problem to an unconstrained one via Lagrange multipliers and minimizing the cost function plus a penalty associated with the constraint?

    The next question is about adding noise in the activities as a regularizer


    Here I am talking about the method introduced in the third video of Hinton’s lecture 9.

    Suppose we have an MLP with logistic units. We are told to “make the (logistic) units binary and stochastic on the forward pass, but do the backward pass as if we had done the forward pass properly ”

    It is claimed that “it does worse on the training set and trains significantly slower”. Any regularization makes performance worse on the training set, but why would the training be slower (and significantly so!) as well?

    Also, what differentiates adding noise to the activities from adding noise to the input, from a conceptual point of view? And why discretize the activities instead of adding some noise to them (making sure they stay in [0,1] while doing it), as we would do for the inputs?


    • Hokay, here goes!

      This is the discussion we had in class:

      [weight penalization / norms]

      “Penalizing each weight separately” does not mean parameterizing the weights separately. The squared L2 norm naturally just pushes each weight down independently of the others:

      \lambda||w||^2 = \lambda\sum_{i} w_i^2

      Notice that this sum only contains terms in w_i; there is no interaction between w_i and w_j.

      Compare this to a situation like the L1,2 norm:

      \lambda||w||_{1,2} = \lambda\sum_{i} \sqrt{\sum_{j} w_{ij}^2}

      This introduces group behaviour – as soon as one value is “off”, the cost for all others to be “on” goes up.
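      To make the contrast concrete, here is a small sketch in plain Python that compares the gradient of each penalty with respect to a single weight. The matrices `W_a` and `W_b` and the function names are invented for illustration; the only difference between the two matrices is one entry in the same row (group) as the weight we differentiate with respect to.

```python
import math

def l2_penalty(W, lam):
    # lam * sum of squared entries: each weight contributes independently.
    return lam * sum(w * w for row in W for w in row)

def l12_penalty(W, lam):
    # lam * sum over rows (groups) of the Euclidean norm of each row.
    return lam * sum(math.sqrt(sum(w * w for w in row)) for row in W)

def l2_grad_entry(W, lam, i, j):
    # d/dw_ij of lam * sum w^2 depends on w_ij only.
    return 2.0 * lam * W[i][j]

def l12_grad_entry(W, lam, i, j):
    # d/dw_ij of lam * ||row_i|| depends on the whole row (assumed non-zero).
    norm = math.sqrt(sum(w * w for w in W[i]))
    return lam * W[i][j] / norm

W_a = [[3.0, 4.0], [0.0, 1.0]]
W_b = [[3.0, 0.5], [0.0, 1.0]]   # only W[0][1] differs from W_a

g2_a, g2_b = l2_grad_entry(W_a, 1.0, 0, 0), l2_grad_entry(W_b, 1.0, 0, 0)
g12_a, g12_b = l12_grad_entry(W_a, 1.0, 0, 0), l12_grad_entry(W_b, 1.0, 0, 0)
```

      Under the L2 penalty the gradient on W[0][0] is the same for both matrices, while under the L1,2 penalty it grows when its neighbour W[0][1] shrinks, which is one way to read the group behaviour.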

      [constrained optimization]

      2a) Yes, the algorithm you describe for checking if an update violates a constraint is correct.
      2b) You would divide by the norm and multiply by the bound, so the rescaled vector has length exactly equal to the bound. I don’t believe that the bound is learned, so yes, it would add a hyperparameter as far as I can see.
      2c) We’re doing stochastic gradient descent, so there isn’t an obvious way to do this with a lagrange multiplier (LM). When you do gradient descent with LM, you’re trying to minimize over a cost while simultaneously maximizing over LM parameters. This is trickier than just minimizing over the cost and then projecting on the constraint to check it. To support/expand on this, I found this stackoverflow response helpful: “The problem is that when using Lagrange multipliers, the critical points don’t occur at local minima of the Lagrangian – they occur at saddle points instead. Since the gradient descent algorithm is designed to find local minima, it fails to converge when you give it a problem with constraints” http://tinyurl.com/gsoqhjj
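      The update-then-rescale procedure from 2a and 2b can be sketched as follows. This is an illustrative version in plain Python, not Hinton’s own code: the function names, learning rate, and gradient values are assumptions, and `max_norm` plays the role of the bound M from the question.

```python
import math

def project_max_norm(w, max_norm):
    """Rescale w onto the ball of radius max_norm if it lies outside it."""
    norm = math.sqrt(sum(v * v for v in w))
    if norm <= max_norm:
        return list(w)              # constraint satisfied: no action
    scale = max_norm / norm         # divide by the norm, multiply by the bound
    return [v * scale for v in w]

def sgd_step_with_max_norm(w, grad, lr, max_norm):
    # Step 1: plain gradient step on the original (unpenalized) cost.
    w = [v - lr * g for v, g in zip(w, grad)]
    # Step 2: project back onto the constraint set ||w|| <= max_norm.
    return project_max_norm(w, max_norm)

# Hypothetical incoming weight vector and gradient for a single unit.
w_new = sgd_step_with_max_norm([3.0, 4.0], [-30.0, -40.0], lr=0.1, max_norm=2.0)
w_new_norm = math.sqrt(sum(v * v for v in w_new))
```

      In this sketch the vector lands exactly on the boundary after a violation, so the only extra hyperparameter is the bound itself.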

      [noise as a regularizer]

      3a) An MLP with binary stochastic units trains slower for three reasons (that we discussed):

      3a i) You’re adding stochasticity to the direction of the update
      3a ii) You’re quantizing things to 0,1, so you’re losing information, and therefore have reduced capacity
      3a iii) The true gradient of binary stochastic units w.r.t. activations is 0, so really, we should not be able to do anything (there should be no weight updates). But we just use the gradient as though the unit had been a sigmoid, and it just seems to work pretty well.
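      Point 3a iii) can be illustrated with a minimal single-unit sketch in plain Python. The function names are hypothetical and this is only one way to write the trick described above: sample a binary output on the forward pass, then backpropagate as if the output had been the sigmoid probability.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward_binary_stochastic(pre_activation, rng):
    """Sample a 0/1 output with probability sigmoid(pre_activation)."""
    p = sigmoid(pre_activation)
    h = 1.0 if rng.random() < p else 0.0
    return h, p

def backward_as_if_sigmoid(grad_output, p):
    """Pretend the forward pass returned p and use the sigmoid derivative."""
    return grad_output * p * (1.0 - p)

rng = random.Random(0)
samples = [forward_binary_stochastic(0.0, rng)[0] for _ in range(10000)]
mean_activation = sum(samples) / len(samples)
grad = backward_as_if_sigmoid(1.0, sigmoid(0.0))
```

      Each individual sample is 0 or 1, so every forward pass is noisy, but the average activation stays close to the sigmoid output, and the surrogate gradient is non-zero even though the sampled output is piecewise constant.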

      (LaTeX is a bit long to put in comments, so I put a more detailed version of these explanations with equations in a blog post here: https://teganmaharaj.wordpress.com/2016/02/23/why-binary-stochastic-units-train-slower/)


  6. From the textbook: “In practice, an overly complex model family does not necessarily include the target function or the true data generating process, or even a close approximation of either. We almost never have access to the true data generating process so we can never know for sure if the model family being estimated includes the generating process or not.”

    In this case, is any overly complex model still high-bias with respect to the true data generating process? If so, would data augmentation (when feasible) be better than regularization, in the sense that regularization is somehow limiting our model?


  7. From Hinton’s lecture 9f, can you comment on the weight penalty selection method invented by MacKay? It’s not exactly clear to me why it works. What are the downsides? If it’s so good, why don’t we hear more about it?


    • julienstpf says:

      I think it somehow makes the regularization coefficient \lambda learnable, because it iteratively updates the prior variances of p(t|w,x) and p(w) instead of giving \lambda a fixed value (let’s say \lambda = 0.01).


  8. This is not a question per se, I’m just writing an answer to the question I asked in today’s lecture. I was curious as to what effect L2 regularisation has on convergence time, since I vaguely recalled reading a report on the net (must’ve been a Kaggle competition) which said that adding L2 helped with convergence. The report suggested this was most likely because the penalty flattens out local minima, making them easier to jump out of during training (because of SGD). One of the papers that Yoshua referred to is “tricks of the trade” by Yann LeCun: http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf

    Paraphrasing page 32 of that paper, you want the ratio of the largest to the smallest eigenvalue of the Hessian to be as small as possible, as a larger ratio will give you an error surface that is more “taco shaped” (see figure 21). I suppose this is because we assume during gradient descent that we’re making very small (infinitesimal) steps, and if the loss surface is too steep in some directions we run into trouble.
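    One way to connect this eigenvalue ratio to L2 regularisation, sketched here under the assumption that the penalty is written as (lam / 2) * ||w||^2: such a penalty adds lam times the identity to the Hessian, so every eigenvalue shifts up by lam and the ratio of largest to smallest shrinks. The eigenvalues and function names below are made up for illustration.

```python
def condition_number(eigenvalues):
    """Ratio of the largest to the smallest eigenvalue (all assumed > 0)."""
    return max(eigenvalues) / min(eigenvalues)

def with_l2(eigenvalues, lam):
    # Adding (lam / 2) * ||w||^2 to the loss adds lam * I to the Hessian,
    # which shifts every eigenvalue up by lam.
    return [e + lam for e in eigenvalues]

# Hypothetical Hessian eigenvalues of a badly conditioned loss surface.
hessian_eigs = [0.01, 10.0]
plain = condition_number(hessian_eigs)
regularized = condition_number(with_l2(hessian_eigs, 0.1))
```

    In this toy case the ratio drops from roughly 1000 to under 100. That is only the local, quadratic picture, so it does not by itself settle whether L2 helps convergence on a real network.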

    I remember the Hessian/eigenvalue thing being mentioned in one or two previous lectures but it’s good to hear about it again. 🙂


  9. Bonjour/Hi,
    I have a question about early stopping, mentioned in Hinton’s lecture video, where he explains why early stopping works. He says that when training begins, the weights are initialized to very small values, so in the early phase of training the activations are in the linear range of activation functions such as the sigmoid. The whole net therefore has very small capacity and even acts as a linear model. Then the weights grow, and the activations are no longer in the linear range of the activation functions. So early stopping helps to limit the capacity of the model and can prevent overfitting. I’m very curious: is this the reason why most of the activation functions we use have a linear range near 0? And for activation functions such as ReLU, the whole range > 0 is linear, so even if the weights grow big, the capacity is still not big enough to overfit. Does this make ReLU more popular as an activation function?


    • Transcribing the answer from today’s lecture: it is not the reason why most activation functions have a linear range near 0. For ReLU, if you assume a multi-layer MLP where each layer’s activation is a ReLU and the final layer is a softmax, then the larger the weight matrices W1, W2, etc. are (and therefore the larger the ReLU activations), the more inflated the largest values of p(y|x) (the softmax output) are going to be, meaning the network is going to be more confident about a prediction and therefore overfit. So even for a ReLU, you want to constrain the weights if doing so helps you mitigate overfitting.
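      The effect described above can be seen in a tiny two-layer ReLU network, sketched here in plain Python. The weights, the input, and the `scale` argument are all invented for illustration, and biases are omitted; multiplying both weight matrices by the same factor leaves the predicted class unchanged but pushes the softmax output towards 0 and 1.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    # Multiply matrix W (list of rows) by vector v.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(z):
    m = max(z)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def predict(W1, W2, x, scale=1.0):
    # Multiply both weight matrices by `scale` before the forward pass.
    h = relu(matvec([[scale * w for w in row] for row in W1], x))
    logits = matvec([[scale * w for w in row] for row in W2], h)
    return softmax(logits)

W1 = [[1.0, -0.5], [0.3, 0.8]]       # hypothetical first-layer weights
W2 = [[0.6, -0.2], [-0.4, 0.9]]      # hypothetical output-layer weights
x = [1.0, 1.0]
p_small = predict(W1, W2, x, scale=1.0)
p_large = predict(W1, W2, x, scale=3.0)
```

      With the larger weights the winning class keeps its rank but its probability moves much closer to 1, which is the overconfidence that constraining the weights is meant to limit.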

