Lectures

Lecture 2, Jan. 11, 2016

Today we will finish our overview of Machine Learning and dive into a detailed review of Neural Networks.

Please study the following material in preparation for the lecture:

Hugo Larochelle’s video lectures 1.1 to 1.6.
Chapter 6 of the Deep Learning textbook (MLPs) (sections 6.1 and 6.2)

Do not forget to leave questions / comments / answers.

Additional reference material:

Hinton’s coursera lecture 1, videos 1 to 5.

38 thoughts on “Lecture 2, Jan. 11, 2016”

tfjgeorge says:

Questions

Regarding the rectifier activation function:
– when should one choose it in place of a sigmoid or a tanh?
– why use a rectifier rather than softplus, which looks like a smoothed variant?
– why use the standard rectifier instead of a variant such as:
1. a rectifier bounded above by a maximum value
2. a non-monotonic rectifier
3. a rectifier that ranges from -1 to 1
4. a linear function from -\infty to +\infty with a plateau around 0
I plotted these ideas here: https://github.com/tfjgeorge/ift6266/blob/master/Activation%20functions.ipynb

Regarding biological neurons:
– are there actually tied weights between convolution-like filter neurons in the human brain? This would imply some sort of link between neurons that carries the weights, which seems very unlikely to me
– what are the current results in modelling a realistic biological neuron that receives spikes and responds to variations in spike frequency?

January 10, 2016 at 5:41 pm
- Christopher Beckham says:
  
  Regarding your first question, two nice things about ReLUs are that they don’t suffer from gradient saturation on the positive side, since dy/dx = 1 when x > 0, and that they are cheap to compute. I’m not sure I’d ever use sigmoid / tanh for hidden-layer activations (i.e. not output units) anymore. If I didn’t have ReLUs, I would go with tanh over sigmoid since the former has a greater range of “gradient non-saturation” (its output lies between -1 and +1, rather than between 0 and 1).
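To make the saturation argument concrete, here is a minimal Python sketch (not from the thread; the helper names are illustrative) that evaluates the derivative of each activation at a few positive inputs. It only assumes the textbook formulas for the derivatives.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    # Derivative of the logistic sigmoid: s * (1 - s), at most 0.25.
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    # Derivative of tanh: 1 - tanh(x)^2, at most 1.
    return 1.0 - math.tanh(x) ** 2

def d_relu(x):
    # Convention used here: gradient 1 for x > 0, otherwise 0.
    return 1.0 if x > 0 else 0.0

for x in (0.5, 2.0, 5.0):
    print(f"x={x}: sigmoid'={d_sigmoid(x):.5f} "
          f"tanh'={d_tanh(x):.5f} relu'={d_relu(x):.1f}")
```

The sigmoid and tanh derivatives shrink towards 0 as x grows, while the ReLU derivative stays at 1 for every positive input.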
  
  January 11, 2016 at 1:04 am
  - Olexa Bilaniuk says:
    
    ReLUs are dirt cheap to compute. Thanks to a property of the IEEE-754 binary representation of floating-point numbers (non-negative values compare correctly when their bit patterns are interpreted as integers), little new FP comparison hardware needs to be added beyond that already present for integer math (ignoring sign handling and NaNs). Integer comparisons are essentially an integer subtract, and those cost almost nothing, so floating-point comparisons also cost almost nothing.
    
    ReLUs do suffer from their gradient being 0 and can “die” if initialized or pushed too far into the negative, since training might no longer be able to make the ReLU bias “escape”.
    
    I have an intuition that ReLUs are that good because their sharp edge lets them “carve” clean delimiting cuts; something is either on the live side and can be amplified, or is on the cut-away side and is totally suppressed. But I might be mistaken.
    
    tanh vs sigmoid is where I disagree. There is almost no difference between them except input scale and output range; in fact they can be defined in terms of each other (https://brenocon.com/blog/2013/10/tanh-is-a-rescaled-logistic-sigmoid-function/). Thus, the “range of gradient non-saturation” needn’t be a problem; the learning algorithm will simply learn different weights and biases in the layers before and after.
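The rescaling mentioned above is the identity tanh(x) = 2·sigmoid(2x) − 1. A small Python check of that identity follows; the function name `tanh_via_sigmoid` is illustrative and not from the thread.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    # tanh expressed as a rescaled, shifted logistic sigmoid.
    return 2.0 * sigmoid(2.0 * x) - 1.0

for x in (-3.0, -0.5, 0.0, 1.2, 4.0):
    gap = abs(math.tanh(x) - tanh_via_sigmoid(x))
    print(f"x={x}: |tanh(x) - (2*sigmoid(2x) - 1)| = {gap:.2e}")
```

Because the two functions differ only by an affine change of input and output, a network using one can in principle represent the same functions as a network using the other by adjusting weights and biases.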
    
    January 11, 2016 at 6:44 am
  - Christopher Beckham says:
    
    @Olexa, thanks for that. I think I mis-read what was being explained here:
    
    http://cs231n.github.io/neural-networks-1/
    
    In that link it mentions that tanh is preferred to sigmoid in practice because its output is zero-centered (rather than centered on 0.5).
    
    January 11, 2016 at 5:12 pm
  - tegan says:
    
    @Olexa @Christopher
    If your weights are [-1,1] though, if you use the sigmoid aren’t you squashing and potentially losing information at each output? I think I had the same read as Christopher of Michael Nielsen’s textbook (http://cs231n.github.io/neural-networks-1/#actfun)
    
    January 11, 2016 at 5:37 pm
  - Olexa Bilaniuk says:
    
    @Tegan Weights are not restricted to [-1, 1] with tanh; only the activations are. tanh and sigmoid are also not per se destructive of information, despite squashing their input from (-infty, +infty) to (-1, 1) or (0, 1). Given that both are strictly increasing, each is a one-to-one mapping between the two ranges, and there exists an inverse for tanh that decompresses (-1, 1) to (-infty, +infty): it is `0.5*ln((1+x)/(1-x))`. Another example: exp() squashes (-infty, +infty) to (0, +infty), but surely exp can’t be information-destructive, since x == ln(exp(x)). Compare ReLU(x) = max(0, x), which squashes (-infty, +infty) to the similar range [0, +infty), yet *is* information-destructive.
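The invertibility argument can be checked numerically. The sketch below is illustrative only (the names `atanh_formula` and `relu` are not from the thread): it recovers x from tanh(x) with the inverse formula quoted above, and shows two different negative inputs collapsing to the same ReLU output.

```python
import math

def atanh_formula(y):
    # Inverse of tanh on (-1, 1): 0.5 * ln((1 + y) / (1 - y)).
    return 0.5 * math.log((1.0 + y) / (1.0 - y))

def relu(x):
    return max(0.0, x)

x = 1.7
recovered = atanh_formula(math.tanh(x))
print("tanh round trip error:", abs(recovered - x))

# Two distinct inputs map to the same output, so ReLU cannot be inverted.
print("relu(-1.0) == relu(-2.0):", relu(-1.0) == relu(-2.0))
```

One caveat, stated as an aside rather than as part of the original argument: the one-to-one property holds in exact arithmetic. In finite-precision floating point, tanh of a large enough input rounds to exactly 1, at which point the input can no longer be recovered.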
    
    January 11, 2016 at 5:50 pm
  - tegan says:
    
    @Olexa
    So do you think it makes sense to say that compression is happening in both cases and, because it’s from (-inf, +inf) to either [0, 1] or [-1, 1], the difference between 0 and -1 is really small relative to -inf, and this is why there’s not much difference between sigmoid and tanh?
    In my own experience, tanh usually converges faster. The explanation given by Michael Nielsen about non-zero-centeredness causing zig-zagging of sigmoid units seemed to explain this for me. Have you had different experiences, or do you know of results showing that sigmoid and tanh are more equivalent?
    
    January 11, 2016 at 9:42 pm
  - Olexa Bilaniuk says:
    
    @Tegan Tanh and sigmoid have roughly the same behaviour with respect to things like derivatives and loss functions. However, tanh has the nice property of being, in value, symmetric about 0, so I agree with Nielsen that it is preferable.
    
    I think the exact size of the range you squash to is not of any real importance, since the activations will be multiplied by whatever weight is in the next layer, and the learner will simply scale its learned weights accordingly. What’s notionally more important with tanh/sigmoid is that you go from an infinitely-large input range to a bounded output range; You aggregate large sums of almost unrestricted size into a single, bounded measure that can be more easily manipulated and compared.
    
    January 11, 2016 at 10:02 pm
- Olivier Mastropietro says:
  
  @”what are the current results in trying to model a realistic biological neuron that receives spikes and respond to a variation in spike frequency ?”
  
  I do not know about results, but you might find a recent paper interesting: An Objective Function for Spike-Timing Dependent Plasticity by Yoshua et al., http://arxiv.org/pdf/1509.05936v1.pdf
  
  January 11, 2016 at 3:35 am
- tegan says:
  
  @TFJGeorge
  Regarding biological neurons:
  – you may want to read more about synapses and neurotransmitters, e.g. http://www.biologyreference.com/Mo-Nu/Neuron.html … NNs are not a direct analogue of biological brain matter, but approximately speaking, I’d say tied weights could be represented by astrocyte-mediated calcium gradients, for example, or other neurotransmitters which affect groups of neurons and influence their excitability (e.g. nitric oxide gas)
  – this is a huge field of research. There are many, many spiking neuron models, many of which are quite accurate for certain cell types and situations. Do you have a more specific interest?
  
  January 11, 2016 at 5:28 pm
assyatrofimov says:

Questions regarding the Feed Forward Networks:
– I understand ReLU is recommended by default as an activation function. Is the activation function seen as a type of hyper-parameter? Should this be optimized when constructing and training the network?

– Context: from what I read, it seems the more precise the cost function is, the better the model will learn; clearly defining the cost seems to be a good way to go. But given the fact that the training is done on a sample of the actual distribution the network is trying to model by generalization, is it possible that defining a very precise cost function will harm the model and lead to overfitting the data?

January 10, 2016 at 11:56 pm
- Julien St-Pierre Fortin says:
  
  From what I have read, MLP hyper-parameters are not optimized by training algorithms such as stochastic gradient descent. These algorithms optimize model parameters (i.e. the weights and biases attached to each neuron’s inputs/outputs), whereas hyper-parameters that control capacity (e.g. the width and depth of the network) are usually tuned manually.
  
  I have found useful information about hyper-parameters here: http://deeplearning.net/tutorial/mlp.html#tips-and-tricks-for-training-mlps
  
  January 11, 2016 at 4:22 pm
- tegan says:
  
  I’m not sure you can say that the function itself is a hyperparameter (since it has parameters too?) but this is just semantics… I think the point about activation function choice influencing the model is valid, and I’d be interested in reading more about the impact of activation function choice on e.g. generalization error/capacity, as you mention, or also about learning the activation function.
  
  January 11, 2016 at 5:20 pm
- Olexa Bilaniuk says:
  
  I certainly see the choice of activation function as a categorical (non-numeric) hyperparameter. After all, just like other knobs that affect the structure of the neural network’s architecture, such as the number of layers and the choice of their type (convolution, pooling, fully-connected, etc.), the choice of activation function is made a priori, before training.
  
  Training is then performed, and the best setting of the hyperparameters is selected according to the trained model’s performance with that setting on the validation set.
  
  January 11, 2016 at 5:35 pm
- Christopher Beckham says:
  
  I can imagine the activation function being treated as a hyperparameter, but I’ve never done any experiments personally on this, so I don’t know whether that kind of thing would result in big gains. Maybe there are some papers out there testing this out? This looks pretty interesting, but I haven’t read it yet: http://arxiv.org/pdf/1412.6830v3.pdf
  
  Speaking of relus, there is a “leaky relu” (https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) where the alpha value can be tuned (“leakiness factor”).
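For reference, a minimal Python sketch of the leaky ReLU mentioned above. The default `alpha=0.01` is an assumption chosen for illustration, not a value taken from the thread.

```python
def leaky_relu(x, alpha=0.01):
    # Identity for positive inputs, a small slope alpha for the rest.
    return x if x > 0 else alpha * x

def d_leaky_relu(x, alpha=0.01):
    # The gradient never becomes exactly 0 as long as alpha > 0.
    return 1.0 if x > 0 else alpha

for x in (-2.0, -0.1, 0.5, 3.0):
    print(x, leaky_relu(x), d_leaky_relu(x))
```

Setting `alpha=0` recovers the standard ReLU, so alpha can be treated as one more hyperparameter to tune.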
  
  In regards to your second question, this is where regularisation would come in, where you have an extra term on your loss function that computes e.g. the squared sum of the weights of the model, multiplied by a lambda term (ughh, another hyperparameter!). So what is kinda happening is that you’re constraining the “flexibility” of the model so that it tries not to overfit the data. See http://images.slideplayer.com/26/8753927/slides/slide_10.jpg
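As a rough illustration of the regularised loss described above, here is a minimal Python sketch. The names `l2_penalty`, `regularised_loss` and `lam` are illustrative, and the data loss is passed in as a plain number rather than computed from a model.

```python
def l2_penalty(weights, lam):
    # lam times the sum of squared weights.
    return lam * sum(w * w for w in weights)

def regularised_loss(data_loss, weights, lam):
    # Total cost = fit to the training data + penalty on large weights.
    return data_loss + l2_penalty(weights, lam)

weights = [1.0, -2.0]
for lam in (0.0, 0.1, 1.0):
    print(lam, regularised_loss(0.5, weights, lam))
```

A larger `lam` makes large weights more expensive, which restricts the model’s flexibility; `lam = 0` recovers the unregularised loss.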
  
  January 11, 2016 at 5:37 pm
- Faruk Ahmed says:
  
  For the second part, it might be useful to consider that capacity control (or a prior on parameters) is also included as part of the cost function definition. I think that one would like to be as precise as possible when defining a task loss.
  
  January 11, 2016 at 6:54 pm
  - assyatrofimov says:
    
    Yes, that’s exactly my point. But as Christopher pointed out, regularization should take care of overfitting issues.
    
    January 12, 2016 at 3:39 am
tlesort says:

Hi, 🙂
I have a question about training a NN.
Is it possible to give your NN some pre-trained layers in order to reduce the training time?
I am thinking of image analysis, for example. As far as I understand the process, the first layer recognizes basic shapes, which are the same whatever set of images you have.
Instead of training all the layers, we could give the model a pre-trained first layer and only optimize the other ones.
Does it make sense?
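The idea in this question can be sketched in a few lines of plain Python. Everything below is a toy under stated assumptions: the “pre-trained” first layer `W1` is just a random matrix standing in for real pre-trained weights, the data are synthetic, and only the output weights `w2` are updated by gradient descent.

```python
import random

random.seed(0)

def hidden(x, W1):
    # First layer with ReLU; W1 holds one row of weights per hidden unit.
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]

def predict(x, W1, w2):
    return sum(w * h for w, h in zip(w2, hidden(x, W1)))

def mse(data, W1, w2):
    return sum((predict(x, W1, w2) - y) ** 2 for x, y in data) / len(data)

# Synthetic regression data: y = x0 - x1.
data = []
for _ in range(50):
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    data.append((x, x[0] - x[1]))

# Hypothetical stand-in for a pre-trained first layer; it is never updated.
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(6)]
W1_frozen_copy = [row[:] for row in W1]
w2 = [0.0] * 6

loss_before = mse(data, W1, w2)
lr = 0.02
for _ in range(200):
    grad = [0.0] * len(w2)
    for x, y in data:
        h = hidden(x, W1)
        err = sum(w * hj for w, hj in zip(w2, h)) - y
        for j, hj in enumerate(h):
            grad[j] += 2.0 * err * hj / len(data)
    # Only the second layer receives a gradient step.
    w2 = [w - lr * g for w, g in zip(w2, grad)]
loss_after = mse(data, W1, w2)

print("loss before:", loss_before, "loss after:", loss_after)
```

With the first layer frozen, each step only has to compute and apply gradients for the remaining parameters, which is where the hoped-for saving in training time would come from. Whether the frozen features are good enough for a new dataset is a separate question.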

January 11, 2016 at 3:47 am
- tegan says:
  
  Greedy pre-training, especially for RBMs and stacked autoencoders, is definitely something that is done (see this NIPS paper from a few years ago: http://papers.nips.cc/paper/3048-greedy-layer-wise-training-of-deep-networks.pdf and also this one explaining why it could be a good idea: http://www.jmlr.org/papers/volume11/erhan10a/erhan10a.pdf)
  
  This quora page has some discussion about when it is a good idea to pretrain: https://www.quora.com/When-does-unsupervised-pre-training-improve-classification-accuracy-for-a-deep-neural-network-When-does-it-not
  
  Cheers!
  
  January 11, 2016 at 5:17 pm
- nyfbber says:
  
  Hi, some “deep” models (which have several hidden layers), for example stacked auto-encoders, do employ a layer-wise pre-training process, i.e. each hidden layer is pre-trained one by one and then the whole network is fine-tuned with traditional back-propagation. After pre-training, each hidden layer can be seen as a hidden abstract representation (as you said, the basic shapes of images, etc.) of the raw input feature vector in the input layer.
  
  January 11, 2016 at 6:54 pm
- assyatrofimov says:
  
  Interesting question!
  My guess is that the model would save time (perform better), since it already “pre-figured out” the lower hierarchy layers. I vote faster convergence!
  But then again, how is it different from training the model on another set of images? I mean, if you are correct, and in fact all image recognition models have the same lower level layers, you do need to train them on the whole thing to obtain the lower layers…. no?
  
  January 11, 2016 at 7:06 pm
  - Christopher Beckham says:
    
    I generally think of unsupervised pre-training as a trick to get your model into a (potentially) better region of parameter space, rather than a technique to speed up training, and I think I see your confusion here. Perhaps if you pre-trained a neural network (i.e. trained an autoencoder) and then used it as a “base” to train several different types of (supervised) neural networks, pre-training would make sense as a technique to speed up model training in general, since you’re not starting from square one every time.
    
    January 14, 2016 at 12:00 am
JSS says:

What about the choice of the input data? In the case of an MLP with single hidden layer, the input data needs to be carefully chosen by hand, often “massaging” / preprocessing it using expert knowledge.

In a deep learning network we can send “raw data” and let the model find appropriate representations. But still, what are the stakes in choosing the right kind of “raw data”? I guess that here (as for a two-layer MLP) there is a tradeoff between the input dimension, the size of the dataset and the capacity of the model, but what is different in the case of deep models?

January 11, 2016 at 6:19 pm
- Florian Bordes says:
  
  I think the tradeoff isn’t really different with deep models. It’s only a matter of what your problem is. If your goal is to read digits on paper, you don’t really need a model with a very big capacity, because the data you are using will always be of the same kind. You can have a deep model that is specialized in facial recognition and trained only on faces; in this case you can’t use just any kind of “raw data”. If you have a very big dataset with 1000 classes, you will need a very deep model with a capacity corresponding to this dataset. Even in the case of a deep network, there is a relationship between the capacity, the dataset and your goal. A last example: if you have a dataset of natural images, you can have a deep network that detects whether there is a car in these images. However, if you want to know the brand and the model of the car, then in the case of supervised training you’ll have to add labels to the dataset, maybe add new images, and modify the network and its capacity accordingly.
  
  January 17, 2016 at 9:03 pm
Olexa Bilaniuk says:

My questions:

1. Why don’t we use more “exotic” saturating neurons? For instance, a function g(x) defined by taking -1/x, chopping out the region between [-1, 1] and connecting the two parts? This function has the same output range as tanh but has much higher-valued gradients far from the origin (proportional to 1/x^2 instead of exp(-x)), and is thus capable of escaping far more quickly from a given extreme than sigmoid.

2. How does one design an output layer (like softmax) to report not just the believed probability distribution, but also independently its uncertainty about it?

January 11, 2016 at 6:42 pm
Chinna says:

My question:

1) What is special about the softmax function, if anything? If I choose to use, let’s say, a 2nd-order Taylor approximation e^x ≈ 1 + x + x^2/2 in its place, won’t it work?
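One way to explore this question is to try it. The Python sketch below (the name `taylor_softmax` is illustrative) normalises the 2nd-order polynomial in place of exp. It relies on the fact that 1 + x + x^2/2 = ((x + 1)^2 + 1)/2 is strictly positive, so the result is still a valid probability distribution.

```python
import math

def softmax(z):
    # Standard softmax, shifted by max(z) for numerical stability.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def taylor_softmax(z):
    # Replace exp(x) by its 2nd-order Taylor polynomial around 0.
    terms = [1.0 + v + 0.5 * v * v for v in z]
    total = sum(terms)
    return [t / total for t in terms]

z = [2.0, -1.0, 0.5]
print("softmax:       ", softmax(z))
print("taylor softmax:", taylor_softmax(z))

# The polynomial has its minimum at x = -1, so it is not monotonic:
print("logits [-3, -1]:", taylor_softmax([-3.0, -1.0]))
```

So the substitution does produce a distribution, but unlike exp the polynomial is not increasing everywhere: a logit of -3 receives more probability than a logit of -1. This sketch does not settle how much that matters for training.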

January 11, 2016 at 6:53 pm
Yifan Nie says:

Hi, I have a question about the training process of a NN. A NN can have a lot of hyper-parameters, e.g. the hidden layer size, the number of training epochs, the regularization coefficient (weight decay coefficient), and so on. I learned in another course, IFT6390, that we should determine the best hyper-parameters by evaluating the trained model on the validation set. But if we have these 3 hyper-parameters, should we try every possible combination of them (a lot of possible values to test), or is there a faster way to determine the best hyper-parameter setting? Thanks

January 11, 2016 at 7:29 pm
- assyatrofimov says:
  
  Hi Yifan
  From what I gather, it is more efficient to sample hyper-parameter values at random than to test out all combinations.
  
  Reference: bergstra12a.pdf
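A minimal sketch of random search over the three hyper-parameters from the question. The function `validation_error` is a made-up stand-in for “train the model, then evaluate it on the validation set”, and the search ranges are assumptions chosen only for illustration.

```python
import math
import random

def validation_error(hidden_size, epochs, weight_decay):
    # Hypothetical surrogate: a real run would train a network here
    # and return its error on the validation set.
    return ((math.log10(weight_decay) + 3.0) ** 2
            + ((hidden_size - 200) / 100.0) ** 2
            + ((epochs - 30) / 20.0) ** 2)

def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        config = {
            "hidden_size": rng.randint(50, 500),
            "epochs": rng.randint(5, 50),
            # Sample the weight decay on a log scale.
            "weight_decay": 10 ** rng.uniform(-6, -1),
        }
        err = validation_error(**config)
        if best is None or err < best[0]:
            best = (err, config)
    return best

err, config = random_search(20)
print("best validation error:", err)
print("best configuration:", config)
```

The number of trials is fixed in advance and does not grow with the number of hyper-parameters, whereas a full grid with k values per hyper-parameter needs k^3 runs for three of them.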
  
  January 12, 2016 at 3:33 am
Faruk Ahmed says:

I’m wondering what the relevance of non-identifiability of neural network parameters is from an optimization standpoint.

January 11, 2016 at 7:53 pm
Ph.C. says:

@”what are the current results in trying to model a realistic biological neuron that receives spikes and respond to a variation in spike frequency ?”

Here’s Chris Eliasmith’s paper discussing deep spiking neural networks:

http://arxiv.org/abs/1510.08829

January 11, 2016 at 8:30 pm
tegan says:

I mentioned this paper in class and said I would post it – it talks about ReLUs being a summation of sigmoids. http://www.cs.toronto.edu/~fritz/absps/reluICML.pdf
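The relation discussed in that paper, as I understand it, is that a sum of sigmoid units sharing the same weights but with biases shifted by -0.5, -1.5, -2.5, ... is close to softplus, log(1 + e^x), which the rectifier max(0, x) in turn approximates. A small Python check, with illustrative function names and an arbitrary cut-off of 50 units:

```python
import math

def sigmoid(x):
    # Numerically stable logistic sigmoid.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def softplus(x):
    return math.log1p(math.exp(x))

def stepped_sigmoid_sum(x, n_units=50):
    # Sum of sigmoids with offsets 0.5, 1.5, 2.5, ...
    return sum(sigmoid(x - i + 0.5) for i in range(1, n_units + 1))

for x in (-3.0, 0.0, 2.0, 5.0):
    print(f"x={x}: sum={stepped_sigmoid_sum(x):.4f} "
          f"softplus={softplus(x):.4f} relu={max(0.0, x):.4f}")
```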

January 11, 2016 at 9:31 pm
tegan says:

We talked in class about wanting nice, predictable gradients – i.e. the steps will have the same behaviour regardless of where you are on the surface. But would there be cases (e.g. if we know something about our output space?) where this uniformity would not be desirable?

January 11, 2016 at 9:33 pm
yiulau says:

I see a potential typo. On page 170 of the course textbook we have

\tilde{p}(y) = \exp(yz)

p(y) = \frac{\exp(yz)}{\sum_{z'=0}^{1} \exp(z'y)} = \sigma((2y-1)z).

In the first equality of the second line, the normalizing constant should be obtained by summing over y, so the sum should be changed to
\sum_{y'=0}^{1} \exp(zy') instead. Left in its current form, the second equation is incorrect.

January 11, 2016 at 10:34 pm
- X. Willhem says:
  
  It is y’ in the book now. However, in any case, I don’t see how you obtain \sigma((2y-1)z)) from the left part. Could you detail, please ?
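For what it is worth, the step can be worked out by treating the two values of y separately. This is a sketch assuming y \in \{0, 1\} and \sigma(a) = 1/(1 + e^{-a}), with the sum taken over y':

```latex
\begin{align*}
\sum_{y'=0}^{1} \exp(y' z) &= 1 + e^{z} \\
y = 1:\quad p(1) &= \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}}
                  = \sigma(z) = \sigma\big((2 \cdot 1 - 1)\, z\big) \\
y = 0:\quad p(0) &= \frac{1}{1 + e^{z}}
                  = \sigma(-z) = \sigma\big((2 \cdot 0 - 1)\, z\big)
\end{align*}
```

The factor (2y - 1) simply maps y = 1 to +1 and y = 0 to -1, so both cases are covered by the single expression \sigma((2y-1)z).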
  
  January 7, 2017 at 4:20 am
Faruk Ahmed says:

Anyone have a clear intuition about why softplus performs worse than ReLU on some tasks (Glorot, 2011)? Is it a good thing to be more (piecewise) linear near zero activations?

January 14, 2016 at 12:24 am

IFT6266 H-2016 Deep Learning

Deep Learning, graduate class at U. Montreal
