Questions and answers

You can submit your questions regarding Theano, Blocks and Fuel, and the class project, as well as general revision questions in the lead-up to the exam, on this page.

Similarly to the Q&A style in the rest of the course, please help out your fellow students if you know the answer to their questions.


63 thoughts on “Questions and answers”

  1. While following the instructions for installing the dataset, I do not understand what is meant by this passage: In order for Fuel to know where to look for its data, the data_path configuration variable has to be set inside ~/.fuelrc. It’s expected to be a sequence of paths separated by an OS-specific delimiter (: for Linux and OSX, ; for Windows):

    # ~/.fuelrc
    data_path: "/first/path/to/my/data:/second/path/to/my/data"

    Should I create a file called .fuelrc and put this content in that file?


  2. Then I followed this:

    cd $HOME
    mkdir fuel_data # Create a directory in which Fuel can store its data
    echo "data_path: \"$HOME/fuel_data\"" > ~/.fuelrc # Create the Fuel configuration file
    cd fuel_data # Go to the data directory

    I do not know why I cannot use the fuel-download command. It seems that I did not install Fuel successfully. Is there a way to uninstall Fuel?


  3. Florian Bordes says:

    If you have installed Fuel with pip, the executables are in ~/.local/bin/. So you can add this folder to your PATH or run fuel-download as ~/.local/bin/fuel-download.
    It’s not mandatory to have a .fuelrc; if you don’t have one, fuel-download will simply download the files into the current folder.


    • bartvanmerrienboer says:

      Just to clarify: The Fuel scripts are only in ~/.local/bin if you’re not using Anaconda; otherwise the file is most likely in ~/anaconda2/bin or ~/anaconda3/bin.

      Secondly, fuel-download will always download the file to the current working directory. Likewise, fuel-convert will read and write files to the current working directory by default. The data path in ~/.fuelrc is used when you load the dataset in Python (e.g. train = DogsVsCats(('train',))), so you still need to set it.
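
To illustrate how a delimiter-separated data path can be searched at load time, here is a minimal sketch. `find_in_data_path` is a hypothetical helper written for this example, not Fuel's actual implementation, and the file name is only a placeholder; the sketch assumes nothing beyond the delimiter convention described above.

```python
import os
import tempfile


def find_in_data_path(filename, data_path):
    """Return the first existing file named `filename` found in `data_path`.

    `data_path` is a string of directories separated by os.pathsep
    (':' on Linux and OS X, ';' on Windows). Hypothetical helper,
    not part of Fuel.
    """
    for directory in data_path.split(os.pathsep):
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    raise IOError("{} not found in {}".format(filename, data_path))


# Usage: two directories, the file only exists in the second one.
first_dir = tempfile.mkdtemp()
second_dir = tempfile.mkdtemp()
with open(os.path.join(second_dir, "dogs_vs_cats.hdf5"), "w") as f:
    f.write("placeholder")

data_path = os.pathsep.join([first_dir, second_dir])
found = find_in_data_path("dogs_vs_cats.hdf5", data_path)
```

The directories are tried in order, which is why a shared folder can be listed ahead of a personal one.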


      • Florian Bordes says:

        Thanks for the explanation. I have a question about the HDF5 file. I am using fuel-convert, and after 40% I have a file of 4.3 GB, so I suppose the final file will be bigger than 10 GB. Is that normal? Is there a way to reduce the file size? (Maybe by using another data type for the NumPy array?)


      • bartvanmerrienboer says:

        Yes, the final file is about 16GB. There’s no easy way to reduce that file size: the images are already stored using unsigned 8-bit integers, which is the smallest data type possible. The reason it’s so big compared to the original files is that the data is stored without compression, while the original files were JPEG files. (You could store the data in JPEG, but you would have to decompress each batch of images during training, which could be slow.) If you don’t have 16GB of memory, don’t worry. HDF5 is pretty good at reading things from disk for large files and caching intelligently.


  4. I do not know why I get this error when I use fuel-download:
    home/2014/dchen53/fuel_data/ 99% [] ETA: 0:00:01 2.3/home/2014/dchen53/fuel_data/ 99% [] ETA: 0:00:00 2.3 MiB/s
    Traceback (most recent call last):
    File “/home/2014/dchen53/anaconda2/bin/fuel-download”, line 9, in
    load_entry_point(‘fuel==0.1.1’, ‘console_scripts’, ‘fuel-download’)()
    File “/home/2014/dchen53/anaconda2/lib/python2.7/site-packages/fuel/bin/”, line 62, in main
    File “/home/2014/dchen53/anaconda2/lib/python2.7/site-packages/fuel/downloaders/”, line 142, in default_downloader
    download(url, file_handle)
    File “/home/2014/dchen53/anaconda2/lib/python2.7/site-packages/fuel/downloaders/”, line 68, in download
    r = requests.get(url, stream=True)
    File “/home/2014/dchen53/anaconda2/lib/python2.7/site-packages/requests/”, line 69, in get
    return request(‘get’, url, params=params, **kwargs)
    File “/home/2014/dchen53/anaconda2/lib/python2.7/site-packages/requests/”, line 50, in request
    response = session.request(method=method, url=url, **kwargs)
    File “/home/2014/dchen53/anaconda2/lib/python2.7/site-packages/requests/”, line 468, in request
    resp = self.send(prep, **send_kwargs)
    File “/home/2014/dchen53/anaconda2/lib/python2.7/site-packages/requests/”, line 597, in send
    history = [resp for resp in gen] if allow_redirects else []
    File “/home/2014/dchen53/anaconda2/lib/python2.7/site-packages/requests/”, line 195, in resolve_redirects
    File “/home/2014/dchen53/anaconda2/lib/python2.7/site-packages/requests/”, line 576, in send
    r = adapter.send(request, **kwargs)
    File “/home/2014/dchen53/anaconda2/lib/python2.7/site-packages/requests/”, line 433, in send
    raise SSLError(e, request=request)
    requests.exceptions.SSLError: EOF occurred in violation of protocol (_ssl.c:590)



    I don’t know if he’s posted it somewhere, but Olexa’s post on numpy’s einsum is excellent. Einsum is a really general way to do linear algebra things, where you only have to think about the dimensions of the matrices you have, and specify that in a string.

    Not sure if this picture will show up, but it shows how it allows you to implement forward prop and backprop in less than a dozen lines.
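
In case the picture does not show up, here is a minimal sketch of the same idea. It is not taken from the post mentioned above: the two-layer tanh network, the squared-error loss and all variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
batch, n_in, n_hid, n_out = 4, 5, 3, 2

x = rng.randn(batch, n_in)
t = rng.randn(batch, n_out)           # regression targets
W1 = rng.randn(n_in, n_hid) * 0.1
W2 = rng.randn(n_hid, n_out) * 0.1

# Forward pass: index letters name the axes (b=batch, i=input, h=hidden, o=output).
h = np.tanh(np.einsum('bi,ih->bh', x, W1))
y = np.einsum('bh,ho->bo', h, W2)
loss = 0.5 * np.sum((y - t) ** 2)

# Backward pass: a letter that is absent from the output is summed over.
dy = y - t
dW2 = np.einsum('bh,bo->ho', h, dy)
dh = np.einsum('bo,ho->bh', dy, W2)
dW1 = np.einsum('bi,bh->ih', x, dh * (1.0 - h ** 2))
```

Each gradient has the same index string as the array it belongs to (`ih` for `W1`, `ho` for `W2`), which is a quick way to check the shapes.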


  6. I do not know why I get this error when I am doing the Blocks tutorial:

    import blocks

    from blocks.bricks import bn
    Traceback (most recent call last):
    File “”, line 1, in
    File “/home2/ift6ed05/anaconda2/lib/python2.7/site-packages/blocks/bricks/”, line 3, in
    from .bn import (BatchNormalization, SpatialBatchNormalization,
    File “/home2/ift6ed05/anaconda2/lib/python2.7/site-packages/blocks/bricks/”, line 7, in
    from theano.tensor.nnet import bn
    ImportError: cannot import name bn


  7. On Hades, I encountered this message after trying to use GPUs with theano :

    WARNING (theano.tensor.blas): Failed to import scipy.linalg.blas, and Theano flag blas.ldflags is empty. Falling back on slower implementations for dot(matrix, vector), dot(vector, matrix) and dot(vector, vector) ( cannot open shared object file: No such file or directory)

    I didn’t change any of Theano’s settings; I ran the code after loading the module with “module add theano” on the cluster.


  8. Hello Bart, you sent us an email about moving our Dogs vs. Cats dataset to the common folder. I followed the instructions but get this error, and I do not know why:
    from fuel.datasets.dogs_vs_cats import DogsVsCats
    Traceback (most recent call last):
    File “”, line 1, in
    File “/home2/ift6ed05/anaconda2/lib/python2.7/site-packages/fuel/”, line 2, in
    from fuel.config_parser import config # noqa
    File “/home2/ift6ed05/anaconda2/lib/python2.7/site-packages/fuel/”, line 209, in
    File “/home2/ift6ed05/anaconda2/lib/python2.7/site-packages/fuel/”, line 129, in load_yaml
    for key, value in yaml.safe_load(f).items():
    AttributeError: ‘str’ object has no attribute ‘items’
    Here is my .fuelrc content: data_path:”/home2/ift6ed05/fuel_data”:”/home2/COMMON”
    Here is my .bashrc content:
    export PATH=”/home2/ift6ed05/anaconda2/bin:$PATH”
    export PATH=”$PATH:$HOME/ffmpeg-2.8.6/bin”
    export FUEL_DATA_PATH=/home/COMMON:$HOME/fuel_data
    And I did source .bashrc after adding this line.


    • And if I run this command,
      I get similar error:
      fuel-download dogs_vs_cats
      Traceback (most recent call last):
      File “/home2/ift6ed05/anaconda2/bin/fuel-download”, line 9, in
      load_entry_point(‘fuel==0.1.1’, ‘console_scripts’, ‘fuel-download’)()
      File “/home2/ift6ed05/anaconda2/lib/python2.7/site-packages/setuptools-19.6.2-py2.7.egg/pkg_resources/”, line 547, in load_entry_point
      File “/home2/ift6ed05/anaconda2/lib/python2.7/site-packages/setuptools-19.6.2-py2.7.egg/pkg_resources/”, line 2719, in load_entry_point
      File “/home2/ift6ed05/anaconda2/lib/python2.7/site-packages/setuptools-19.6.2-py2.7.egg/pkg_resources/”, line 2379, in load
      File “/home2/ift6ed05/anaconda2/lib/python2.7/site-packages/setuptools-19.6.2-py2.7.egg/pkg_resources/”, line 2385, in resolve
      File “/home2/ift6ed05/anaconda2/lib/python2.7/site-packages/fuel/”, line 2, in
      from fuel.config_parser import config # noqa
      File “/home2/ift6ed05/anaconda2/lib/python2.7/site-packages/fuel/”, line 209, in
      File “/home2/ift6ed05/anaconda2/lib/python2.7/site-packages/fuel/”, line 129, in load_yaml
      for key, value in yaml.safe_load(f).items():
      AttributeError: ‘str’ object has no attribute ‘items’


    • Florian Bordes says:

      I think it’s :
      export FUEL_DATA_PATH=/home2/COMMON
      instead of /home/COMMON; the directory /home/COMMON doesn’t exist.
      Or if you want to use the fuelrc file, it’s:
      data_path: “/home2/COMMON:/home2/ift6ed05/fuel_data”


      • bartvanmerrienboer says:

        I missed this comment earlier, but in case you still have the same error: It seems to say there is something wrong with your .fuelrc file. It should contain something like data_path: /home2/COMMON but instead it seems to contain just a string.


      • Hello Bart, I still had the same problem, so I reinstalled Anaconda, Fuel and Blocks, which solved it. Sadly, I am now downloading the Fuel data to my home folder. I hope that does not cause much trouble.


    • bartvanmerrienboer says:

      This is called checkpointing, and there are many different ways it can be achieved, to varying degrees. With neural network training, you generally just want to make sure that you store your parameters every N steps, so that you can resume training from those parameters when things go wrong. Blocks implements this using the Checkpoint extension, which will allow you to resume training easily.
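
The "store your parameters every N steps" idea can be sketched in plain Python as follows. This is a generic illustration, not the Blocks Checkpoint extension: `save_checkpoint`, `load_checkpoint`, the file name and the toy update are all hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, step, params):
    """Write to a temporary file first, then rename, so a crash never leaves a half-written file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump({"step": step, "params": params}, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)


checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pkl")
save_every = 100  # the "N" in "every N steps"

# Resume if a checkpoint exists, otherwise start from scratch.
if os.path.exists(checkpoint_path):
    state = load_checkpoint(checkpoint_path)
else:
    state = {"step": 0, "params": {"w": 0.0}}

step, params = state["step"], state["params"]
while step < 250:
    params["w"] += 0.01  # stand-in for one real parameter update
    step += 1
    if step % save_every == 0:
        save_checkpoint(checkpoint_path, step, params)

resumed = load_checkpoint(checkpoint_path)
```

If the job dies at step 250, restarting it picks up from step 200, so at most `save_every` steps of work are lost.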


  9. I encountered this problem when trying to get data from a server stream.

    In one notebook I set up a minimalistic server as follows:

    from fuel.datasets.dogs_vs_cats import DogsVsCats
    from fuel.streams import DataStream
    from fuel.schemes import ShuffledScheme
    train = DogsVsCats(('train',), subset=slice(0, 20000))
    stream = DataStream(train,
    iteration_scheme=ShuffledScheme(train.num_examples, 128))
    from fuel.server import start_server

    Then in another notebook I initiate an instance of Serverstream and try to get one batch from the server as follows:

    from fuel.streams import ServerDataStream
    data_stream = ServerDataStream(('image_features','target'),False)
    iterator = data_stream.get_epoch_iterator()
    one_batch = next(iterator)

    which then returns this error :

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-6-82a2181a96af> in <module>()
          2 data_stream = ServerDataStream(('image_features','target'),False)
          3 iterator = data_stream.get_epoch_iterator()
    ----> 4 one_batch = next(iterator)
    /Users/patricklau/anaconda3/lib/python3.5/site-packages/fuel/ in __next__(self)
         30             data = self.data_stream.get_data(next(self.request_iterator))
         31         else:
    ---> 32             data = self.data_stream.get_data()
         33         if self.as_dict:
         34             return dict(zip(self.data_stream.sources, data))
    /Users/patricklau/anaconda3/lib/python3.5/site-packages/fuel/ in get_data(self, request)
        231         if not self.connected:
        232             self.connect()
    --> 233         data = recv_arrays(self.socket)
        234         return tuple(data)
    /Users/patricklau/anaconda3/lib/python3.5/site-packages/fuel/ in recv_arrays(socket)
         73         data = socket.recv()
         74         buf = buffer_(data)
    ---> 75         array = numpy.frombuffer(buf, dtype=numpy.dtype(header['descr']))
         76         array.shape = header['shape']
         77         if header['fortran_order']:
    ValueError: cannot create an OBJECT array from memory buffer


    • bartvanmerrienboer says:

      Judging from the error, you are trying to use the server to send object arrays i.e. NumPy arrays that contain Python objects instead of sending an n-dimensional tensor.

      You will need to crop the images to be of the same size, so that you can send a single NumPy array with the shape (batch, channel, height, width) instead of sending a list of arrays where height and width are different, which is what you’re currently trying to do.
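
As a rough sketch of that fix in plain NumPy: `center_crop` below is a hypothetical helper (not a Fuel transformer), the images are random stand-ins, and the crop assumes every image is at least as large as the target size.

```python
import numpy as np


def center_crop(image, size):
    """Crop a (channel, height, width) array to (channel, size, size) around its center."""
    _, height, width = image.shape
    top = (height - size) // 2
    left = (width - size) // 2
    return image[:, top:top + size, left:left + size]


# Three fake RGB images with different heights and widths.
rng = np.random.RandomState(0)
images = [rng.randint(0, 256, size=(3, h, w)).astype(np.uint8)
          for h, w in [(40, 50), (64, 48), (35, 70)]]

# After cropping, every image has the same shape, so stacking yields one
# numeric array of shape (batch, channel, height, width).
batch = np.stack([center_crop(image, 32) for image in images])
```

The resulting `batch` has a plain `uint8` dtype rather than `object`, which is what a raw memory buffer can represent.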


  10. Is anyone training a feed-forward neural network with Batch Normalization? If so, what are your strategies to learn the validation-time mean and variance: Compute over full training set, or a random subset?

    Also, is anyone training a neural network with many BN layers? If so, how do you efficiently manufacture all of the “partial” neural networks (Input-to-BN1, Input-to-BN2, Input-to-BN3, …) and efficiently learn the validation-time mean and variance for each BN layer one after the other? Or are there other strategies to bypass this annoyance?


  11. Pingback: Questions | IFT6266 H-2016 Deep Learning

  12. I have a question about mini-batch training. If my training set is smaller than my validation set (say 5000 training examples and 45000 validation examples), should I use mini-batches on the validation set too? Otherwise I run into memory issues.


  13. I don’t want to put the question on one of the lectures from way back, so I will put it here.

    This question is about the initialization methodology proposed by Glorot and Bengio (2010) and its application in the context of convolutional networks.

    In the paper, it was suggested to initialize the weights of each layer of an MLP uniformly in [-r, r] with r = sqrt(6)/sqrt(n_i + n_(i+1)), where n_i is the number of units in the current layer and n_(i+1) the number of units in the next layer.

    How would we adapt it to conv nets ? In one of the deep learning tutorials with theano the initialization was implemented as

    n_i = num input feature maps * filter height * filter width
    n_(i+1) = num output feature maps * filter height * filter width / pooling_size

    Can someone explain where these numbers come from? It is not immediately obvious how to interpret a convolutional layer as an MLP layer in order to understand the initialization. First there is the issue of the number of dimensions, then there is that of converting a conv net to an MLP.


    • This confused me too, but I think I found clarity by looking at Caffe’s source code.

      This blog post talks about Xavier/Glorot initialization, and links to Caffe’s implementation.

      Therein, we learn from the comments at lines 137-140 that:

      It fills the incoming matrix by randomly sampling uniform data from [-scale, scale] where scale = sqrt(3 / fan_in) where fan_in is the number of input nodes. You should make sure the input blob has shape (num, a, b, c) where a x b x c = fan_in.


      • I don’t see how he makes it any clearer about the application in conv nets. In a convolutional network, what we think of as one sample (an image) is not one sample when the conv net is thought of as an mlp, otherwise we would take

        n_in = num input feature maps * num image row * num image column

        If we think of an image as many samples, each of which a local window the size of the filter (3 x 3, say), then we still leave

        n_out = num output feature maps * num filter row * num filter column / pool size

        unexplained. The formula above is essentially saying that we are looking at the convolution operation as mapping from R^(n_in) to R^(n_out), only that is not true.

        Take a simple example where the incoming feature maps are just the three channels of the original image. Assume there are only two output feature maps and we focus on W_1, a 2 x 2 filter that maps to the first output feature map.

        n_in = num input feature maps * num image row * num image column

        tells us that we are grouping the three 2×2 windows from the input channels as one sample and this gets mapped to 1 pixel in the output feature map.

        I just don’t see how we could explain n_out.


      • The formulas you give identify:
        1. The total number of inputs that contribute to any single output; This is called the fan-in.
        2. The total number of outputs to which any single input contributed; This is called the fan-out.

        Visualize a tensor of size (#InputMaps, Height, Width) with all elements = +1.0, as if generated by numpy.ones(). Also visualize a convolution filter of size (#OutputMaps, #InputMaps, FHeight, FWidth), whose every filter tap is also = +1.0.

        If you perform the convolution, you will find that the output tensor is not an array of +1.0’s of size approximately (#OutputMaps, Height, Width). Instead, every entry will be equal to #InputMaps * FHeight * FWidth. Because every filter-tap-by-image-pixel product is +1.0*+1.0 = +1.0, that proves that a large number of pixels contributed to this output. This number is called the fan-in, and that’s why fan-in is calculated as part of the computation of the normalization constant.

        A similar logic applies to the reverse direction in which gradients flow. We’re now interested in fan-out instead of fan-in, and there’s the additional complexity of pooling, which cuts down the effective number of outputs to which a single input is connected (the definition of fan-out).

        The square roots in the formulas relate to the behaviour of a sum of independent variables. Going back to the convolution filter of size (#OutputMaps, #InputMaps, FHeight, FWidth) above, imagine that instead of every filter tap and image pixel being = +1.0, each has a 50% chance of being +1.0 and a 50% chance of being -1.0. The variance of the sum of N independent random variables is the sum of the variances, so its standard deviation (the expected magnitude of the sum) will be the square root of that. The “gain” of the filter in the forward direction will thus be on the order of the square root of fan-in. The same sort of logic applies to the reverse direction and fan-out.

        Alternatively, you can view the computation of a filter’s output as having random walk behaviour. The filter-tap-by-image-pixel products from my example above are +1 or -1 with 50% probability each, and the sequential accumulation of partial products constitutes a random walk. Given that it’s a random walk, the expected distance from 0.0 (and thus, the magnitude of the activation) will be on the order of \sqrt{n} after n partial product accumulations. Therefore, after fan_in accumulations, the filter’s expected “gain” is proportional to sqrt(fan_in), and the normalization constant is the inverse of that. sqrt(2.0/(fan_in + fan_out)) is a compromise normalization constant between the forward and backwards directions.
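
A small numerical sketch of the quantities discussed in this thread. The layer sizes are arbitrary choices for the example; the fan-in and fan-out formulas are the ones quoted from the tutorial above, and the uniform bound is the sqrt(6)/sqrt(n_i + n_(i+1)) rule from the original question.

```python
import math
import numpy as np

n_input_maps, n_output_maps = 3, 8
f_height, f_width = 5, 5
pool_size = 2 * 2  # number of positions merged by a (2, 2) pooling

fan_in = n_input_maps * f_height * f_width
fan_out = n_output_maps * f_height * f_width // pool_size
bound = math.sqrt(6.0 / (fan_in + fan_out))

# All-ones image and all-ones filter: one "valid" output position
# accumulates exactly fan_in products of 1.0 * 1.0.
image = np.ones((n_input_maps, 12, 12))
filters = np.ones((n_output_maps, n_input_maps, f_height, f_width))
window = image[:, :f_height, :f_width]
one_output = np.sum(window * filters[0])

# Uniform initialization in [-bound, bound] for the whole filter bank.
rng = np.random.RandomState(0)
W = rng.uniform(-bound, bound, size=filters.shape)
```

A uniform distribution on [-bound, bound] has variance bound^2 / 3 = 2 / (fan_in + fan_out), so its standard deviation matches the sqrt(2.0/(fan_in + fan_out)) constant mentioned above.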


  14. Another convolutional network related question: in most of the literature, a convolution operation is followed by a non-linear activation function and then a downsample operation. However, in Theano’s deep learning tutorial it is actually implemented as a convolution->downsample->tanh sequence. Does it make any difference whether we downsample or apply the non-linear activation first? In this case it seems like the Theano tutorial implemented the conv layer incorrectly. But it raises the question of why the literature always constructs a conv layer by stacking convolution->non-linear activation->downsample sequences. Also, it doesn’t seem like there is a theoretical justification to favour one way over the other, does it?


    • If the pooling is a max-pooling layer and the non-linearity is monotonically increasing (ReLU, tanh, sigmoid and numerous others are), then the pooling and non-linearity layers can commute, because the locally maximal pre-activation before the non-linearity will correspond to a locally maximal activation after it.

      So if either order is equivalent, the decision is purely computational. Doing a (2,2) max-pooling and then applying a monotonically-increasing non-linearity costs (2*2) = 4x less in non-linearity evaluations than doing the reverse, yet they produce bitwise-identical results.

      The bad news: This doesn’t apply to avg-pooling. The good news: I can’t think of a single non-monotonically-increasing non-linearity in use.
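
The claim is easy to check numerically. The sketch below uses ReLU and a hand-written non-overlapping (2, 2) pooling; `pool_2x2` is a hypothetical helper for this example, not a Theano function.

```python
import numpy as np


def pool_2x2(x, reduce_fn):
    """Non-overlapping (2, 2) pooling of a (height, width) array with even sides."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    return reduce_fn(blocks, axis=(1, 3))


def relu(x):
    return np.maximum(x, 0.0)


rng = np.random.RandomState(0)
x = rng.randn(8, 8)

# Max-pooling commutes with a monotonically increasing non-linearity.
max_then_relu = relu(pool_2x2(x, np.max))   # 16 non-linearity evaluations
relu_then_max = pool_2x2(relu(x), np.max)   # 64 non-linearity evaluations

# Average pooling does not commute.
avg_then_relu = relu(pool_2x2(x, np.mean))
relu_then_avg = pool_2x2(relu(x), np.mean)
```

Comparing the results with `np.array_equal` gives `True` for the max-pooling pair and `False` for the average-pooling pair.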


  15. In class, I proposed for the Voice Synthesis project that, instead of directly generating the waveform and thus having to deal with continuity issues, one could try modulating a bank of sinusoids. Yoshua grokked what I had suggested and wrote it down on the board, but afterwards I came up with an improvement. If we always require two signals of each frequency, one phase-offset by 90 degrees from the other, then by an appropriate choice of the amplitude of each, one may always synthesize an arbitrary phase shift without an explicit phase parameter. I summarize this in my blog post here:
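
The trigonometric identity behind this suggestion, A sin(ωt + φ) = A cos(φ) sin(ωt) + A sin(φ) cos(ωt), can be checked numerically. This is only a sketch of the identity, not the contents of the blog post; the frequency, amplitude and phase values are arbitrary.

```python
import numpy as np

frequency = 440.0   # Hz, arbitrary choice for the example
amplitude = 0.8
phase = 1.2         # radians, the phase shift we want to synthesize

t = np.linspace(0.0, 0.01, 500)
omega_t = 2.0 * np.pi * frequency * t

# Amplitudes of the two fixed basis signals (a sine and its 90-degree-shifted copy).
a = amplitude * np.cos(phase)
b = amplitude * np.sin(phase)

from_pair = a * np.sin(omega_t) + b * np.cos(omega_t)
direct = amplitude * np.sin(omega_t + phase)

# The pair (a, b) encodes amplitude and phase without a separate phase parameter.
recovered_amplitude = np.hypot(a, b)
recovered_phase = np.arctan2(b, a)
```

A model would therefore only need to output the two amplitudes per frequency.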


  16. Part of the reason why I constructed my dataset with inpainted 256×256 images, yet use a 192×192 input, is to allow some wiggle room to rotate and translate without black patches appearing.

    But on the other hand, an argument can be made that these black patches should stay, because your neural network should learn to be robust to junk at the very border. It might be useful to fill that black area with random noise for that very reason.


      • Let’s try an answer.

        I’m going to start with the definition of L2 regularization given in the book (p. 231).
        L2 regularization means replacing J(w;X,y) with another cost function: \tilde{J}(w;X,y)=J(w;X,y)+\frac{\alpha}{2} w^Tw

        Now let’s see what it means to put a Gaussian prior over the parameters when you update them.
        You have the training inputs X and want to find the best parameters w given those inputs. If you do MAP estimation, you want to maximize over w:
        P(w|X) = \frac{P(X|w)P(w)}{P(X)}

        Since P(X) does not depend on w, this is the same as maximizing over w: \log P(X|w) + \log P(w)

        At this point you can identify the terms:
        \log P(X|w) is the log-probability of the inputs given the parameters, so -\log P(X|w) plays the role of the cost function J(w;X,y).

        \log P(w) is the log of the prior you have over the weights w.
        If you have a Gaussian prior over w, for example w \sim \mathcal{N}(0, \frac{I}{\alpha}), then
        \log P(w) = -\frac{\alpha}{2}w^Tw + \text{const}
        (you throw away the constant because it doesn’t change anything when you maximize, and it vanishes when you differentiate)

        So maximizing \log P(X|w) + \log P(w) is the same as minimizing J(w;X,y)+\frac{\alpha}{2} w^Tw = \tilde{J}(w;X,y). By using L2 regularization, you act as if you were performing MAP estimation with a Gaussian prior over your weights, and you maximize P(w|X) by doing gradient descent.


      • Jonathan says:

        I have gotten to the same points as you in the demonstration of L2 being equivalent to a Gaussian prior and L1 being equivalent to a Laplacian prior. The only point that I think is missing is how you get the log-prior from:
        1. $w \sim \mathcal{N}(0, \frac{I}{\alpha})$
        2. $w \sim Laplace(0,\frac{1}{\alpha})$

        You just needed to add that for L2:
        P(w)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(w-\mu)^2}{2\sigma^2}}
        \log P(w) = c_{1}w^{2} + c_{0} if \mu = 0, which is the case when w \sim \mathcal{N}(0, \frac{I}{\alpha}), with c_{1} = -\frac{1}{2\sigma^2} = -\frac{\alpha}{2}

        And for L1, it is probably some similar derivation, but I don’t know the formula for the Laplace distribution.
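
For reference, the L1 step follows from the standard Laplace density. A short derivation, assuming a zero-mean Laplace prior with scale b = 1/\alpha applied independently to each weight w_i:

```latex
p(w_i) = \frac{1}{2b} \exp\left(-\frac{|w_i - \mu|}{b}\right),
\qquad \mu = 0, \quad b = \frac{1}{\alpha}

\log P(w) = \sum_i \log p(w_i)
          = -\alpha \sum_i |w_i| + \text{const}
          = -\alpha \|w\|_1 + \text{const}
```

Maximizing \log P(X|w) + \log P(w) is then the same as minimizing the cost plus the L1 penalty \alpha \|w\|_1.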


    • You can look at linear regression in a probabilistic fashion, in which the likelihood p(y|x) is a Gaussian density N(y; f(x), \sigma^2), where f(x) is the mean and \sigma^2 is the variance. If you take the negative log of the density, the normalising constant becomes an additive term that does not depend on the parameters and the \sigma term becomes a constant scale factor, so you are left with the typical squared error term (y - f(x))^2.


  17. Jonathan says:

    For a neural network, what loss function makes sense for binary targets and can be interpreted as minus a conditional log-likelihood?


  18. Jonathan says:

    Consider a function f returning a random output given an input state x ∈ X ,
    returning a new state x′ ∈ X, i.e., f outputs a sample from a conditional distribution
    Q(x′|x), and its successive application would generate a Monte-Carlo Markov chain.
    Let P represent the asymptotic distribution of that chain (and assume it exists), i.e., a
    distribution on the space X . What relationship must hold between P and Q?


    • Applying Q repeatedly to generate samples should make the set of samples converge to a set representative of the distribution P, I think.
      Or if Q(x_a|x_{a-1})=Q(x_{a+1}|Q(x_a|x_{a-1})) then Q=P
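
For a finite state space, the usual way to state the relationship is the stationarity condition P(x′) = Σ_x Q(x′|x) P(x): applying Q once more to samples from P leaves their distribution unchanged. Below is a minimal numerical sketch; the 3-state transition matrix is an arbitrary example chosen for illustration.

```python
import numpy as np

# Q[i, j] = Q(x' = j | x = i); each row sums to 1. Values are arbitrary.
Q = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.1, 0.3, 0.6]])

# Apply the chain many times to an arbitrary starting distribution.
P = np.array([1.0, 0.0, 0.0])
for _ in range(1000):
    P = P.dot(Q)

# Stationarity: one more application of Q leaves P unchanged.
P_next = P.dot(Q)
```

Starting from a different initial distribution gives the same `P` for this chain, since every state can reach every other state and the chain is aperiodic.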


  19. For the 2012 final, question 4: I’m not entirely sure what it’s asking. Is 1) a sparse autoencoder, and 2) a sparse denoising autoencoder? Or is 2) just a denoising autoencoder? In either case, I don’t see how c) corruption level factors into 1): doesn’t it become a denoising autoencoder as soon as you add noise to the input? Or are we also saying that you are adding noise to the target?

    In general, for that question I’m not sure about the difference in behaviour between 1) and 2) (unless (c) doesn’t apply to (1) )

