Vocal Synthesis Project

This project is an exercise in generative modeling. The trained model should be able to generate sounds similar to those in the training data. A generative process necessarily involves injecting randomness, so that each time we run it, we get a different sample.

One way to view a generative process is that it has two components: an ordinary random number generator (e.g., producing uniform or Gaussian variates) and a deterministic function that turns these iid random numbers into tuples of numbers (such as the pixels in an image or the acoustic samples in a signal) that are strongly dependent on each other. The generated variables should have approximately the same joint distribution as the data-generating process that produced the training set.
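To make this concrete, here is a minimal sketch in Python with NumPy. The mixing matrix W is an arbitrary illustrative stand-in for the deterministic map that a trained model would learn; only the structure (iid noise in, dependent outputs out) is the point.

    import numpy as np

    rng = np.random.default_rng(0)

    # A fixed "decoder": in a trained model this deterministic map would be
    # learned; here an arbitrary mixing matrix plays that role.
    W = rng.standard_normal((16, 4))      # maps 4 latent numbers to 16 outputs

    def generate():
        z = rng.standard_normal(4)        # iid Gaussian variates (the random part)
        x = np.tanh(W @ z)                # deterministic map -> dependent outputs
        return x

    # Each call injects fresh randomness, so each sample differs.
    print(generate())
    print(generate())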

In the case of sequences, a common way to construct a generative process is to decompose the joint distribution into a sequence of conditionals in which we generate the t-th block given the previously generated blocks. We can thus use a conditional distribution for each of these steps, something that neural networks are good at modeling. We encourage you to play with various types of recurrent networks for this. Each time step will represent a block of acoustic samples (you will have to experiment with the block size to see what works well). The input of the RNN would be the current block (or a window of recent blocks) and the output would specify a distribution over the next block (the simplest choice being a Gaussian with a mean vector and a diagonal covariance matrix). The RNN state (hidden layers) thus summarizes all the past blocks. You are encouraged to explore different RNN architectures: for example, GRU and LSTM architectures tend to work better than vanilla tanh RNNs, and various forms of deep RNNs also tend to help. The output distribution can also be made more complicated, for example a Gaussian mixture rather than a single Gaussian. If the output is not well modelled by a Gaussian (e.g., the pitch is strictly positive), then you need to choose another density.
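As one possible starting point, here is a minimal PyTorch sketch of that setup. The block size, hidden size, and the names (BlockRNN, nll) are illustrative assumptions, not prescriptions: a GRU reads the current block and emits a mean and log-variance for the next block, which is enough for both maximum-likelihood training and autoregressive sampling.

    import torch
    import torch.nn as nn

    BLOCK = 128      # acoustic samples per block (tune this)
    HIDDEN = 256

    class BlockRNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.rnn = nn.GRU(BLOCK, HIDDEN, batch_first=True)
            self.mean = nn.Linear(HIDDEN, BLOCK)      # mean of the next block
            self.logvar = nn.Linear(HIDDEN, BLOCK)    # log of the diagonal variance

        def forward(self, blocks, h=None):
            # blocks: (batch, n_blocks, BLOCK)
            out, h = self.rnn(blocks, h)
            return self.mean(out), self.logvar(out), h

    def nll(mean, logvar, target):
        # Diagonal-Gaussian negative log-likelihood, up to an additive constant.
        return 0.5 * (logvar + (target - mean) ** 2 / logvar.exp()).mean()

    model = BlockRNN()

    # Training step: predict block t+1 from blocks up to t.
    x = torch.randn(8, 50, BLOCK)                     # stand-in for real waveform blocks
    loss = nll(*model(x[:, :-1])[:2], x[:, 1:])
    loss.backward()

    # Sampling: feed each generated block back in as the next input.
    @torch.no_grad()
    def sample(n_blocks):
        block, h, out = torch.zeros(1, 1, BLOCK), None, []
        for _ in range(n_blocks):
            mean, logvar, h = model(block, h)
            block = mean + (0.5 * logvar).exp() * torch.randn_like(mean)
            out.append(block.squeeze())
        return torch.cat(out)

Swapping the GRU for an LSTM or stacking several recurrent layers only changes the constructor; the training and sampling loops stay the same.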

A common approach in sound generation is to work in a representation space different from the raw acoustic space. The main motivation is that human hearing is more sensitive to certain aspects of the signal, such as the pitch (the dominant low-frequency component of the signal), the envelope (the magnitude of the signal over a large block), and the spectral signature (the amplitude of the different frequency components in some spectral representation such as the Fourier spectrum), while being much less sensitive to other aspects (such as the phase of the different frequency components). You are free to play with different representations, but the simplest one is of course the raw acoustic signal (a sequence of real numbers, one per time step in the waveform).
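If you do experiment with a spectral representation, a minimal sketch of the idea (using NumPy's FFT; the frame and hop sizes are arbitrary illustrative values) is to slice the waveform into overlapping windowed frames and keep only the magnitudes of their Fourier coefficients, discarding the phase that the ear is largely insensitive to. Keep in mind that to listen to sequences generated in such a space you will need some way to map them back to a waveform (e.g., phase reconstruction).

    import numpy as np

    def magnitude_spectrogram(wave, frame=512, hop=128):
        # Overlapping windowed frames; keep only the magnitude of each
        # frame's Fourier coefficients (the phase is discarded).
        window = np.hanning(frame)
        frames = [wave[i:i + frame] * window
                  for i in range(0, len(wave) - frame + 1, hop)]
        return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

    wave = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)   # 1 s of a 440 Hz tone
    spec = magnitude_spectrogram(wave)
    print(spec.shape)    # (n_frames, frame // 2 + 1)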

There is no universally accepted way to evaluate the quality of a generative model, so the ultimate evaluation comes from listening to generated sequences. You can also visualize the waveform, but ultimately the ear is a better judge. When the model explicitly computes the joint distribution of the data, the test-set log-likelihood can be used as a quantitative benchmark, but only models that work on the same input space can be compared in this way.
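For a model with an explicit density, such as the hypothetical block-Gaussian RNN sketched above, the held-out log-likelihood can be read directly off the predicted parameters. A minimal sketch, reusing that model's interface (mean, log-variance per block):

    import math
    import torch

    @torch.no_grad()
    def test_log_likelihood(model, test_blocks):
        # test_blocks: (batch, n_blocks, BLOCK) held-out waveform blocks.
        mean, logvar, _ = model(test_blocks[:, :-1])
        target = test_blocks[:, 1:]
        # Full diagonal-Gaussian log-density (constant included), summed over
        # each sequence and averaged over the test set.
        ll = -0.5 * (math.log(2 * math.pi) + logvar
                     + (target - mean) ** 2 / logvar.exp())
        return ll.sum(dim=(1, 2)).mean()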

Generative models are fun! But keep in mind that this is a difficult exercise, one that sits closer to the edge of research than the cats-and-dogs project.

Here is a lecture by Alex Graves on the subject of using generative RNNs to generate real-valued sequences such as handwriting and speech: https://www.youtube.com/watch?v=-yX1SYeDHbg

And a recent paper by Chung et al. (NIPS 2015) on the same topic: http://arxiv.org/abs/1506.02216