Superhuman Level Music Generation With Deep/Reinforcement Learning

TL;DR

Can neural networks generate music better than a human composer/producer? Can we generate a hit song that surpasses Taylor Swift in popularity? Here we discuss one possible approach.

Abstract (please skip)

Blah blah. However, there are relatively few attempts at modeling the general population's sound preferences, not to mention generative approaches based on such models. Blah blah.

Methodology

Dataset

One can almost trivially collect freely available sound/music preference data from online music hosting services like SoundCloud. Each collected entry would look as follows:

Waveform of sound/music, Times played, Times liked

Since the sound/music is collected as raw waveforms, there's no restriction on the kind of sound/music one can collect.
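For concreteness, here is a minimal sketch of what one such entry could look like in code; the mono float32 format and fixed sample rate are assumptions for illustration, not requirements of the collection process.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Entry:
    waveform: np.ndarray   # raw audio samples, e.g. mono float32 at 22050 Hz (assumed)
    times_played: int
    times_liked: int
```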

Data augmentation

Amplitude scaling, pitch shifting, time stretching, etc.
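A minimal sketch of such augmentation, assuming librosa is used for the pitch and tempo operations; the parameter ranges below are illustrative, not tuned.

```python
import numpy as np
import librosa

def augment(y: np.ndarray, sr: int) -> np.ndarray:
    """Return a randomly augmented copy of waveform y at sample rate sr."""
    # Amplitude scaling: random gain between -6 dB and +6 dB.
    y = y * 10 ** (np.random.uniform(-6, 6) / 20)
    # Pitch shifting: up to +/- 2 semitones.
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=np.random.uniform(-2, 2))
    # Time stretching: 0.9x to 1.1x speed.
    y = librosa.effects.time_stretch(y, rate=np.random.uniform(0.9, 1.1))
    return y
```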

Metric

We believe that there's a common/general preference for certain sounds and their combinations over others among all humans. For example, it's very easy to name sounds that most people enjoy (piano) or detest (fingernail scratching). Therefore we define each sound/music sample's popularity with a scalar number p, where:

p = (Times liked / Times played) * log(Times liked)

This ensures that samples with a higher like-to-play ratio score higher, while the log(Times liked) factor keeps samples with only a handful of likes from topping the ranking on a lucky ratio alone.
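As a worked example (taking the natural log; the post doesn't fix a base), a track played 10,000 times with 2,000 likes gets p = 0.2 * log(2000) ≈ 1.52, while a track played 5 times with 4 likes only gets p = 0.8 * log(4) ≈ 1.11 despite its much higher like ratio. A minimal sketch of the score in code:

```python
import math

def popularity(times_played: int, times_liked: int) -> float:
    """p = (times liked / times played) * log(times liked); natural log assumed."""
    if times_played == 0 or times_liked == 0:
        return 0.0   # no plays or no likes: treat as zero popularity
    return times_liked / times_played * math.log(times_liked)
```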

One must be aware that some genres, like Black hip-hop/rap, might be played and liked far more often on SoundCloud than Billboard-chart music, which sounds good but obviously isn't freely available on SoundCloud, so expect racial and genre bias in the data.

Rating network

Today it is common to train CNN models on sound samples' mel-spectrograms to extract features and fit objectives. This has been done by many researchers around the world, for example in https://arxiv.org/pdf/1704.01280.pdf. The same is done in the state-of-the-art speech synthesis model in https://arxiv.org/abs/1712.05884 (Tacotron 2), which conditions a modified WaveNet on mel-spectrograms.

Thus one might construct a network as follows:
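The post doesn't pin down an architecture, so here is a minimal sketch of one plausible choice, assuming PyTorch: a small CNN over mel-spectrograms with global average pooling, so that inputs of arbitrary length map to a single scalar score.

```python
import torch
import torch.nn as nn

class RatingNetwork(nn.Module):
    def __init__(self, n_mels: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Global average pooling over frequency and time makes the network
        # independent of the input length.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(128, 1)

    def forward(self, mel):                   # mel: (batch, 1, n_mels, time)
        h = self.pool(self.features(mel)).flatten(1)
        return self.head(h).squeeze(-1)       # predicted popularity score p
```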

We then train the network to minimize the mean squared error between its predicted popularity score and the score computed for each sample in our dataset.
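Continuing the sketch above (and assuming a `loader` that yields mel-spectrogram batches paired with the popularity scores computed from the dataset), the training step could look like:

```python
import torch
import torch.nn as nn

model = RatingNetwork()               # defined in the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

for mel, p in loader:                 # mel: (batch, 1, n_mels, time), p: (batch,)
    optimizer.zero_grad()
    loss = loss_fn(model(mel), p)     # mean squared error on the popularity score
    loss.backward()
    optimizer.step()
```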

Once training is done, we've got ourselves a model that can predict the popularity of any sound sample of arbitrary length. We call this network the Rating network.

Sample generation

Given such a rating network, how can we generate high rating music from it?

As shown in various deep learning research, given a discriminative model, one can trivially do gradient ascent/descent on the input data with respect to some metric to obtain inputs that maximize/minimize that metric, the so-called "dreaming". With dreaming we can produce images that maximize the excitation of a particular neuron, or the probability of belonging to a particular category, given a CNN.

We can do the same with sound: given a sample consisting of nothing but white noise, doing gradient ascent on it with respect to the popularity score produced by the Rating network easily yields a sample with a high popularity score.
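A minimal sketch of this dreaming step, under the same assumptions as above (PyTorch, the `RatingNetwork` from the earlier sketch) plus torchaudio for a differentiable mel-spectrogram; the clip length, learning rate and step count are illustrative.

```python
import torch
import torchaudio

sr = 22050
to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_fft=1024,
                                              hop_length=256, n_mels=128)

rating_network = RatingNetwork()      # trained weights assumed to be loaded
rating_network.eval()
for param in rating_network.parameters():
    param.requires_grad_(False)       # freeze the rater; only the input changes

waveform = torch.randn(1, sr * 10, requires_grad=True)   # 10 seconds of white noise
optimizer = torch.optim.Adam([waveform], lr=1e-3)

for step in range(1000):
    optimizer.zero_grad()
    mel = to_mel(waveform).unsqueeze(1)     # (1, 1, n_mels, time)
    score = rating_network(mel).mean()      # predicted popularity p
    (-score).backward()                     # ascend p by descending -p
    optimizer.step()
```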

Such samples, if generated naively as described above, are likely to be noisy and meaningless -- just like adversarial samples (input samples constructed specifically to fool a discriminative network). Fortunately we already know how to deal with these side effects when generating samples by dreaming, for example with the techniques in https://distill.pub/2017/feature-visualization/.

The only problem with the gradient ascent/dreaming approach is that, since the Rating network only captures features that help it rate sound/music, it basically ignores all the features that contribute little to the popularity score.

For example, consider drum patterns. Humans prefer repetitive drum patterns (more predictable, less random). But there are probably no randomly generated drum samples in the dataset (come on, who would upload such garbage?), so the network might completely ignore the regularity of drum patterns when estimating popularity. As a result, a sample generated by gradient ascent might contain randomly placed drum hits, which is of course unrealistic and annoying.

More technically, since our dataset can never cover the space of all possible sound samples (which is much larger than the space we live in), it is trivial to find an adversarial sample that appears nowhere in the dataset and does not sound natural at all, yet still receives a high popularity score.

There are currently two approaches to dealing with this problem:

  1. Gradient ascent with adversarial loss

    We train a discriminator network that classifies samples from our dataset as real, and samples generated by the dreaming process as fake.

    Then we do gradient ascent to maximize not only the popularity score but also the realness score from the discriminator. By doing so we effectively limit the space of generated sound samples to those that sound 'real' (see the sketch after this list).

    This technique (ensuring samples' realness by imposing an adversarial loss) is widely used in image generation/image translation/image upsampling and has obtained impressive results.

  2. Reinforcement Learning to Produce

    We train a reinforcement learning agent by asking it to act sequentially on a music production machine (a piano, for example) to produce pieces of music.

    The goal is to maximize the reward signal, which is generated by evaluating the produced music with our rating network.

    Since the space of all possible sound samples is now limited to those that this machine can produce, it becomes much harder to come up with adversarial samples that do not sound like anything.
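A minimal sketch of approach 1, reusing the names from the earlier sketches (`to_mel`, `rating_network`) plus a hypothetical `discriminator` that outputs a realness score for a mel-spectrogram; the trade-off weight `alpha` is an illustrative hyperparameter, not something specified here.

```python
import torch

alpha = 0.1                                   # popularity vs. realness trade-off (assumed)
waveform = torch.randn(1, 22050 * 10, requires_grad=True)
optimizer = torch.optim.Adam([waveform], lr=1e-3)

for step in range(1000):
    optimizer.zero_grad()
    mel = to_mel(waveform).unsqueeze(1)       # (1, 1, n_mels, time)
    pop = rating_network(mel).mean()          # how much people should like it
    real = discriminator(mel).mean()          # how "real" it sounds
    (-(pop + alpha * real)).backward()        # ascend both objectives at once
    optimizer.step()
```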

Summary

That's about it. By successfully implementing this blogpost, you can earn yourself a position in the hall of fame of Computer Science.

20171223

file: musicgen.md

last modified: 2017-12-23 04:31