
MuseGAN—Polyphonic Music Generation

Learn how to generate polyphonic music using MuseGAN.

The two models we have trained so far were simplified versions of how music is actually composed and perceived. While limited, both the attention-based LSTM model and the C-RNN-GAN-based model gave us a good understanding of the music generation process. In this section, we’ll build on what we’ve learned so far and move toward a setup that is as close to the actual task of music generation as possible.

In 2017, Dong et al. presented a GAN-type framework for multi-track music generation in their work, “MuseGAN: Multi-Track Sequential Generative Adversarial Networks for Symbolic Music Generation and Accompaniment” (Dong, Hao-Wen, Wen-Yi Hsiao, Li-Chia Yang, and Yi-Hsuan Yang. 2018. https://salu133445.github.io/musegan/pdf/musegan-aaai2018-paper.pdf). The paper offers a detailed explanation of various music-related concepts and of how Dong and the team tackled them.

Challenges

Let’s look at the three main properties of music that the MuseGAN work tries to take into account:

  • Multi-track interdependency: Most songs we listen to are composed of multiple instruments, such as drums, guitars, bass, vocals, and so on. There is a high degree of interdependency in how these tracks play together so that the listener perceives coherence and rhythm.

  • Musical texture: Musical notes are often grouped into chords and melodies. These groupings overlap heavily in time and do not necessarily follow a strict chronological order (an assumption of chronological ordering is applied in most well-known works on music generation). The chronological-ordering assumption stems not only from the need for simplification but also from generalizing ideas from the NLP domain, language generation in particular.

  • Temporal structure: Music has a hierarchical structure in which a song (at the highest level) can be seen as composed of paragraphs. A paragraph is composed of various phrases, which are, in turn, composed of multiple bars, and so on. The figure below depicts this hierarchy:

Temporal structure of a song

As shown in the figure, a bar is further composed of beats, and at the lowest level, we have pixels (the smallest time-pitch cells of a piano roll). The authors of MuseGAN take the bar, rather than the note, as the compositional unit (we have considered the note the basic unit so far). This is done to account for the grouping of notes that arises in a multi-track setup. The sketch below makes this representation concrete.
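The following is a minimal sketch, using NumPy, of how a multi-track piano roll can be laid out as a tensor indexed by track, bar, time step, and pitch. The dimensions (96 time steps per bar, 84 pitches) follow the preprocessing described in the MuseGAN paper; the variable names and the example note are our own illustration.

```python
import numpy as np

# A multi-track piano roll with the bar as the compositional unit:
# time is indexed as (bar, step-within-bar), not as a flat note sequence.
n_tracks = 4     # e.g., drums, bass, guitar, strings
n_bars = 2       # bars per generated phrase
n_steps = 96     # time steps ("pixels") per bar, as in the MuseGAN paper
n_pitches = 84   # pitch range kept after cropping the 128 MIDI pitches

# Binary piano roll: True where a track sounds a pitch at a given step.
piano_roll = np.zeros((n_tracks, n_bars, n_steps, n_pitches), dtype=np.bool_)

# Illustrative note: track 1 holds pitch index 36 for half of bar 0.
piano_roll[1, 0, :48, 36] = True
print(piano_roll.shape)  # (4, 2, 96, 84)
```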

Solutions

MuseGAN works toward solving these three major challenges through a unique framework built on three basic music generation approaches: the jamming, composer, and hybrid models. We'll briefly explain each of these now.

Jamming model

If we were to extrapolate the simplified monophonic GAN setup from the previous section to a polyphonic setup, the simplest method would be to use multiple generator-discriminator combinations, one for each instrument. The jamming model is precisely this setup, where M independent generators prepare music from their respective random vectors. Each generator has its own critic/discriminator, which helps in training the overall GAN. This setup is depicted in the figure below:

Jamming model

As shown in the preceding figure, the jamming model is composed of M generator-discriminator pairs for generating multi-track output. It imitates a group of musicians who create music by improvising independently, without any predefined arrangement.
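To make the idea concrete, here is a minimal sketch of the jamming setup in TensorFlow/Keras. The dense layers are toy stand-ins for the convolutional networks used in the paper; all sizes, names, and shapes here are illustrative assumptions, not the authors' implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

LATENT_DIM = 32          # size of each track's private random vector (illustrative)
BAR_SHAPE = (96, 84, 1)  # (time steps, pitches, channels) for one bar of one track
N_TRACKS = 4             # the M of the text: one GAN pair per instrument

def make_generator():
    # Maps a track's private random vector to a single-track bar.
    return tf.keras.Sequential([
        layers.Dense(256, activation="relu", input_shape=(LATENT_DIM,)),
        layers.Dense(96 * 84, activation="sigmoid"),
        layers.Reshape(BAR_SHAPE),
    ])

def make_discriminator():
    # Scores a single-track bar as real or generated.
    return tf.keras.Sequential([
        layers.Flatten(input_shape=BAR_SHAPE),
        layers.Dense(256, activation="relu"),
        layers.Dense(1),
    ])

# M fully independent generator-discriminator pairs: no weight sharing and
# no shared input, so each "musician" improvises on its own.
jamming_pairs = [(make_generator(), make_discriminator()) for _ in range(N_TRACKS)]

# Each track is generated from its own random vector.
tracks = [gen(tf.random.normal((1, LATENT_DIM))) for gen, _ in jamming_pairs]
print(tracks[0].shape)  # (1, 96, 84, 1)
```

Because nothing ties the M random vectors together, coherence across tracks has to emerge by chance, which is exactly the limitation the composer model addresses.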

Composer model

As the name suggests, this setup assumes that the generator is a typical human composer capable of creating multi-track piano rolls, as shown in the figure below:

Composer model

As shown in the figure, the composer model consists of a single generator capable of generating all M tracks and a single discriminator for detecting fake (generated) versus real samples. This model requires only one common random vector, as opposed to the M random vectors needed in the jamming model setup.
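A matching sketch of the composer model, under the same illustrative assumptions as before, replaces the M pairs with a single generator that emits all tracks from one shared random vector and a single discriminator that judges them jointly.

```python
import tensorflow as tf
from tensorflow.keras import layers

LATENT_DIM = 32
N_TRACKS = 4
BAR_SHAPE = (96, 84, N_TRACKS)  # all M tracks stacked along the channel axis

# One generator maps a single shared random vector to all tracks at once,
# like a composer writing every part of the score together.
composer_generator = tf.keras.Sequential([
    layers.Dense(256, activation="relu", input_shape=(LATENT_DIM,)),
    layers.Dense(96 * 84 * N_TRACKS, activation="sigmoid"),
    layers.Reshape(BAR_SHAPE),
])

# One discriminator sees the full multi-track bar, so it can penalize
# tracks that do not fit together.
composer_discriminator = tf.keras.Sequential([
    layers.Flatten(input_shape=BAR_SHAPE),
    layers.Dense(256, activation="relu"),
    layers.Dense(1),
])

z = tf.random.normal((1, LATENT_DIM))    # a single common random vector
multi_track_bar = composer_generator(z)  # shape: (1, 96, 84, 4)
```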

Hybrid model

This is an interesting take that arises from combining the jamming and composer models. The hybrid model has M ...
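As a rough sketch of the hybrid idea, assuming the design described in the MuseGAN paper, each of the M generators receives a shared inter-track random vector concatenated with its own private intra-track vector, while a single discriminator judges the combined output. Sizes and names remain illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

LATENT_DIM = 32
N_TRACKS = 4
BAR_SHAPE = (96, 84, 1)

def make_track_generator():
    # Input: the shared inter-track vector concatenated with this track's
    # private intra-track vector, hence 2 * LATENT_DIM.
    return tf.keras.Sequential([
        layers.Dense(256, activation="relu", input_shape=(2 * LATENT_DIM,)),
        layers.Dense(96 * 84, activation="sigmoid"),
        layers.Reshape(BAR_SHAPE),
    ])

generators = [make_track_generator() for _ in range(N_TRACKS)]

# A single discriminator judges all tracks jointly, as in the composer model.
discriminator = tf.keras.Sequential([
    layers.Flatten(input_shape=(96, 84, N_TRACKS)),
    layers.Dense(256, activation="relu"),
    layers.Dense(1),
])

z_inter = tf.random.normal((1, LATENT_DIM))  # shared across all tracks
tracks = [gen(tf.concat([z_inter, tf.random.normal((1, LATENT_DIM))], axis=-1))
          for gen in generators]              # one private vector per track
multi_track_bar = tf.concat(tracks, axis=-1)  # shape: (1, 96, 84, 4)
```

The shared vector nudges the tracks toward mutual coherence, while the private vectors preserve per-track independence, combining the strengths of the two previous models.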