MuseGAN—Polyphonic Music Generation
Learn how to generate polyphonic music using MuseGAN.
The two models we have trained so far were simplified versions of how music is actually composed and perceived. While limited, both the attention-based LSTM model and the C-RNN-GAN-based model gave us a solid understanding of the music generation process. In this section, we’ll build on what we’ve learned so far and move toward a setup that is as close to the actual task of music generation as possible.
In 2017, Dong et al. presented a GAN-type framework for multi-track music generation in their work, “MuseGAN: Multi-track Sequential Generative Adversarial Networks for Symbolic Music Generation and Accompaniment.”
Challenges
Let’s understand the three main properties related to music that the MuseGAN work tries to take into account:
Multi-track interdependency: As we know, most songs that we listen to are usually composed of multiple instruments such as drums, guitars, bass, vocals, and so on. These components are highly interdependent: they must play out in coordination for the end user/listener to perceive coherence and rhythm.
Musical texture: Musical notes are often grouped into chords and melodies. These groupings overlap heavily and do not necessarily follow a strict chronological ordering, although most well-known works on music generation assume one. This chronological-ordering assumption stems not only from the need for simplification but also from generalizing ideas from the NLP domain, language generation in particular.
Temporal structure: Music has a hierarchical structure where a song can be seen as being composed of paragraphs (at the highest level). A paragraph is composed of various phrases, which are, in turn, composed of multiple bars, and so on. The figure below depicts this hierarchy pictorially:
As shown in the figure, a bar is further composed of beats, and at the lowest level, we have pixels. The authors of MuseGAN take the bar, rather than the note, as the basic compositional unit (so far, we had been treating the note as the basic unit). This choice accounts for the grouping of notes that arises in a multi-track setup.
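To make this hierarchy concrete, consider how a multi-track piano-roll sample might be laid out as a tensor. The following is a minimal sketch; the dimensions (tracks, bars per phrase, time steps per bar, pitch range) are illustrative assumptions, not values mandated by MuseGAN:

```python
import numpy as np

# Illustrative dimensions for a multi-track piano-roll tensor.
n_tracks = 5     # e.g., bass, drums, guitar, piano, strings
n_bars = 4       # bars per phrase
n_steps = 96     # time steps (the "pixels") per bar
n_pitches = 84   # pitch range kept after cropping

# One phrase: n_bars bars, each a (time step x pitch) grid, per track.
phrase = np.zeros((n_bars, n_steps, n_pitches, n_tracks), dtype=bool)

# Turn on one (arbitrary, illustrative) pitch for the first quarter of
# the first bar on one track.
phrase[0, 0:24, 60, 3] = True
print(phrase.shape)  # (4, 96, 84, 5)
```

Treating the bar, rather than the individual time step, as the compositional unit corresponds to slicing this tensor along its first axis.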
Solutions
MuseGAN works toward solving these three major challenges through a unique framework built on three basic music generation approaches: the jamming, composer, and hybrid models. We'll briefly explain each of these now.
Jamming model
If we were to extrapolate the simplified monophonic GAN setup from the previous section to a polyphonic setup, the simplest method would be to use multiple generator-discriminator combinations, one for each instrument. The jamming model is precisely this setup, where multiple generators work independently, each producing music for its own track and receiving feedback from its own dedicated discriminator.
As shown in the preceding figure, the jamming model is composed of one generator-discriminator pair per track. Each generator takes its own private random vector as input, and each discriminator critiques only its corresponding track, so no information is shared across tracks. A minimal sketch of this setup follows.
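The sketch below assumes Keras and toy fully connected networks; the latent size, the 96 x 84 single-track bar shape, and the helper names are illustrative assumptions, not MuseGAN's actual architecture:

```python
import tensorflow as tf

LATENT_DIM, N_TRACKS = 100, 3  # illustrative values

def make_generator():
    # Maps a private random vector to one single-track bar.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(LATENT_DIM,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(96 * 84, activation="sigmoid"),
        tf.keras.layers.Reshape((96, 84)),
    ])

def make_discriminator():
    # Scores a single track as real or fake.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(96, 84)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(1),
    ])

# One independent generator-discriminator pair per track, each
# generator driven by its own private random vector.
pairs = [(make_generator(), make_discriminator()) for _ in range(N_TRACKS)]
zs = [tf.random.normal((1, LATENT_DIM)) for _ in range(N_TRACKS)]
tracks = [gen(z) for (gen, _), z in zip(pairs, zs)]
scores = [disc(t) for (_, disc), t in zip(pairs, tracks)]
```

Because nothing is shared across the pairs, each generated track can sound plausible in isolation while lacking coordination with the others, which motivates the composer model discussed next.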
Composer model
As the name suggests, this setup assumes that the generator is a typical human composer capable of creating multi-track piano rolls, as shown in the figure below:
As shown in the figure, the composer model consists of a single generator capable of generating the full multi-track piano roll and a single discriminator for detecting fake (generated) versus real samples. This model requires only one common random vector, as opposed to the separate per-track random vectors in the previous jamming model setup.
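Under the same toy assumptions as the jamming sketch, the composer setup might look as follows: one generator emits all tracks from a single shared random vector, and one discriminator judges the multi-track output jointly:

```python
import tensorflow as tf

LATENT_DIM, N_TRACKS = 100, 3  # illustrative values

# One generator produces every track at once.
composer_generator = tf.keras.Sequential([
    tf.keras.Input(shape=(LATENT_DIM,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(96 * 84 * N_TRACKS, activation="sigmoid"),
    tf.keras.layers.Reshape((96, 84, N_TRACKS)),  # tracks as channels
])

# One discriminator scores the whole multi-track piece.
joint_discriminator = tf.keras.Sequential([
    tf.keras.Input(shape=(96, 84, N_TRACKS)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1),
])

z = tf.random.normal((1, LATENT_DIM))  # one common random vector
piece = composer_generator(z)          # shape (1, 96, 84, N_TRACKS)
score = joint_discriminator(piece)
```

Because the discriminator sees all tracks at once, it can penalize incoherence across instruments, something the per-track discriminators of the jamming model cannot do.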
Hybrid model
This is an interesting take that arises by combining the jamming and composer models. The hybrid model has one generator per track, as in the jamming model, but each generator takes a shared inter-track random vector in addition to its own private intra-track random vector. Unlike the jamming model, however, it uses only a single discriminator, which evaluates all the tracks together.
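A sketch of the hybrid setup under the same toy assumptions follows; again, the dimensions and layer choices are illustrative, not MuseGAN's actual architecture:

```python
import tensorflow as tf

LATENT_DIM, N_TRACKS = 100, 3  # illustrative values

def make_track_generator():
    # Input: shared z_inter concatenated with a private z_intra.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(2 * LATENT_DIM,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(96 * 84, activation="sigmoid"),
        tf.keras.layers.Reshape((96, 84, 1)),  # one track as a channel
    ])

generators = [make_track_generator() for _ in range(N_TRACKS)]
z_inter = tf.random.normal((1, LATENT_DIM))               # shared across tracks
z_intras = [tf.random.normal((1, LATENT_DIM)) for _ in range(N_TRACKS)]

# Each generator sees the shared vector plus its own private vector.
tracks = [g(tf.concat([z_inter, z_i], axis=-1))
          for g, z_i in zip(generators, z_intras)]
piece = tf.concat(tracks, axis=-1)  # (1, 96, 84, N_TRACKS)

# A single discriminator scores the combined multi-track output.
discriminator = tf.keras.Sequential([
    tf.keras.Input(shape=(96, 84, N_TRACKS)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1),
])
score = discriminator(piece)
```

Varying z_inter changes the piece as a whole, while each z_intra lets an individual track vary independently, which is exactly the mix of coordination and per-track flexibility the hybrid design aims for.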