Ziyu Wang, Lejun Min, Gus Xia @MusicXLab
[Paper] | [Code Repo]

This is the demo page for the paper under review, Whole-song Hierarchical Generation of Symbolic Music Using Cascaded Diffusion Models. We make the first attempt to model full pop songs (melody and piano accompaniment) by realizing a compositional hierarchy.

In particular, we define four levels of hierarchical music language, where each level focuses on context dependency at a certain musical scope, from the high-level whole-song form, phrases, and cadences down to low-level notes, chords, and their local patterns. A cascaded diffusion model is trained to model this hierarchical language, with each level conditioned on the levels above it.

An Example of Hierarchical Whole-song Generation

Here is a 40-bar whole-song generation example, including melody and accompaniment. Background colors indicate the phrase division of its musical form, which shows a clear verse-chorus structure.

Intro (i) / Outro (o)
Verse (Phrase A)
Chorus (Phrase B)
Bridge (b)

This whole piece is generated by the model in a top-down fashion. First, the model generates the top-level music language, Form, which specifies the key and the phrase division. The following Form visualizes a piece in Ab major (shown as the tonic key and the scale notes) with the phrase division i4A4A4B8b4A4B8o4 (4-measure intro, 4-measure phrase A, etc., shown as the background colors).
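The phrase-division notation can be read mechanically: each letter is a phrase label and the number after it is the phrase length in measures. The following is a minimal sketch of such a reader, not the paper's code; the function name `parse_form` is hypothetical.

```python
import re

def parse_form(form: str) -> list[tuple[str, int]]:
    """Split a phrase-division string into (label, length_in_measures) pairs.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    phrases = re.findall(r"([A-Za-z])(\d+)", form)
    # Reject strings that contain anything other than letter+number pairs.
    if "".join(label + length for label, length in phrases) != form:
        raise ValueError(f"malformed phrase string: {form!r}")
    return [(label, int(length)) for label, length in phrases]

form = parse_form("i4A4A4B8b4A4B8o4")
total = sum(length for _, length in form)
print(form)
print(total)  # 40, matching the 40-bar example above
```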


Next, based on the Form above, the model generates the 2nd-level music language, Reduced Lead Sheet, which involves a reduced melody and simplified chords. This level shows phrase development and structure, including phrase similarity and phrase cadences, expressed through harmony and melody.


Then, based on the Form and Reduced Lead Sheet above, the model generates the 3rd-level music language, Lead Sheet, which involves the lead melody and chords. This level realizes the actual lead melody and harmony.


Finally, based on the Form, Reduced Lead Sheet, and Lead Sheet above, the model generates the final-level music language, piano Accompaniment.
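The top-down order described above can be sketched as a simple loop in which each level is sampled given everything generated above it. This only illustrates the conditioning order: the `samplers` callables are hypothetical placeholders standing in for the trained diffusion models, and none of the names come from the paper's code.

```python
LEVELS = ["Form", "Reduced Lead Sheet", "Lead Sheet", "Accompaniment"]

def generate_whole_song(samplers):
    """Run a top-down cascade: each level is sampled given all upper levels.

    `samplers` (hypothetical) maps a level name to a callable that takes the
    dict of already-generated upper levels and returns that level's output.
    """
    generated = {}
    for level in LEVELS:
        upper = dict(generated)  # snapshot of everything above this level
        generated[level] = samplers[level](upper)
    return generated

# Placeholder samplers that only record what they were conditioned on.
stub = {
    name: (lambda upper, name=name: f"{name} | given {list(upper)}")
    for name in LEVELS
}
result = generate_whole_song(stub)
for level in LEVELS:
    print(result[level])
```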



Variations & Controls on Each Level

With the four-level hierarchical music languages, more abstract music concepts at higher levels are realized by stylistic specifications at lower levels. We show 8-measure generated variations on each level to demonstrate that our generation is diverse, high-quality, and controllable.

Reduced Lead Sheet

If we ask for an 8-measure verse in Eb major (by setting the top-level language, Form), the model can generate a Reduced Lead Sheet in a variety of styles:

External Control. Moreover, the style of the generation can be further constrained by a specified chord progression. The progression is encoded by a pre-trained VAE, and the model’s layers attend to the encoding through cross-attention. Here are the variations conditioned on consecutive Eb major chords.

Lead Sheet

Under the same Form as above, we further specify the Reduced Lead Sheet as follows:

The model can generate Lead Sheet in a variety of styles:

External Control. Moreover, the style of the generation can be further constrained by a specified rhythmic pattern. The pattern is encoded by a pre-trained VAE, and the model’s layers attend to the encoding through cross-attention. Here are the variations conditioned on a dense rhythmic pattern.

Accompaniment

Finally, under the same Form and Reduced Lead Sheet as above, we further specify the Lead Sheet as follows:

The model can generate Accompaniment in a variety of styles (shown with the lead melody):

External Control. Moreover, the style of the generation can be further constrained by a specified musical texture. The texture is encoded by a pre-trained VAE, and the model’s layers attend to the encoding through cross-attention. Here is the variation conditioned on an Alberti bass texture, where the left hand plays Eb quarter notes and the right hand plays an Alberti pattern on the Eb chord.
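For readers unfamiliar with the texture, an Alberti pattern arpeggiates a triad in the order lowest, highest, middle, highest. The sketch below writes out one measure of such a texture as MIDI pitch numbers. The register (Eb4 triad over an Eb3 bass), the 4/4 meter, and the eighth-note subdivision are assumptions for illustration, not values taken from the page, and the function names are hypothetical.

```python
# MIDI pitch numbers with middle C (C4) = 60. The register is an assumption.
EB_MAJOR_TRIAD = (63, 67, 70)  # Eb4, G4, Bb4
EB_BASS = 51                   # Eb3

def alberti_measure(triad, notes_per_measure=8):
    """Return one measure of an Alberti pattern: low, high, middle, high."""
    low, mid, high = sorted(triad)
    cell = [low, high, mid, high]
    return [cell[i % 4] for i in range(notes_per_measure)]

def quarter_note_bass(root, beats=4):
    """Return one measure of repeated quarter notes on a single pitch."""
    return [root] * beats

right_hand = alberti_measure(EB_MAJOR_TRIAD)  # eight eighth notes (assumed 4/4)
left_hand = quarter_note_bass(EB_BASS)        # four quarter notes
print(right_hand)  # [63, 70, 67, 70, 63, 70, 67, 70]
print(left_hand)   # [51, 51, 51, 51]
```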


More Examples of Whole-song Generation

We list more whole-song samples (lead melody and piano accompaniment) categorized by their phrase division.

Intro (i) / Outro (o)
Phrase A
Phrase B
Bridge (b)
  • i4A4A4B8b4A4B8o4

  • A8B8A8B8B8

  • i4A4B4b8A4B4o4


Thanks to html-midi-player for the excellent MIDI visualization.