Voice and Staff Separation in Symbolic Piano Music with GNNs | by Emmanouil Karystinaios | Oct, 2024

The big question is: how can we make automatic transcription models better?

To develop a more effective system for separating musical notes into voices and staves, particularly for complex piano music, we need to rethink the problem from a different perspective. We aim to improve the readability of transcribed music starting from a quantized MIDI, which is crucial for creating good score engravings and better performance by musicians.

For good score readability, two elements are probably the most important:

  • the separation of staves, which organizes the notes between the top and bottom staff;
  • and the separation of voices, highlighted in this picture with lines of different colors.
Voice streams in a piano score

In piano scores, as discussed before, voices are not strictly monophonic but homophonic, which means a single voice can contain one or multiple notes playing at the same time. From now on, we call these chords. You can see some examples of chords highlighted in purple in the bottom staff of the picture above.

From a machine-learning perspective, we have two tasks to solve:

  • The first is staff separation, which is straightforward: we just need to predict, for every note, a binary label for the top or bottom staff, specifically for piano scores.
  • The voice separation task may seem similar; after all, if we can predict the voice number for every note with a multiclass classifier, the problem would be solved!

However, directly predicting voice labels is problematic. We would need to fix the maximum number of voices the system can accept, but this creates a trade-off between our system's flexibility and the class imbalance within the data.

For example, if we set the maximum number of voices to 8, to account for 4 in each staff as is commonly done in music notation software, we can expect to have very few occurrences of labels 8 and 4 in our dataset.

Voice Separation with absolute labels

Looking specifically at the score excerpt here, voices 3, 4, and 8 are completely missing. Highly imbalanced data will degrade the performance of a multilabel classifier, and if we set a lower number of voices, we would lose system flexibility.

The solution to these problems is to be able to transfer the knowledge the system learned on some voices to other voices. For this, we abandon the idea of the multiclass classifier and frame the voice prediction as a link prediction problem. We want to link two notes if they are consecutive in the same voice. This has the advantage of breaking a complex problem into a set of very simple problems where, for each pair of notes, we again predict a binary label telling whether the two notes are linked or not. This approach is also valid for chords, as you can see in the lower voice of this picture.

This process will create a graph which we call the output graph. To find the voices, we can simply compute the connected components of the output graph!
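As a minimal sketch of this last step (the threshold and the array layout are illustrative assumptions, not the paper's code), the voice assignment can be recovered from pairwise link predictions with SciPy's connected-components routine:

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def voices_from_links(link_probs, num_notes, threshold=0.5):
    """Recover voices from predicted note-pair link probabilities.

    link_probs maps (src_note, dst_note) index pairs to the predicted
    probability that the two notes are consecutive in the same voice.
    """
    # Keep only the pairs the classifier considers linked.
    kept = [(i, j) for (i, j), p in link_probs.items() if p >= threshold]
    rows = np.array([i for i, _ in kept], dtype=int)
    cols = np.array([j for _, j in kept], dtype=int)
    data = np.ones(len(kept), dtype=np.int8)

    # Build the sparse output graph over all notes.
    output_graph = coo_matrix((data, (rows, cols)), shape=(num_notes, num_notes))

    # Each connected component of the output graph is one voice.
    num_voices, voice_of_note = connected_components(output_graph, directed=False)
    return num_voices, voice_of_note

# Toy example: four notes, with notes 0 -> 1 -> 2 linked and note 3 left alone.
probs = {(0, 1): 0.9, (1, 2): 0.8, (2, 3): 0.1}
print(voices_from_links(probs, num_notes=4))  # (2, array([0, 0, 0, 1]))
```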

To re-iterate, we formulate the problem of voice and staff separation as two binary prediction tasks.

  • For staff separation, we predict the staff number for every note,
  • and to separate voices we predict links between each pair of notes.

While not strictly necessary, we found it useful for the performance of our system to add an extra task:

  • Chord prediction, where, similarly to voice, we link each pair of notes if they belong to the same chord (a small sketch of these targets follows below).
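To make the three targets concrete, here is a small, simplified sketch under assumed conventions (the array names and the exact linking rule are my own simplification, not the paper's definition): given per-note onset, voice, and staff annotations, staff labels are binary, and voice/chord links are drawn between notes of the same voice.

```python
import numpy as np

def make_targets(onset, voice, staff):
    """Derive binary training targets from per-note annotations.

    onset, voice, staff: 1-D NumPy arrays with one entry per note.
    Returns per-note staff labels plus voice-link and chord-link note pairs.
    """
    # Assuming staff 1 = top and staff 2 = bottom, as in piano scores.
    staff_labels = (staff == 2).astype(int)

    voice_links, chord_links = [], []
    for v in np.unique(voice):
        idx = np.where(voice == v)[0]
        idx = idx[np.argsort(onset[idx])]      # notes of this voice in time order
        for a, b in zip(idx[:-1], idx[1:]):
            if onset[a] == onset[b]:
                chord_links.append((a, b))     # same onset in the same voice -> chord
            else:
                voice_links.append((a, b))     # successive onsets -> voice link
    return staff_labels, voice_links, chord_links
```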

Let's recap what our system looks like so far: we have three binary classifiers, one that takes single notes as input, and two that take pairs of notes. What we need now are good input features, so our classifiers can use contextual information in their predictions. Using deep learning vocabulary, we need a note encoder!

We chose to use a Graph Neural Network (GNN) as a note encoder, since it generally excels in symbolic music processing. Therefore, we need to create a graph from the musical input.

For this, we deterministically build a new graph from the quantized MIDI, which we call the input graph.

Creating these input graphs can be done easily with tools such as GraphMuse.
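As a rough illustration of what such an input graph contains (the edge types and the function below follow a common score-graph recipe and are a simplified assumption, not GraphMuse's exact API), notes can be connected when they start together, follow each other, or overlap in time:

```python
import numpy as np

def build_input_graph(onset, duration):
    """Deterministically build input-graph edges from a quantized note list.

    onset, duration: 1-D NumPy arrays with one entry per note.
    Returns a list of (src, dst, edge_type) tuples.
    """
    offset = onset + duration
    edges = []
    n = len(onset)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if onset[i] == onset[j]:
                edges.append((i, j, "onset"))        # the two notes start together
            elif offset[i] == onset[j]:
                edges.append((i, j, "consecutive"))  # note j starts when note i ends
            elif onset[i] < onset[j] < offset[i]:
                edges.append((i, j, "during"))       # note j starts while note i sounds
    return edges

# Toy example: a two-note chord followed by a third note.
print(build_input_graph(np.array([0, 0, 4]), np.array([4, 4, 4])))
```

In practice, a dedicated library handles this step far more efficiently than the quadratic loop above.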

Now, putting everything together, our model looks something like this:
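Alongside that diagram, here is a minimal code sketch of the assembly (layer sizes, module names, and the use of a PyTorch Geometric SAGEConv encoder are illustrative assumptions, not the paper's exact architecture): a shared GNN note encoder feeds one per-note staff classifier and two pairwise link predictors for voices and chords.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import SAGEConv

class VoiceStaffModel(nn.Module):
    def __init__(self, in_dim, hidden_dim=64):
        super().__init__()
        # Shared GNN note encoder over the input graph.
        self.conv1 = SAGEConv(in_dim, hidden_dim)
        self.conv2 = SAGEConv(hidden_dim, hidden_dim)
        # Head 1: per-note binary staff classifier (top vs. bottom staff).
        self.staff_head = nn.Linear(hidden_dim, 1)
        # Heads 2 and 3: pairwise link predictors for voice and chord edges.
        self.voice_head = nn.Linear(2 * hidden_dim, 1)
        self.chord_head = nn.Linear(2 * hidden_dim, 1)

    def forward(self, x, edge_index, candidate_pairs):
        # Encode every note with two rounds of message passing.
        h = torch.relu(self.conv1(x, edge_index))
        h = torch.relu(self.conv2(h, edge_index))

        staff_logits = self.staff_head(h).squeeze(-1)

        # Score each candidate (src, dst) note pair for voice and chord links.
        src, dst = candidate_pairs
        pair = torch.cat([h[src], h[dst]], dim=-1)
        voice_logits = self.voice_head(pair).squeeze(-1)
        chord_logits = self.chord_head(pair).squeeze(-1)
        return staff_logits, voice_logits, chord_logits
```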