We tested several source separation methods on a dataset of synthesized Bach chorales. Listening examples for each method are given below. The code used to create the chorales dataset is available in the synthesize-chorales repository.
The following listening examples are from chorale BWV 360 (chorale number 350 in the Riemenschneider edition). The chorale’s score is available here.
Baseline: Score-informed NMF
As a baseline, we have implemented the score-informed NMF technique described in this paper (PDF):
Ewert, S., & Müller, M. (2012). Using score-informed constraints for NMF-based source separation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 129-132).
Our Python implementation is available in the score-informed-nmf repository. We ran four experiments:
- Experiment A: parameter values from the original paper
- Experiment B: smaller activation tolerances
- Experiment C: smaller frequency tolerance
- Experiment D: larger STFT window
example | mix | soprano | alto | tenor | bass |
---|---|---|---|---|---|
reference | | | | | |
Ex. A | | | | | |
Ex. B | | | | | |
Ex. C | | | | | |
Ex. D | | | | | |
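The score-informed constraints behind this baseline can be sketched as follows. This is a minimal NumPy illustration of the general idea in Ewert and Müller (2012), not the code in the score-informed-nmf repository: the function names, the default tolerances, the number of harmonics, and the choice of Euclidean multiplicative updates are all assumptions made for the example.

```python
import numpy as np

def midi_to_hz(pitch):
    """Convert a MIDI pitch number to frequency in Hz (A4 = 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((pitch - 69) / 12.0)

def init_templates(pitches, freqs, n_harmonics=10, tol_semitones=1.0):
    """One spectral template per pitch: ones in a band around each harmonic,
    zeros elsewhere. The band width is the (assumed) frequency tolerance."""
    W = np.zeros((len(freqs), len(pitches)))
    for k, p in enumerate(pitches):
        f0 = midi_to_hz(p)
        for h in range(1, n_harmonics + 1):
            lo = h * f0 * 2.0 ** (-tol_semitones / 12.0)
            hi = h * f0 * 2.0 ** (tol_semitones / 12.0)
            W[(freqs >= lo) & (freqs <= hi), k] = 1.0
    return W

def init_activations(notes, pitches, frame_times, tol_seconds=0.2):
    """Activations are allowed only while the score says a note sounds,
    widened by the (assumed) activation tolerance on both sides.
    notes: list of (onset_s, offset_s, midi_pitch)."""
    H = np.zeros((len(pitches), len(frame_times)))
    index = {p: k for k, p in enumerate(pitches)}
    for onset, offset, p in notes:
        active = (frame_times >= onset - tol_seconds) & (frame_times <= offset + tol_seconds)
        H[index[p], active] = 1.0
    return H

def nmf(V, W, H, n_iter=50, eps=1e-9):
    """Euclidean multiplicative updates. Entries initialized to zero stay zero,
    which is how the score constraints are enforced."""
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy example: a random "spectrogram" and a two-note score (hypothetical data).
freqs = np.linspace(0.0, 4000.0, 257)
frame_times = np.arange(40) * 0.05
notes = [(0.0, 1.0, 60), (1.0, 2.0, 67)]
pitches = [60, 67]
W0 = init_templates(pitches, freqs)
H0 = init_activations(notes, pitches, frame_times)
V = np.random.default_rng(0).random((len(freqs), len(frame_times)))
W, H = nmf(V, W0.copy(), H0.copy())
```

Because a multiplicative update cannot turn a zero entry into a nonzero one, the zeros placed by the score survive every iteration. Narrowing `tol_seconds` or `tol_semitones` tightens the constraints, which is presumably the kind of change explored in Experiments B and C.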
Wave-U-Net
We evaluated a deep learning separation technique called Wave-U-Net on our dataset. Wave-U-Net is described in the following paper (arXiv):
Stoller, D., Ewert, S., & Dixon, S. (2018). Wave-U-Net: A multi-scale neural network for end-to-end audio source separation. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR) (pp. 334–340).
Soprano and bass mixtures
Experiment 1 tested extracting soprano and bass from mixtures containing only these two voices.
example | mix | soprano | bass |
---|---|---|---|
reference | | | |
Ex. 1 | | | |
Extract all four voices
Experiments 2–3 tested extracting all four voices from mixtures of all four voices. Experiment 2 used one model to extract all voices, and Experiment 3 used one model per voice.
example | mix | soprano | alto | tenor | bass |
---|---|---|---|---|---|
reference | | | | | |
Ex. 2 | | | | | |
Ex. 3 | | | | | |
Higher-variability dataset
Experiment 4 tested a dataset with more sources of variability than the original Bach chorales:
- Tempo is varied (75–100 BPM)
- Simulated breaths are inserted every 8 beats
- Randomly chosen notes from each voice are omitted
example | mix | soprano | alto | tenor | bass |
---|---|---|---|---|---|
reference | | | | | |
Ex. 4 | | | | | |
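The three sources of variability listed above could be applied to a symbolic note list along the following lines. This is only a sketch under stated assumptions: the note format, the `add_variability` function, the breath length, and the omission probability are hypothetical and are not taken from the synthesize-chorales repository. In particular, a "breath" is modeled here simply by shortening any note that ends within the last half beat before each 8-beat boundary.

```python
import math
import random

def add_variability(notes, rng, tempo_range=(75.0, 100.0),
                    breath_every=8, breath_beats=0.5, omit_prob=0.1):
    """notes: list of (onset_beat, duration_beats, midi_pitch) for one voice.
    Returns (tempo_bpm, list of (onset_s, duration_s, midi_pitch))."""
    tempo = rng.uniform(*tempo_range)      # one random tempo per chorale
    sec_per_beat = 60.0 / tempo
    out = []
    for onset, dur, pitch in notes:
        if rng.random() < omit_prob:       # randomly omit this note
            continue
        end = onset + dur
        # First breath boundary at or after the end of the note.
        boundary = math.ceil(end / breath_every) * breath_every
        gap_start = boundary - breath_beats
        if onset < gap_start < end:        # note runs into the breath gap
            dur = gap_start - onset        # shorten it to leave a silence
        out.append((onset * sec_per_beat, dur * sec_per_beat, pitch))
    return tempo, out

# Hypothetical voice: sixteen one-beat notes.
voice = [(float(b), 1.0, 60 + b % 5) for b in range(16)]
tempo, varied = add_variability(voice, random.Random(0))
```

Notes held across a breath boundary are left untouched in this sketch; a fuller implementation would need to decide how to split them.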
Score-informed Wave-U-Net
We created a variant of Wave-U-Net that is conditioned on the musical score. Our code is available in the score-informed-Wave-U-Net repository. We experimented with different score representations and conditioning locations.
One model per voice vs. multi-source training
Experiments 6–7 used the normalized pitch score representation and the input-output conditioning locations. Experiment 6 used one model per voice, whereas Experiment 7 used multi-source training, in which one model is trained to extract any of the four voices guided only by the score.
example | mix | soprano | alto | tenor | bass |
---|---|---|---|---|---|
reference | | | | | |
Ex. 6 | | | | | |
Ex. 7 | | | | | |
Comparison of score conditioning methods
Experiments 8–10 compared four score representations (normalized pitch, pitch and amplitude, piano roll, and pure tone) and three score conditioning locations (input, output, and input-output), for a total of 12 configurations in each experiment. All three experiments used the higher-variability dataset. Each experiment used a different model type:
- Experiment 8: one model for all voices
- Experiment 9: one model per voice (tested on tenor only, which is the most challenging voice to separate)
- Experiment 10: multi-source training
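To make the configuration grid concrete, the sketch below enumerates the 12 combinations and builds three of the score representations from a note list. The representation definitions are an assumed reading of their names (a binary pitch-by-time matrix, a single track of MIDI pitch scaled to [0, 1], and a sine wave at the notated pitch); they are not taken from the score-informed-Wave-U-Net repository, and the pitch and amplitude representation is left out because its definition cannot be inferred from the name alone. All function names are hypothetical.

```python
from itertools import product
import numpy as np

SCORE_TYPES = ["normalized pitch", "pitch and amplitude", "piano roll", "pure tone"]
LOCATIONS = ["input", "output", "input-output"]
configs = list(product(SCORE_TYPES, LOCATIONS))  # 4 x 3 = 12 configurations

def midi_to_hz(pitch):
    return 440.0 * 2.0 ** ((pitch - 69) / 12.0)

def piano_roll(notes, n_frames, frame_rate, n_pitches=128):
    """Binary matrix: 1 where a pitch sounds in a frame (assumed definition).
    notes: list of (onset_s, offset_s, midi_pitch)."""
    roll = np.zeros((n_pitches, n_frames))
    for onset, offset, pitch in notes:
        a, b = int(round(onset * frame_rate)), int(round(offset * frame_rate))
        roll[pitch, a:b] = 1.0
    return roll

def normalized_pitch(notes, n_frames, frame_rate):
    """Single track: MIDI pitch / 127 while a note sounds, 0 for silence
    (assumed definition)."""
    track = np.zeros(n_frames)
    for onset, offset, pitch in notes:
        a, b = int(round(onset * frame_rate)), int(round(offset * frame_rate))
        track[a:b] = pitch / 127.0
    return track

def pure_tone(notes, n_samples, sr):
    """Sample-level sine wave at each note's notated frequency (assumed definition)."""
    out = np.zeros(n_samples)
    for onset, offset, pitch in notes:
        a = int(round(onset * sr))
        b = min(int(round(offset * sr)), n_samples)
        t = np.arange(b - a) / sr
        out[a:b] = np.sin(2.0 * np.pi * midi_to_hz(pitch) * t)
    return out

# Hypothetical one-voice score: two half-second notes.
notes = [(0.0, 0.5, 60), (0.5, 1.0, 64)]
roll = piano_roll(notes, n_frames=100, frame_rate=100)
track = normalized_pitch(notes, n_frames=100, frame_rate=100)
tone = pure_tone(notes, n_samples=16000, sr=16000)
```

A frame-level representation such as `roll` or `track` would typically be resampled to the audio rate before being concatenated with the waveform at the input or output of the network; the exact mechanism used in these experiments is not described here.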
Experiment 8
score type | conditioning location | mix | soprano | alto | tenor | bass |
---|---|---|---|---|---|---|
reference | | | | | | |
normalized pitch | input | | | | | |
normalized pitch | output | | | | | |
normalized pitch | input-output | | | | | |
pitch and amplitude | input | | | | | |
pitch and amplitude | output | | | | | |
pitch and amplitude | input-output | | | | | |
piano roll | input | | | | | |
piano roll | output | | | | | |
piano roll | input-output | | | | | |
pure tone | input | | | | | |
pure tone | output | | | | | |
pure tone | input-output | | | | | |
Experiment 9
score type | conditioning location | mix | tenor |
---|---|---|---|
reference | | | |
normalized pitch | input | | |
normalized pitch | output | | |
normalized pitch | input-output | | |
pitch and amplitude | input | | |
pitch and amplitude | output | | |
pitch and amplitude | input-output | | |
piano roll | input | | |
piano roll | output | | |
piano roll | input-output | | |
pure tone | input | | |
pure tone | output | | |
pure tone | input-output | | |
Experiment 10
In this experiment, the output conditioning location failed to train, so results are only shown for input and input-output.
score type | conditioning location | mix | soprano | alto | tenor | bass |
---|---|---|---|---|---|---|
reference | | | | | | |
normalized pitch | input | | | | | |
normalized pitch | input-output | | | | | |
pitch and amplitude | input | | | | | |
pitch and amplitude | input-output | | | | | |
piano roll | input | | | | | |
piano roll | input-output | | | | | |
pure tone | input | | | | | |
pure tone | input-output | | | | | |
Download
A zip file containing all listening examples in original quality can be downloaded here (382 MB).