Key finding

Splitting into 6 stems instead of 4 reduces reconstruction error by 1.2 dB (-22.8 dB for 6-stem vs -21.6 dB for 4-stem, relative to the original mix).

Methodology

Each mixture in the MUSDB18-7s test set was separated into stems, which were then summed back together. The sum was compared to the original mixture using three metrics:

  • Mean squared error (MSE): mean of the squared sample-by-sample difference between original and reconstruction
  • Pearson correlation: captures temporal synchrony and shape fidelity between original and reconstruction
  • dB difference: the reconstruction error expressed as a level relative to the original signal, in dB
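The three metrics above can be sketched in a few lines of NumPy. This is a minimal illustration, not the script used for the reported numbers. The function name reconstruction_metrics is hypothetical, and two details are assumptions: the dB difference is taken as 10·log10 of residual power over original power, and multi-channel audio is flattened into one vector before comparison.

```python
import numpy as np


def reconstruction_metrics(original, reconstruction):
    """Return (mse, correlation, db_difference) for two equal-length signals.

    Assumptions (not confirmed by the source):
      * channels are flattened into a single vector before comparison
      * dB difference = 10 * log10(residual power / original power)
    """
    original = np.asarray(original, dtype=np.float64).ravel()
    reconstruction = np.asarray(reconstruction, dtype=np.float64).ravel()

    # Residual: what is left after subtracting the summed stems from the mix.
    residual = original - reconstruction
    mse = float(np.mean(residual ** 2))

    # Pearson correlation between the two waveforms.
    correlation = float(np.corrcoef(original, reconstruction)[0, 1])

    # Level of the residual relative to the original, in dB.
    signal_power = float(np.mean(original ** 2))
    if mse == 0.0:
        db_difference = float("-inf")  # perfect reconstruction
    else:
        db_difference = float(10.0 * np.log10(mse / signal_power))

    return mse, correlation, db_difference
```

As a sanity check, a reconstruction that is the original scaled by 0.9 leaves a residual at one tenth of the original amplitude, which this definition reports as -20 dB with a correlation of 1.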

Separation modes tested:

  • 2-stem: vocals and accompaniment (accompaniment = drums + bass + other summed from a 4-stem run)
  • 4-stem: vocals, drums, bass, other using htdemucs_ft
  • 6-stem: vocals, drums, bass, guitar, piano, other using htdemucs_6s
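The summing step and the 2-stem derivation can be sketched as follows. This assumes the stems are already available as NumPy arrays of identical shape, keyed by stem name. The helper names sum_stems and two_stem_from_four are hypothetical, and the sketch leaves out the separation itself.

```python
import numpy as np


def sum_stems(stems):
    """Sum a dict of equally shaped stem arrays back into one mixture."""
    return np.sum(list(stems.values()), axis=0)


def two_stem_from_four(stems):
    """Build the 2-stem mode from a 4-stem result.

    accompaniment = drums + bass + other, as described in the mode list.
    """
    accompaniment = stems["drums"] + stems["bass"] + stems["other"]
    return {"vocals": stems["vocals"], "accompaniment": accompaniment}
```

In this sketch the 2-stem sum equals the 4-stem sum up to floating-point rounding, which is consistent with the near-identical 2-stem and 4-stem results reported.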

Device: Apple M4 MPS. Dataset: MUSDB18-7s (50 test tracks).

What these numbers tell you

A perfect reconstruction would have MSE = 0, correlation = 1.0, and dB difference of -infinity (no difference signal at all). Real-world reconstruction is imperfect because the separation model introduces filter ringing, bleed, and frequency masking artefacts that don’t cancel when stems are summed.

The practical implication: if you separate a track and then re-mix the stems at equal volume, the result will not be bit-identical to the original mix. The dB difference indicates how loud the residual is relative to the original, and therefore how audible the difference could be in principle.

Reconstruction Metrics by Stem Count

Lower MSE = better. Higher correlation = better. More negative dB difference = better (difference signal is quieter relative to original).

Mode                                  Tracks  Mean MSE  Correlation (r)  Difference (dB)
2-stem (vocals + accompaniment)       50      0.000169  0.9961           -21.7
4-stem (vocals, drums, bass, other)   50      0.000169  0.996            -21.6
6-stem (+ guitar, piano)              50      0.000138  0.9967           -22.8
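
To put the dB figures in more familiar terms, a level in dB can be converted to a linear ratio. The sketch below assumes the dB difference is a power ratio of residual to original, which the source does not state explicitly. The function name db_to_ratios is hypothetical.

```python
def db_to_ratios(db):
    """Convert a level in dB to (power_ratio, amplitude_ratio).

    Assumes the dB value was computed as 10 * log10(power ratio),
    which is equivalent to 20 * log10(amplitude ratio).
    """
    power_ratio = 10.0 ** (db / 10.0)
    amplitude_ratio = 10.0 ** (db / 20.0)
    return power_ratio, amplitude_ratio


# Residual level relative to the original for the 4-stem and 6-stem modes.
for mode, db in [("4-stem", -21.6), ("6-stem", -22.8)]:
    power, amplitude = db_to_ratios(db)
    print(f"{mode}: residual RMS is about {amplitude:.1%} of the original")
```

Under that assumption, the residual RMS is roughly 8% of the original for the 4-stem mode and roughly 7% for the 6-stem mode.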