How Much Information Does Stem Separation Actually Lose?

Key finding: Splitting into 6 stems instead of 4 reduces reconstruction error by 1.2 dB (-22.8 dB for 6-stem vs -21.6 dB for 4-stem, relative to the original).

Methodology

Each mixture in the MUSDB18-7s test set was separated into stems, which were then summed back together. The sum was compared to the original mixture using three metrics:

Mean squared error (MSE): mean of the squared sample-wise difference between original and reconstruction
Pearson correlation: captures temporal synchrony and shape fidelity between original and reconstruction
dB difference: level of the difference signal (original minus reconstruction) relative to the level of the original signal, in dB
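
The three metrics can be sketched as below. This is a minimal illustration, not the script used for the benchmark: the function name reconstruction_metrics is hypothetical, signals are plain lists of floats so that only the standard library is needed, and the dB difference is assumed to be the power of the difference signal divided by the power of the original, since the exact formula is not stated here.

```python
import math

def reconstruction_metrics(original, reconstruction):
    """Return (mse, pearson_r, diff_db) for two equal-length mono signals."""
    if len(original) != len(reconstruction) or not original:
        raise ValueError("signals must be non-empty and the same length")
    n = len(original)

    # MSE: mean of the squared sample-wise difference.
    mse = sum((o - r) ** 2 for o, r in zip(original, reconstruction)) / n

    # Pearson correlation: covariance normalised by both standard deviations.
    mean_o = sum(original) / n
    mean_r = sum(reconstruction) / n
    cov = sum((o - mean_o) * (r - mean_r) for o, r in zip(original, reconstruction))
    var_o = sum((o - mean_o) ** 2 for o in original)
    var_r = sum((r - mean_r) ** 2 for r in reconstruction)
    denom = math.sqrt(var_o * var_r)
    pearson_r = cov / denom if denom > 0 else math.nan

    # dB difference (ASSUMED definition): residual power over original power.
    power_o = sum(o * o for o in original) / n
    if mse == 0:
        diff_db = -math.inf   # no difference signal at all
    elif power_o == 0:
        diff_db = math.inf    # silent original, non-silent residual
    else:
        diff_db = 10 * math.log10(mse / power_o)
    return mse, pearson_r, diff_db


# Toy check: a reconstruction that is 10% too quiet leaves a residual at -20 dB.
original = [math.sin(2 * math.pi * 440 * t / 44100) for t in range(4410)]
reconstruction = [0.9 * s for s in original]
mse, pearson_r, diff_db = reconstruction_metrics(original, reconstruction)
print(f"MSE={mse:.6f}  r={pearson_r:.4f}  diff={diff_db:.1f} dB")
```

On real audio the same arithmetic would normally be vectorised over arrays. The toy check also shows that a pure gain error leaves the correlation at 1.0 even though the MSE and dB difference are non-zero, which is why the metrics are reported together.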

Separation modes tested:

2-stem: vocals and accompaniment (accompaniment = drums + bass + other summed from a 4-stem run)
4-stem: vocals, drums, bass, other using htdemucs_ft
6-stem: vocals, drums, bass, guitar, piano, other using htdemucs_6s
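
The 2-stem mode is derived from the 4-stem output rather than run separately, so its reconstruction sums the same four signals, only grouped differently. A minimal sketch of that derivation follows; the helper names mix and to_two_stem are hypothetical, and stems are assumed to be equal-length lists of float samples keyed by stem name.

```python
def mix(*signals):
    """Sample-wise sum of mono signals (assumes they are all the same length)."""
    return [sum(samples) for samples in zip(*signals)]

def to_two_stem(stems):
    """Collapse a 4-stem result into vocals + accompaniment."""
    accompaniment = mix(stems["drums"], stems["bass"], stems["other"])
    return {"vocals": stems["vocals"], "accompaniment": accompaniment}


# Toy 2-sample stems, chosen so the floating-point sums are exact.
stems = {
    "vocals": [0.5, 0.25],
    "drums": [0.125, 0.0],
    "bass": [0.25, -0.5],
    "other": [-0.125, 0.5],
}
two = to_two_stem(stems)
four_stem_sum = mix(*stems.values())
two_stem_sum = mix(two["vocals"], two["accompaniment"])
print(two_stem_sum == four_stem_sum)
```

With real audio the two sums can differ by floating-point rounding, because the additions happen in a different order.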

Device: Apple M4 MPS. Dataset: MUSDB18-7s (50 test tracks).

What these numbers tell you

A perfect reconstruction would have an MSE of 0, a correlation of 1.0, and a dB difference of -infinity (no difference signal at all). Real-world reconstruction is imperfect because the separation model introduces filter ringing, bleed, and frequency-masking artefacts that don’t cancel when stems are summed.

The practical implication: if you separate a track and then re-mix the stems at equal volume, the result will not be bit-identical to the original mix. The dB difference shows how loud the leftover difference signal is relative to the original, and therefore how audible it could be in principle.
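
To get a feel for what the dB figures mean, they can be converted back into linear ratios. The sketch below assumes the dB difference is a standard level ratio (20 log10 of an RMS amplitude ratio, equivalently 10 log10 of a power ratio); the function names are hypothetical.

```python
def db_to_amplitude_ratio(db):
    """RMS amplitude of the residual as a fraction of the original."""
    return 10 ** (db / 20)

def db_to_power_ratio(db):
    """Power of the residual as a fraction of the original."""
    return 10 ** (db / 10)


for label, db in [("4-stem", -21.6), ("6-stem", -22.8)]:
    ratio = db_to_amplitude_ratio(db)
    print(f"{label}: {db} dB -> residual RMS is {ratio:.1%} of the original")
# 4-stem: -21.6 dB -> residual RMS is 8.3% of the original
# 6-stem: -22.8 dB -> residual RMS is 7.2% of the original
```

Under that assumption, the 1.2 dB gap between the two modes corresponds to a residual that is roughly 13% lower in RMS amplitude for the 6-stem run.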

Reconstruction Metrics by Stem Count

Lower MSE = better. Higher correlation = better. More negative dB difference = better (difference signal is quieter relative to original).

Mode	Tracks	Mean MSE	Correlation (r)	Difference (dB)
2-stem (vocals + accompaniment)	50	0.000169	0.9961	-21.7
4-stem (vocals, drums, bass, other)	50	0.000169	0.9960	-21.6
6-stem (+ guitar, piano)	50	0.000138	0.9967	-22.8