What Makes a Track Hard to Separate?

Key finding: The strongest predictor of vocal SDR is chroma variance (harmonic complexity), with r = 0.522. Tracks with a higher harmonic-to-percussive ratio tend to separate more cleanly.

Methodology

Audio features were extracted from each MUSDB18-7s mixture using librosa. These features were then correlated with the per-track SDR values from the htdemucs_ft model comparison run (Test 2) using Pearson r.

Features computed:

RMS energy (overall loudness)
Spectral centroid (brightness)
Spectral flatness (how tonal vs noisy the signal is)
Zero crossing rate (noisiness proxy)
Harmonic and percussive energy (via HPSS decomposition)
Harmonic-to-percussive ratio
Onset density (rhythmic activity)
Chroma variance (harmonic complexity)
Vocal band energy (200 Hz-3 kHz)

Each feature was correlated separately with vocal, drums, bass, and other SDR. The table shows the top predictors for vocal SDR specifically.
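
The per-stem correlation step can be sketched in plain Python as follows. This is a minimal illustration, not the report's own code: the names pearson_r and rank_predictors, and the layout of the input dictionary, are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_predictors(features, sdr):
    """Rank features by |r| against one stem's per-track SDR.

    features: dict mapping feature name -> list of per-track values
    sdr: list of per-track SDR values for one stem, in the same track order
    """
    scored = [(name, pearson_r(values, sdr)) for name, values in features.items()]
    # Sort by magnitude so strong negative predictors rank alongside positive ones.
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)
```

Calling rank_predictors once per stem (vocal, drums, bass, other) would give one ranked list per stem; sorting on the absolute value keeps the sign of r available for interpretation.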

Interpretation

A positive correlation means tracks with a higher feature value tend to produce higher SDR. A negative correlation means the feature is associated with worse separation. Values close to 0 indicate no reliable relationship in this dataset.

Here, Pearson r magnitudes above 0.3 are treated as moderate and those above 0.5 as substantial. With only 50 tracks, the uncertainty on each r is wide, so even moderate correlations should be interpreted carefully.
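
One way to make this caution concrete is an approximate confidence interval for r based on the Fisher z-transform. The sketch below is an illustrative addition, not part of the original analysis; the function name pearson_ci is hypothetical, and the method assumes roughly bivariate-normal data.

```python
import math


def pearson_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for Pearson r via the Fisher z-transform.

    z_crit=1.96 corresponds to a two-sided 95% interval.
    """
    z = math.atanh(r)              # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)    # standard error of z for n samples
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)
```

With n = 50, pearson_ci(0.522, 50) gives an interval of roughly 0.29 to 0.70, while pearson_ci(-0.265, 50) gives an interval that includes zero.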

Top Predictors of Vocal Separation Quality

Pearson r with vocal SDR. Positive r = higher feature value predicts better separation. Negative r = higher value predicts worse separation.

Audio Feature	Pearson r	n tracks
Chroma variance (harmonic complexity)	0.522	50
Spectral flatness (tonal vs noisy)	-0.265	50
Zero crossing rate (noisiness)	-0.256	50
Overall loudness (RMS energy)	-0.235	50
Vocal band energy (200 Hz-3 kHz)	-0.221	50