Key finding

The strongest predictor of vocal SDR is chroma variance (harmonic complexity), with r = 0.522. Tracks with higher chroma variance tend to separate more cleanly.

Methodology

Audio features were extracted from each MUSDB18-7s mixture using librosa. These features were then correlated with the per-track SDR values from the htdemucs_ft model comparison run (Test 2) using Pearson r.

Features computed:

  • RMS energy (overall loudness)
  • Spectral centroid (brightness)
  • Spectral flatness (how tonal vs noisy the signal is)
  • Zero crossing rate (noisiness proxy)
  • Harmonic and percussive energy (via HPSS decomposition)
  • Harmonic-to-percussive ratio
  • Onset density (rhythmic activity)
  • Chroma variance (harmonic complexity)
  • Vocal band energy (200 Hz to 3 kHz)

Each feature was correlated separately with the vocal, drums, bass, and other SDR values. The table below lists the top predictors for vocal SDR only.
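
The per-stem correlation step can be sketched in plain Python. The names pearson_r and correlate_features and the input layout (one list of values per feature, one list of SDR scores per stem, both in the same track order) are assumptions for illustration, not the report's actual code.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # a constant input has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def correlate_features(features, sdr):
    """Rank features by |r| for each stem.

    features: {feature_name: [value per track]}
    sdr:      {stem_name: [SDR per track]}, same track order as features
    Returns   {stem_name: [(feature_name, r), ...]} sorted by |r|, largest first.
    """
    ranked = {}
    for stem, scores in sdr.items():
        pairs = [(name, pearson_r(vals, scores)) for name, vals in features.items()]
        # Undefined correlations (NaN) sort to the end.
        pairs.sort(
            key=lambda p: 0.0 if math.isnan(p[1]) else abs(p[1]), reverse=True
        )
        ranked[stem] = pairs
    return ranked
```

Taking the first few entries of ranked["vocals"] would give a top-predictors list in the same shape as the table in this report.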

Interpretation

A positive correlation means tracks with a higher feature value tend to produce higher SDR. A negative correlation means the feature is associated with worse separation. Values close to 0 indicate no reliable relationship in this dataset.

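Pearson r magnitudes above 0.3 are considered moderate; magnitudes above 0.5 are considered substantial. With only 50 tracks, |r| must exceed roughly 0.28 to differ from zero at the two-sided 5% level, so even moderate correlations should be interpreted carefully.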
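
One way to make the sample-size caveat concrete is an approximate confidence interval for r using the Fisher z-transformation. This is a sketch under the usual assumptions of that approximation (roughly bivariate-normal data, independent tracks); the function name fisher_ci is hypothetical, and the report itself does not state that intervals were computed.

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, confidence=0.95):
    """Approximate confidence interval for a Pearson r from n paired samples."""
    if n <= 3:
        raise ValueError("Fisher z interval needs n > 3")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)                 # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)       # standard error on the z scale
    crit = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


# Two rows from the vocal SDR table, each with n = 50 tracks.
for name, r in [("Chroma variance", 0.522), ("Spectral flatness", -0.265)]:
    low, high = fisher_ci(r, 50)
    print(f"{name}: r = {r:+.3f}, 95% CI [{low:+.3f}, {high:+.3f}]")
```

Under this approximation the interval for chroma variance stays above zero, while the interval for spectral flatness spans zero, which is consistent with treating the weaker correlations as tentative.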

Top Predictors of Vocal Separation Quality

Pearson r with vocal SDR. Positive r = higher feature value predicts better separation. Negative r = higher value predicts worse separation.

Audio feature                           Pearson r   n tracks
Chroma variance (harmonic complexity)       0.522         50
Spectral flatness (tonal vs noisy)         -0.265         50
Zero crossing rate (noisiness)             -0.256         50
Overall loudness (RMS energy)              -0.235         50
Vocal band energy (200 Hz to 3 kHz)        -0.221         50