HTDemucs Model Variant Comparison: Quality vs Speed

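Key finding: HTDemucs (base) achieves the highest mean SDR, 8.38 dB averaged over the four stems, while running about 3.6x faster than the fine-tuned variant (1.8 s vs 6.4 s per clip).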

Methodology

All four publicly available HTDemucs variants were evaluated on the full 50-track MUSDB18-7s test set. Each model processed every track independently, and the wall-clock time per track was recorded. SDR was computed against the ground-truth stems using BSSEval v4 via mir_eval.
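The per-track timing described above can be sketched roughly as follows. This is a minimal illustration, not the harness used for these results: `time_per_clip` and `fake_separate` are hypothetical names, and the sleep call stands in for a real model so the sketch runs without Demucs installed.

```python
import time
from statistics import mean
from typing import Callable, Iterable, List


def time_per_clip(separate: Callable[[str], object], clips: Iterable[str]) -> List[float]:
    """Return the wall-clock seconds spent in `separate` for each clip (one run each)."""
    timings = []
    for clip in clips:
        start = time.perf_counter()
        separate(clip)  # hypothetical callable: runs one model on one clip
        timings.append(time.perf_counter() - start)
    return timings


def fake_separate(path: str) -> None:
    # Stand-in workload (assumption): replace with a real separation call.
    time.sleep(0.01)


timings = time_per_clip(fake_separate, ["clip_a.wav", "clip_b.wav", "clip_c.wav"])
print(f"avg {mean(timings):.3f} s over {len(timings)} clips")
```

On GPU backends such as MPS, work may be queued asynchronously, so a real harness would likely need to synchronize the device (for example with `torch.mps.synchronize()` in recent PyTorch) before reading the clock.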

Models evaluated:

htdemucs – base hybrid transformer-convolutional model
htdemucs_ft – fine-tuned version of htdemucs (same training set according to the Demucs README, at roughly 4x the separation time); generally higher SDR on standard benchmarks
htdemucs_6s – 6-stem variant (adds guitar and piano stems); only standard 4-stem SDR reported for comparability
hdemucs_mmi – the older Hybrid Demucs (non-transformer) variant trained with multi-mirror input on extra data

Device: Apple M4 (MPS backend). All models were run at their default segment size.

Notes on speed numbers

Times are per 7-second clip, not per minute of audio. To extrapolate to real-world usage: a 4-minute (240 s) track at 44.1 kHz spans about 34 clip lengths (240 / 7 ≈ 34.3), so it would take roughly 34x the per-clip time shown, assuming runtime scales linearly with duration. Each clip was timed in a single run; actual throughput varies with background system load.
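The extrapolation above is simple arithmetic, sketched here under the assumption that runtime scales linearly with audio duration. The per-clip times are copied from the results table; `estimate_track_time` is a hypothetical helper name.

```python
CLIP_SECONDS = 7.0

# Average seconds per 7-second clip, copied from the results table.
AVG_CLIP_TIME = {
    "HTDemucs (base)": 1.8,
    "HTDemucs FT": 6.4,
    "Hybrid Demucs MMI": 3.3,
    "HTDemucs 6S": 1.6,
}


def estimate_track_time(per_clip_seconds: float, track_seconds: float) -> float:
    """Scale a per-clip time to a full track, assuming linear scaling."""
    return per_clip_seconds * (track_seconds / CLIP_SECONDS)


TRACK_SECONDS = 4 * 60  # a 4-minute track
estimates = {
    name: estimate_track_time(t, TRACK_SECONDS) for name, t in AVG_CLIP_TIME.items()
}
for name, seconds in estimates.items():
    print(f"{name}: ~{seconds:.0f} s for a 4-minute track")
```

Under this assumption, the base model would need about a minute for a 4-minute track and the fine-tuned model between three and four minutes. Treat these as rough estimates, since segment overlap and model load time are not modeled.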

Model Quality and Speed

Per-stem SDR values are in dB (median across the 50 tracks). Mean SDR is the average of the four per-stem values. Time is the average in seconds per 7-second clip on Apple M4 MPS.

Model	Vocals SDR (dB)	Drums SDR (dB)	Bass SDR (dB)	Other SDR (dB)	Mean SDR (dB)	Avg Time (s)
HTDemucs (base)	8.86	10.52	9.56	4.56	8.38	1.8
HTDemucs FT (fine-tuned)	8.8	10.82	9.4	3.9	8.23	6.4
Hybrid Demucs MMI (extended data)	8.78	10.02	8.43	4.24	7.87	3.3
HTDemucs 6S (6-stem)	8.53	10.04	8.52	-3.42	5.92	1.6
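As a sanity check, the Mean SDR column can be recomputed from the per-stem values. The sketch below assumes Mean SDR is the plain arithmetic mean of the four stem SDRs, which matches the reported figures to within rounding; `ROWS` and `best` are illustrative names.

```python
from statistics import mean

# (vocals, drums, bass, other) SDR in dB and the reported Mean SDR, from the table.
ROWS = {
    "HTDemucs (base)": ([8.86, 10.52, 9.56, 4.56], 8.38),
    "HTDemucs FT": ([8.8, 10.82, 9.4, 3.9], 8.23),
    "Hybrid Demucs MMI": ([8.78, 10.02, 8.43, 4.24], 7.87),
    "HTDemucs 6S": ([8.53, 10.04, 8.52, -3.42], 5.92),
}

# Recompute each model's mean across stems.
recomputed = {name: mean(stems) for name, (stems, _) in ROWS.items()}

for name, (_, reported) in ROWS.items():
    diff = abs(recomputed[name] - reported)
    print(f"{name}: recomputed {recomputed[name]:.3f} dB, reported {reported} dB, diff {diff:.4f}")

# Model with the highest recomputed mean SDR.
best = max(recomputed, key=recomputed.get)
print("highest mean SDR:", best)
```

Note that the low mean for the 6-stem model is driven almost entirely by its negative Other SDR; its vocals, drums, and bass scores are close to those of Hybrid Demucs MMI.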