Methodology
Four publicly available Demucs models (three HTDemucs variants and one older Hybrid Demucs model) were evaluated on the full 50-track MUSDB18-7s test set. Each model processed every track independently, and the wall-clock time per track was recorded. SDR was computed against the ground-truth stems using BSSEval v4 via mir_eval.
Models evaluated:
htdemucs – base hybrid transformer-convolutional model
htdemucs_ft – fine-tuned on additional data, generally higher SDR on standard benchmarks
htdemucs_6s – 6-stem variant (adds guitar and piano stems); only standard 4-stem SDR reported for comparability
hdemucs_mmi – the older Hybrid Demucs (non-transformer) variant trained with multi-mirror input on extra data
Device: Apple M4 (MPS backend). All models were run at their default segment size.
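The per-track timing procedure described above can be sketched roughly as follows. This is a minimal illustration, not the script used for these measurements: `time_tracks` and the `separate` callable are hypothetical names, and `separate` stands in for whatever function runs one model on one track. On an asynchronous device backend, the callable is assumed to block until the separation has actually finished, otherwise the recorded time would be too short.

```python
import time
from typing import Callable, Dict, Iterable


def time_tracks(
    separate: Callable[[str], object],
    track_paths: Iterable[str],
) -> Dict[str, float]:
    """Return wall-clock seconds spent on each track for one model.

    `separate` is a hypothetical stand-in for the model call; it is
    assumed to return only after the separation is complete.
    """
    timings: Dict[str, float] = {}
    for path in track_paths:
        start = time.perf_counter()  # monotonic, high-resolution clock
        separate(path)
        timings[path] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    # Dummy workload in place of a real model, for illustration only.
    demo = time_tracks(lambda p: sum(range(100_000)), ["clip_a.wav", "clip_b.wav"])
    for name, seconds in demo.items():
        print(f"{name}: {seconds:.4f} s")
```

Running the same loop once per model gives one timing table per model; averaging over several runs would reduce the effect of background load.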
Notes on speed numbers
Times are per 7-second clip, not per minute of audio. To extrapolate to real-world usage: a 4-minute (240 s) track at 44100Hz would take roughly 240 / 7 ≈ 34x as long as the per-clip time shown, assuming processing time scales linearly with audio duration. These are single-run numbers; actual throughput varies with background system load.
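The extrapolation above is simple arithmetic; a small sketch makes the scale factor explicit. The function name `extrapolate_seconds` is hypothetical, and the linear-scaling assumption is only an approximation, since fixed costs such as model loading and segment overlap are ignored.

```python
def extrapolate_seconds(
    per_clip_seconds: float,
    track_seconds: float,
    clip_seconds: float = 7.0,
) -> float:
    """Estimate processing time for a longer track from a per-clip time.

    Assumes processing time scales linearly with audio duration.
    """
    if clip_seconds <= 0:
        raise ValueError("clip_seconds must be positive")
    return per_clip_seconds * (track_seconds / clip_seconds)


if __name__ == "__main__":
    # Scale factor for a 4-minute track relative to a 7-second clip.
    scale = extrapolate_seconds(1.0, 4 * 60)
    print(f"scale factor: {scale:.1f}x")  # 240 / 7, about 34.3
```

For example, a model with a per-clip time of 2 seconds would be estimated at roughly 69 seconds for a 4-minute track under this assumption.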