Stem splitting is one of those things that sounds impossible until you actually try it. You drop a finished song into a tool, wait 30 seconds, and get back four separate files: just the vocals, just the drums, just the bass, just everything else. No studio access required. No original session files. Just a two-minute pop song turned into its component parts.
That’s the pitch, anyway. The reality is a little more nuanced, which is why this guide exists.
This is the complete picture: what stem splitting is, where it came from, how the AI actually works, what you can realistically expect from it, and how to choose the right tool for what you’re trying to do. If you’ve landed here without much background, start reading straight through. If you already know the basics, use the sections below to jump to what you need.
What stem splitting actually is (and what it isn’t)
A stem, in audio production, is a bounced mix of a subset of tracks from a session. A producer might send a vocalist a “stem” containing just the drums and bass, so they can record over it. A mixing engineer receives stems (drums, synths, vocals, FX) rather than hundreds of individual tracks. The word predates AI by decades.
What’s changed is the direction of travel. Traditional stems always flowed outward: you had the session, you bounced the stems, you shared them. AI stem splitting reverses this. You start with a finished stereo mix and work backward, trying to reconstruct what the individual elements sounded like before they were mixed together.
That reconstruction is genuinely difficult. When you mix drums, bass, vocals and guitars into a stereo file, you’re collapsing dozens of tracks into 2 channels. Frequencies overlap. Reverb from the snare smears into the vocal range. The kick and the bass guitar share almost the same frequency band. You can’t perfectly “un-mix” a track any more than you can un-bake a cake.
What AI does is make a very educated guess, and it turns out those guesses are often good enough to be genuinely useful.
This is different from remixing an isolated vocal that someone posted online as an a cappella. It’s different from having access to the original multitrack stems from a record label. Stem splitting works on any song, from any era, with no cooperation required from whoever made it.
Where this technology came from
The idea of separating mixed audio into its sources is called Music Source Separation (MSS), and researchers have been working on it for a long time; see Wikipedia’s overview of stem separation for the academic history.
Early approaches, dating back to the 1990s and 2000s, relied on signal processing tricks: matrix stereo decoding, phase cancellation, non-negative matrix factorization. These worked poorly and were narrowly applicable. The “karaoke mode” on old amplifiers that removed center-panned vocals? That’s a phase cancellation trick, and it’s why the bass guitar often disappeared too.
The research community organized formal evaluation campaigns (SiSEC, the Signal Separation Evaluation Campaign) from around 2008 onward, which gave researchers a standardized benchmark for comparing their methods. Progress was slow.
Then deep learning arrived.
Around 2017, neural network approaches started dramatically outperforming everything before them. By training on multi-track studio recordings (specifically the MUSDB18 dataset, which contains 150 songs with separate stems), models learned what a kick drum “sounds like” in frequency terms versus what a bass guitar sounds like, and could start disentangling them. The research published by Meta AI’s team on Demucs represents one of the most significant milestones: a model that processes audio end-to-end using a waveform-based encoder-decoder architecture, later combined with spectrogram processing in HTDemucs.
The practical result was that quality crossed a threshold where the outputs were actually usable for music production purposes, not just as research curiosities.
How it works under the hood
The short version: the AI converts your audio into a spectrogram (a visual representation of frequency vs. time), then learns to produce a “mask” that separates each instrument’s contribution from the whole.
Your audio file, which looks like a waveform, actually contains thousands of overlapping frequencies happening simultaneously. A spectrogram makes those frequencies visible as a 2D image: time runs left to right, frequency runs bottom to top, and brightness shows how loud each frequency is at each moment. The kick drum appears as a cluster of bright pixels in the low-frequency region at each beat. The vocal appears as a shifting pattern in the mid-frequency range.
The neural network learns to identify those patterns from having seen (well, heard) thousands of examples of isolated instruments. During training, it sees both the mixed spectrogram and the target (e.g., “what just the drums look like”), and learns to predict a mask that, when applied to the mix, leaves only the drums behind.
The full technical breakdown goes deeper on spectrograms, time-frequency masking, and why some genres are much harder to separate than others. But the key intuition is: this is pattern recognition applied to sound.
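To make that concrete, here’s a minimal sketch of the transform-mask-invert pipeline using librosa. The `predict_vocal_mask` call is a hypothetical stand-in for a trained network; everything around it is the standard mechanics.

```python
import numpy as np
import librosa
import soundfile as sf

# Load a mix (mono keeps the sketch short; real tools work in stereo).
mix, sr = librosa.load("mix.wav", sr=None, mono=True)

# STFT: a complex spectrogram with shape (freq_bins, time_frames).
stft = librosa.stft(mix, n_fft=2048, hop_length=512)
magnitude, phase = np.abs(stft), np.angle(stft)

# A trained network would predict this from the magnitude spectrogram:
# a value in [0, 1] per time-frequency bin saying how much of that bin
# belongs to the vocals. `predict_vocal_mask` is hypothetical.
mask = predict_vocal_mask(magnitude)

# Keep the masked magnitude, reuse the mix's phase, invert back to audio.
vocals = librosa.istft(mask * magnitude * np.exp(1j * phase), hop_length=512)
sf.write("vocals.wav", vocals, sr)
```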
Demucs (specifically HTDemucs, the current state of the art) uses a hybrid approach that processes both the raw waveform and the spectrogram simultaneously, combining information from both domains. That’s why it tends to outperform older models on complex material.
Quality is measured using Signal-to-Distortion Ratio (SDR), where a higher number means less bleed from the other instruments into your target stem.
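In its simplest form, that’s just target energy over error energy, in decibels. A minimal version for intuition (benchmark evaluations like SiSEC use the fuller BSSEval variant, which decomposes the error into interference, noise and artifacts):

```python
import numpy as np

def sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Signal-to-Distortion Ratio in dB: how loud the true stem is
    relative to everything in the estimate that shouldn't be there."""
    error = estimate - reference
    return 10.0 * np.log10(np.sum(reference**2) / np.sum(error**2))
```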
What the output actually looks like
Standard stem splitting produces 4 outputs:
- Vocals (lead and backing vocals, anything the model identifies as singing)
- Drums (kick, snare, hi-hats, cymbals, percussion)
- Bass (bass guitar, sub-bass synthesizers)
- Other (everything else: guitars, keys, strings, synths, pads)
The “other” stem is a catch-all, and for dense arrangements it can be a muddy mix of things that don’t naturally belong together. This is one reason why going beyond 4 stems is sometimes valuable.
Some tools and models now go further, offering 4-stem vs 6-stem separation where guitar and piano are separated from the “other” bucket into their own tracks. Whether that extra granularity is useful depends heavily on what you’re doing with the stems.
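If you want to compare the two yourself, the demucs package exposes its command line as a Python entry point (this calling pattern is documented in the demucs README; `htdemucs` and `htdemucs_6s` are the model names shipped with recent releases):

```python
# pip install demucs
import demucs.separate

# 4-stem split with HTDemucs: vocals, drums, bass, other.
demucs.separate.main(["-n", "htdemucs", "song.mp3"])

# 6-stem split: guitar and piano come out of the "other" bucket.
demucs.separate.main(["-n", "htdemucs_6s", "song.mp3"])
```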
The files come out as WAV or FLAC, usually at the same sample rate and bit depth as your input. If you put in a 44.1kHz MP3 (a lossy format that has already discarded information during encoding), you’ll get 44.1kHz WAVs, but the AI doesn’t recover anything the MP3 encoder threw away.
Each stem is full stereo, matching the original track’s length exactly. You can import them directly into a DAW and the timing will be perfect.
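If you want to verify that before dragging files into a session, here’s a quick sanity check with the soundfile library (the filenames are illustrative):

```python
import soundfile as sf

mix = sf.info("song.wav")
for name in ("vocals.wav", "drums.wav", "bass.wav", "other.wav"):
    stem = sf.info(name)
    # Stems should match the input's sample rate and exact length.
    assert stem.samplerate == mix.samplerate, f"{name}: sample rate mismatch"
    assert stem.frames == mix.frames, f"{name}: length mismatch"
    print(f"{name}: {stem.samplerate} Hz, {stem.channels} ch, {stem.frames} frames")
```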
What you can realistically expect
Here’s where honesty matters.
Stem splitting works best on music that was recorded and produced in a studio with clean separation between instruments in the frequency domain. A polished pop track, a well-produced hip-hop beat, a simple singer-songwriter recording: these tend to separate cleanly.
Dense orchestral music is genuinely hard. The violin section, the cellos, the brass, and the woodwinds all overlap massively in frequency space, and the AI struggles to disentangle them. Live recordings with room bleed are harder than studio recordings. Genres with a lot of reverb (shoegaze, lo-fi) are harder than dry productions.
You will get artifacts. This is not a flaw unique to one tool; it’s inherent to the problem. Bleed and artifacts are the standard terms: bleed is when pieces of one instrument show up faintly in another stem, and artifacts are the watery, smeared sounds the separation process itself introduces. The hi-hat might ghost into the vocal stem. A bit of bass might leak into the drums. How much this matters depends on what you’re doing with the output.
For karaoke tracks and practice purposes, minor bleed is irrelevant. For professional remixing work where you need pristine isolation, it may be a dealbreaker unless you spend time cleaning up the stems manually in your DAW.
The models available (Demucs, MDX-Net, HTDemucs) have meaningfully different strengths on different material. Using the best model for your specific genre and use case makes a real difference.
The main ways people use it
The range of applications is wider than most people expect when they first encounter this technology.
Remixing and production. The most obvious use. Grab the vocal from a released track and build a new instrumental around it. Or isolate the drum pattern from a song you love and study exactly how the groove was constructed. Sampling and beatmaking workflows have been transformed by the ability to cleanly (or at least cleanly enough) isolate individual elements without having to chop around them in a sample.
More specifically, see stem splitting for sampling and beatmaking for how producers actually integrate this into their workflow.
Karaoke. The simplest application: remove the vocals, keep everything else. Making a karaoke track is one of the most common reasons people try stem splitting for the first time. The quality is usually good enough for home use, especially on modern pop where the vocal is well-separated in the mix.
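Because mixing is additive, you can also build the backing track yourself by summing the non-vocal stems from a standard 4-stem split; a minimal sketch with soundfile (filenames assumed):

```python
import soundfile as sf

# Load the non-vocal stems from a 4-stem split.
drums, sr = sf.read("drums.wav")
bass, _ = sf.read("bass.wav")
other, _ = sf.read("other.wav")

# drums + bass + other is (approximately) the original mix minus
# whatever the model assigned to the vocal stem.
sf.write("karaoke.wav", drums + bass + other, sr)
```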
Learning songs by ear. Isolate just the bass to figure out a bassline. Pull out just the guitar to learn a chord progression without the vocal distracting you. Using stem splitting to learn by ear is one of the most genuinely underrated use cases, especially for instrumentalists.
DJ performance. Modern DJ tools have started integrating stem separation directly into their software so DJs can do real-time vocal swaps, drum replacements and instrumental blends between tracks. Stem splitting for DJs covers this in detail, including which DJ software has it built in.
Audio restoration. Less talked about but increasingly useful: separating stems to clean up problematic recordings, reduce noise, or rescue audio where one element is obscuring another. Stem separation for audio restoration covers the specific workflows involved.
Practicing with real tracks. Vocalists slowing down a song to learn it, horn players figuring out a solo, pianists trying to hear a comping part underneath a busy arrangement. The isolation that stem splitting provides makes this far easier than trying to pick out an instrument from a full mix.
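That kind of practice loop is easy to script, too. For example, stretching an isolated bass stem to 75% speed without changing its pitch, using librosa’s time stretch (a sketch; the filename is assumed):

```python
import librosa
import soundfile as sf

# Load the isolated stem (mono is fine for practice purposes).
bass, sr = librosa.load("bass.wav", sr=None, mono=True)

# Play back at 75% speed, same pitch, so the line is easier to follow.
slowed = librosa.effects.time_stretch(bass, rate=0.75)
sf.write("bass_slow.wav", slowed, sr)
```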
Choosing your approach
There are three main axes along which stem splitting tools differ: online vs desktop, free vs paid, and standalone vs DAW-native.
Online tools run in a browser, require no installation, and work on any computer. The tradeoff is that you’re uploading your audio to someone’s servers, which matters if you’re working with unreleased music. Processing speed depends on server load. Online vs desktop stem splitters covers the full comparison.
Desktop tools run locally, which means your audio never leaves your machine. They’re generally faster once set up (especially with a good GPU), allow batch processing, and give you more control over which model you use. UVR5 (Ultimate Vocal Remover) is the most flexible free desktop option, supporting Demucs, MDX-Net and several other models.
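UVR5 is a GUI app, but the same local-only, batch-friendly workflow is scriptable with the demucs package itself; a sketch, assuming a folder of WAVs (the folder name here is made up):

```python
from pathlib import Path
import demucs.separate

# Everything runs on your machine; nothing is uploaded anywhere.
for track in sorted(Path("unreleased_demos").glob("*.wav")):
    demucs.separate.main(["-n", "htdemucs", "-o", "separated", str(track)])
```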
Free vs paid is a real distinction. Free tools work, but paid tools often offer better models, faster processing, cleaner UX, and features like stem refinement and noise reduction. The free vs paid breakdown walks through exactly what you get at each price point.
DAW-native integration is becoming more common. Ableton Live 12’s built-in stem separation lets you separate a clip directly in the arrangement view without leaving the DAW. Logic Pro has had a vocal separation feature for several versions. These integrations are convenient, but the quality doesn’t always match dedicated tools.
For most people who want to try this without installing anything, StemSplit.io is the online tool to start with. It uses HTDemucs under the hood, processes quickly, and the interface is straightforward. Drag in a file, get 4 stems back.
For the full breakdown of what to use when, see online vs desktop stem splitters and free vs paid stem splitters.
Getting stems into your workflow
Once you’ve got your stems, you need to do something with them.
Bringing stems into your DAW covers the practical steps for importing, aligning and working with stem files in Ableton, Logic, FL Studio and other DAWs. There are a few gotchas with sample rate matching and track alignment that are worth knowing before you start.
If you specifically want to isolate vocals, see how to isolate vocals from a song. If drums are what you’re after, how to extract drum stems goes into the specific considerations for percussion isolation, including model choice.
Where to go next
This guide is meant to be an entry point. Every section above has a corresponding deep-dive post:
Understanding the basics:
- What Are Audio Stems? — the term, the history, stems vs multitracks
- How AI Stem Separation Actually Works — spectrograms, neural networks, the technical detail
- Why Stem Splitters Aren’t Perfect — bleed, artifacts, what causes them
Choosing the right setup:
- 4-Stem vs 6-Stem Separation — when the extra granularity matters
- Demucs, MDX-Net, and HTDemucs — the main models compared
- Free vs Paid Stem Splitters — what the money gets you
- Online vs Desktop Stem Splitters — privacy, speed, control
Specific use cases:
- How to Isolate Vocals From a Song
- How to Extract Drum Stems
- How to Make a Karaoke Track
- Bringing Stems Into Your DAW
- Stem Splitting for DJs
- Stem Splitting for Sampling and Beatmaking
- Using Stem Splitting to Learn Songs by Ear
- AI Stem Separation for Audio Restoration
Have a specific question? The Stem Splitter FAQ covers the most common ones.