HeyGen voice cloning works — with two caveats most articles skip entirely. The first: HeyGen’s voice cloning is powered by ElevenLabs under the hood, which means you’re getting ElevenLabs quality inside a HeyGen workflow. The second: the output ranges from “close but slightly robotic” to “sounds genuinely like you” depending on which of the three cloning options you use, how you record your sample, and whether you know about the accent problem.
This article covers how it actually works, what each quality tier sounds like, which plan you need, and the specific recording mistakes that produce a version of your voice with a surprise British accent. *(It happens more than HeyGen’s marketing suggests.)*
Before walking through how the feature works, here’s the decision map every article in this space should show and none of them do. Your answer to the first question determines your entire setup path.
The diagram answers the question most people don’t know to ask. HeyGen voice cloning is not the right tool if your primary goal is pure audio output — for that, ElevenLabs at $5/month is a materially better deal for an equivalent quality tier. HeyGen’s voice cloning is the right tool specifically when the voice and the avatar are part of the same video workflow.
You upload a recording of your voice. HeyGen’s system — which runs on ElevenLabs’ voice engine under the hood — analyzes your pitch, rhythm, accent, and speech patterns to build a personalized voice model. That model is stored in your Voice Library. Every time you type a script in HeyGen’s AI Studio, the avatar speaks it in your cloned voice.
The ElevenLabs connection is worth naming clearly, because it explains the quality ceiling. HeyGen’s Instant Voice Clone is essentially ElevenLabs’ basic cloning tier, embedded in HeyGen’s interface. If you’ve ever tested ElevenLabs directly, the voice quality you heard there is approximately what you’re getting inside HeyGen, with the advantage that your avatar and your voice live in the same platform.
HeyGen’s own community documentation confirms the ElevenLabs connection. For users who have already invested in a Professional Voice Clone on ElevenLabs, HeyGen lets you import it directly, so you can use your higher-quality ElevenLabs voice with your HeyGen avatar.
Once cloned, the voice does three things. It speaks any script you type — no re-recording needed when content changes. It speaks in multiple languages while preserving your voice characteristics. And it can be assigned as the default for any avatar, making your entire video series consistent without you ever opening a microphone again.
There is no single “HeyGen voice cloning.” There are three distinct options with meaningfully different quality outputs, audio requirements, and plan requirements. Most articles treat them as one thing. They’re not.
For most creators, Instant Voice Clone is the right starting point. It takes minutes, requires no additional subscriptions, and the output quality is sufficient for corporate video, training content, and product explainers where you’re not being compared to the real speaker side-by-side.
Professional Voice Clone via ElevenLabs import makes sense if you already have a paid ElevenLabs account, or if your content puts the voice under close scrutiny — client-facing demos, spokesperson content, or anything where “close but slightly robotic” would undermine trust.
The setup is fast — under fifteen minutes if you have a decent microphone and a quiet room. The mistakes that produce bad quality are all in the recording, not in HeyGen’s interface. Here’s the complete process:
Use a USB microphone, a Bluetooth mic, or a modern smartphone. Avoid your laptop’s built-in microphone — HeyGen’s own documentation flags this specifically, and for good reason. A laptop mic captures fan noise, keyboard vibration, and room echo that the cloning model cannot fully remove.
Record in the quietest room you have. Speak naturally, not flatly — voice cloning tends to flatten tone, so if you record neutral, you get very neutral back. Include natural pauses, vary your pacing, and speak as if you’re telling someone something interesting.
The minimum is 30 seconds. One to two minutes produces noticeably better results. More than three minutes does not meaningfully improve the output.
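The sample-length thresholds above can be sketched as a quick pre-upload sanity check. This is illustrative only: the cutoffs are the numbers from this section, and the function name is our own, not anything in HeyGen’s product or API.

```python
def sample_quality_hint(duration_seconds: float) -> str:
    """Rough guidance for an Instant Voice Clone sample, per the
    thresholds above: 30 s minimum, 1-2 minutes noticeably better,
    past 3 minutes no meaningful improvement."""
    if duration_seconds < 30:
        return "too short: HeyGen needs at least 30 seconds"
    if duration_seconds < 60:
        return "usable, but 1-2 minutes produces noticeably better results"
    if duration_seconds <= 180:
        return "ideal range"
    return "fine, but length past 3 minutes won't improve the clone"

print(sample_quality_hint(90))  # a 90-second sample lands in the ideal range
```

Running your recording’s duration through a check like this before uploading saves a wasted clone attempt on a 20-second sample.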
In HeyGen’s AI Studio, go to Voice → New Voice → Create New Voice → Instant Voice Cloning. You’ll see a consent agreement confirming that your recording will be used to build a synthetic voice. Check it, then upload your audio file.
HeyGen processes the audio and generates your clone almost immediately. Listen to a test render before applying it to anything important — the first version tells you whether you need Voice Doctor.
If the first render sounds flat, slightly robotic, or off-accent, open Voice Doctor from your voice library. Describe what you want adjusted in natural language — “warmer tone,” “less robotic resonance,” “slower pacing” — and it generates improved versions without requiring a new recording.
For English content, Turbo v2 consistently produces better results than Auto in community testing. For non-English content, Multilingual v2 is the correct engine — using Turbo v2 on non-English scripts causes pronunciation errors.
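The engine rule above reduces to one branch. The sketch below is our own convention for expressing it; the language-code handling is not HeyGen’s actual behaviour, only a way to make the rule explicit.

```python
def pick_voice_engine(language_code: str) -> str:
    """Sketch of the rule above: Turbo v2 for English scripts,
    Multilingual v2 for everything else. Language-code parsing
    here is illustrative, not HeyGen's API."""
    base = language_code.lower().split("-")[0]  # "en-US" -> "en"
    return "Turbo v2" if base == "en" else "Multilingual v2"

print(pick_voice_engine("en-GB"))  # Turbo v2
print(pick_voice_engine("es"))     # Multilingual v2
```

The point of the branch: Turbo v2 on a Spanish or Mandarin script is the mistake that produces pronunciation errors, so the engine choice should follow the script’s language, not your default setting.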
In your Avatar library, click the three-dot menu → Set Primary Voice. Choose your clone. From this point, every video you generate with that avatar uses your cloned voice automatically — you never have to set it again per project.
You can also assign different voices to different script cards within the same video, which is useful if you have a multi-speaker format.
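For teams automating this, HeyGen also exposes a developer API, and the per-card voice assignment described above maps onto its video-generation payload. The sketch below shows the general shape only: treat the endpoint, field names, and header as assumptions to verify against HeyGen’s official API reference, and the IDs are placeholders.

```python
import json

# Sketch: two script "cards" in one video, each driven by a different
# cloned voice. Field names follow the general shape of HeyGen's v2
# video-generation API but should be checked against the official docs.
payload = {
    "video_inputs": [
        {
            "character": {"type": "avatar", "avatar_id": "YOUR_AVATAR_ID"},
            "voice": {
                "type": "text",
                "voice_id": "YOUR_CLONED_VOICE_ID",
                "input_text": "Welcome to the product tour.",
            },
        },
        {
            "character": {"type": "avatar", "avatar_id": "YOUR_AVATAR_ID"},
            "voice": {
                "type": "text",
                "voice_id": "SECOND_SPEAKER_VOICE_ID",
                "input_text": "And here is the second speaker's segment.",
            },
        },
    ],
    "dimension": {"width": 1280, "height": 720},
}

# The payload would then be POSTed with your API key, e.g.:
# requests.post("https://api.heygen.com/v2/video/generate",
#               headers={"X-Api-Key": "..."}, json=payload)
print(json.dumps(payload["video_inputs"][0]["voice"]["voice_id"]))
```

Nothing here is required for the point-and-click workflow; it simply shows that the "one avatar, multiple voices per video" pattern is also scriptable.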
Multiple independent testers rated HeyGen voice cloning quality at 3 out of 5 — identical to ElevenLabs at equivalent tier pricing, which makes sense given the shared engine. That rating is not a condemnation. It means the clone sounds close but not identical.
The specific quality signature: pitch and overall tone are recognisably yours. The natural variation between syllables — the micro-pacing and micro-emphasis that make human speech feel alive — is slightly flattened. On shorter scripts, most listeners don’t notice. On longer passages, the AI consistency starts to read as monotone.
For most corporate video, training content, and product explainers, Instant Clone quality is sufficient. For client-facing spokesperson content where your audience knows your voice well, the flatness becomes detectable.
The multilingual output is where HeyGen’s voice cloning produces its most impressive results. Hearing your voice speak fluent Spanish from a recording you made in English is genuinely striking — the accent characteristics carry across, not just the vocabulary. For content teams producing multilingual training or product videos at scale, this single capability often ends the evaluation.
One independent tester uploaded a clear American English voice sample and received a clone that spoke with a British accent. This is a documented, recurring issue — not an edge case. The HeyGen community troubleshooting forum has multiple active threads on accent drift.
Why it happens: when your recording sample doesn’t contain enough distinctive accent markers, the ElevenLabs model underlying HeyGen’s clone fills in gaps from training data that skews toward neutral or British-accented English. A recording that sounds American to a human listener may not give the model enough data to confidently lock the accent.
1. Open Voice Doctor from your voice library, select the affected voice, choose Enhance Voice, and describe the accent correction in natural language: “American English, not British.” The system generates corrected versions without a new recording.
2. If that doesn’t resolve it, re-record with deliberate accent emphasis. Include words that distinctly mark your regional speech and avoid words pronounced identically across accents.
3. For persistent accent issues or distinctive regional accents, switch to Professional Voice Cloning via ElevenLabs with a longer sample. HeyGen’s own documentation recommends this for users with unique accents rather than iterating on Instant Clone.
Voice cloning is not available on the free plan. The free plan gives you HeyGen’s library of 300+ stock AI voices, but you cannot create a custom clone of your own voice. This is a common source of frustration — the onboarding lets you configure your avatar before revealing that the voice you want requires paying.
Creator plan ($29/month, or $24/month annually) is the minimum tier for Instant Voice Cloning. It gives you one custom voice clone included in the plan. Additional clone slots are $29/month per slot.
If you want to import a Professional Voice Clone from ElevenLabs, you need an active ElevenLabs paid plan (from $22/month) in addition to Creator. Budget $46–53/month total for the PVC route. The two subscriptions run separately — there is no bundle pricing.
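The budget math for the PVC route works out as follows. Prices are the ones quoted in this section and will drift; the function is a sketch, not a billing calculator.

```python
def pvc_route_monthly_cost(annual_billing: bool) -> int:
    """Total monthly cost of the Professional Voice Clone route:
    HeyGen Creator plus an ElevenLabs paid plan, billed separately
    (prices as quoted in this article, in USD per month)."""
    heygen_creator = 24 if annual_billing else 29  # $24/mo annual, $29/mo monthly
    elevenlabs_paid = 22                           # entry ElevenLabs paid tier
    return heygen_creator + elevenlabs_paid

print(pvc_route_monthly_cost(annual_billing=True))   # 46
print(pvc_route_monthly_cost(annual_billing=False))  # 51
```

Both figures land inside the $46–53/month range above; the spread in the article’s upper bound reflects ElevenLabs tier options beyond the entry paid plan.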
The full breakdown of what each HeyGen plan includes — including the Premium Credits that cover Avatar IV and translation on top of voice cloning — is in the HeyGen pricing guide.
Test HeyGen’s Instant Voice Clone on your actual content type before deciding whether Professional Voice Cloning is worth the additional ElevenLabs subscription. For most structured corporate and training content, Instant Clone is sufficient. The upgrade to PVC makes sense when you’ve tested Instant Clone and found a specific quality gap it cannot resolve.
Both HeyGen and Descript offer voice cloning, and both come up in searches for AI voice tools. They are solving different problems at different points in a production workflow.
HeyGen voice cloning is built for avatar video production. Your cloned voice drives a digital presenter that speaks any language in your face and voice. The use case is scaling video content without camera time — training videos, product explainers, multilingual marketing. The voice and the avatar are the same feature.
Descript Overdub is built for correcting recorded audio without re-recording. You’ve already filmed, you said the wrong date, and you want to fix it without going back to the microphone. Overdub generates your corrected voice in context. It’s a post-production correction tool, not a video generation tool.
HeyGen creates video from scratch using a voice clone. Descript uses a voice clone to fix existing video. If you haven’t filmed anything yet and want avatar-led content at scale, HeyGen. If you film yourself and want a correction layer, Descript.
The one scenario where they genuinely compete: a creator building a faceless YouTube channel who is deciding whether to appear on camera with Descript-corrected audio or use a HeyGen avatar with a cloned voice instead. That decision comes down to whether your audience expects a human face — not which tool is better.
The Descript review covers Overdub’s quality and limitations in detail — including the vocabulary cap on Creator plan that produces nonsense audio for unrecognised words, which is the voice cloning failure mode Descript users hit most often.
You recorded your voice once. From that point, HeyGen’s avatar speaks every script you type — in English, Spanish, Mandarin, or all three — in your voice, with your face, without you ever sitting in front of a camera again. That’s the actual product. The free plan is enough to verify the avatar quality before you pay for any of it.
Voice cloning requires Creator plan ($29/mo). Free plan includes 300+ stock AI voices and 3 watermarked videos per month. Start there — upgrade only when you’ve confirmed the quality works for your content.