Text-to-speech · TTS+CLONE
Chatterbox
The headline TTS — premium quality, emotion control, zero-shot voice cloning. MIT-licensed.
Chatterbox is the open-source TTS model from Resemble AI. In side-by-side blind tests it is consistently preferred over ElevenLabs (~63–65% of listeners pick Chatterbox). It is MIT-licensed — so unlike most premium voice models, it is genuinely free to use commercially.
Chatterbox is our headline TTS model and the foundation of our voice-cloning API.
Voice cloning
The /v1/audio/voice-clones endpoint takes a 5–10 second reference sample (your own voice or any voice you have a license to use), validates it, and returns a voice_id that you can pass to /v1/audio/speech like any other voice.
curl https://api.scalabs.cloud/v1/audio/voice-clones \
-H "Authorization: Bearer $SCL_KEY" \
-F "name=Anuj reading book chapters" \
-F "sample=@./anuj-10sec.wav" \
-F "consent_signed_by=Anuj Sharma" \
-F "consent_date=2026-05-22"
# → { "voice_id": "vc_01H...", "status": "ready" }
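Since the endpoint expects a 5–10 second sample, it can be worth checking a file's length locally before uploading. The sketch below does this for uncompressed PCM WAV files using Python's standard wave module. The 5 and 10 second bounds are read off the guidance above; the server's actual validation rules are not specified here, and the function names are illustrative rather than part of the API.

```python
import io
import wave

# Assumed bounds, taken from the "5-10 second reference sample" guidance.
# The server-side validation rules may differ.
MIN_SECONDS = 5.0
MAX_SECONDS = 10.0


def wav_duration_seconds(source) -> float:
    """Return the duration of a PCM WAV (path string or binary file object)."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_valid_reference_sample(source) -> bool:
    """True if the sample length falls inside the assumed 5-10 second window."""
    return MIN_SECONDS <= wav_duration_seconds(source) <= MAX_SECONDS


def make_silent_wav(seconds: float, rate: int = 8000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence (demonstration only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buf.seek(0)
    return buf


print(is_valid_reference_sample(make_silent_wav(8)))   # True
print(is_valid_reference_sample(make_silent_wav(30)))  # False
```

The wave module only reads uncompressed WAV, so samples in other formats would need a different length check.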
Then pass the returned voice_id to /v1/audio/speech:
curl https://api.scalabs.cloud/v1/audio/speech \
-H "Authorization: Bearer $SCL_KEY" \
-H "Content-Type: application/json" \
-d '{ "model": "chatterbox", "voice": "vc_01H...", "input": "Hello world." }' \
> out.mp3
The model has built-in PerTh neural watermarking: every output carries an imperceptible signal that identifies it as machine-generated. Because generated audio can be identified after the fact, the API can be exposed publicly with a much lower risk of enabling deepfake abuse.
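For callers not using curl, the /v1/audio/speech request shown earlier can be sketched in Python with only the standard library. The URL, headers, and JSON fields mirror the curl example. The function names are hypothetical, and treating the response body as raw audio bytes is an assumption based on the curl output being redirected to out.mp3.

```python
import json
import urllib.request

SPEECH_URL = "https://api.scalabs.cloud/v1/audio/speech"


def build_speech_request(api_key: str, voice_id: str, text: str) -> urllib.request.Request:
    """Build (but do not send) the POST request from the curl example."""
    body = json.dumps({"model": "chatterbox", "voice": voice_id, "input": text})
    return urllib.request.Request(
        SPEECH_URL,
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(api_key: str, voice_id: str, text: str,
                       out_path: str, timeout: float = 60.0) -> None:
    """Send the request and write the response body to out_path.

    Assumes the response body is the audio itself; urlopen raises
    urllib.error.HTTPError on a non-2xx status.
    """
    request = build_speech_request(api_key, voice_id, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        with open(out_path, "wb") as out:
            out.write(response.read())
```

A call would then look like synthesize_to_file(key, voice_id, "Hello world.", "out.mp3"), with the key read from the SCL_KEY environment variable and voice_id being the value returned by the voice-clones endpoint. Keeping request construction separate from sending makes the payload easy to inspect or unit-test without network access.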
Consent and abuse posture
Voice cloning is gated by a written consent attestation at clone-creation time. We do not let tenants create voices from samples of celebrities, public figures, or anyone whose consent they cannot demonstrate. Clones that violate this policy are removed; the tenant is warned on the first violation and suspended on a repeat.
When to pick it
- Voice agents and IVR systems where the voice needs to feel natural
- Audiobook and long-form content generation with a consistent narrator
- Voice cloning for brand voices (with proper consent)
- Anywhere you would otherwise reach for ElevenLabs but want open-weight, commercial-friendly licensing and on-shore hosting
Pricing
- EUR 18 per million characters for TTS generation. Roughly half the price of Cartesia Sonic 3 ($35/M) and roughly a tenth or less of the ElevenLabs premium tier ($180–$300/M depending on plan), allowing for EUR/USD conversion.
- Voice clone creation: free. 5 active clones per tenant included; more on request.
- Voice clone storage: free.
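As a quick way to budget a job, the published rate can be turned into a small estimator. This is a sketch only: it assumes cost scales linearly with character count at EUR 18 per million, and it rounds to the cent for display. How characters are actually counted and rounded for billing is not stated here, and the function name is illustrative.

```python
from decimal import Decimal

# Published TTS generation rate from the pricing list.
EUR_PER_MILLION_CHARS = Decimal("18")


def estimate_cost_eur(num_chars: int) -> Decimal:
    """Estimated generation cost in EUR, rounded to the cent (assumed linear)."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    cost = EUR_PER_MILLION_CHARS * Decimal(num_chars) / Decimal(1_000_000)
    return cost.quantize(Decimal("0.01"))


print(estimate_cost_eur(500_000))  # 9.00
print(estimate_cost_eur(50_000))   # 0.90
```

Under these assumptions, a 500,000-character audiobook comes to about EUR 9, and a single maximum-length 50,000-character request to about EUR 0.90.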
Limits
- Per-tenant concurrency limit: 50 concurrent generations
- Per-output limit: 50,000 characters (chunk longer text yourself)
- Output formats: mp3, wav, flac, opus
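Because text over 50,000 characters has to be chunked by the caller, a client-side helper along the following lines may be useful. It is a minimal sketch, not a prescribed method: it splits on sentence-ending punctuation, falls back to a hard split for a single oversized sentence, and caps the worker count at the 50-generation concurrency limit. The synthesize argument is a hypothetical callable that turns one chunk into audio.

```python
import re
from concurrent.futures import ThreadPoolExecutor

MAX_CHARS = 50_000     # per-output character limit from the list above
MAX_CONCURRENCY = 50   # per-tenant concurrent generations from the list above


def chunk_text(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters.

    Prefers sentence boundaries. Whitespace between sentences is
    normalized to a single space, so paragraph breaks are not preserved.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Split on whitespace that follows ., ! or ? (punctuation stays attached).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # Hard-split a sentence that is itself longer than the limit.
        while len(sentence) > limit:
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks


def synthesize_all(chunks, synthesize, workers: int = 8) -> list:
    """Apply `synthesize` to each chunk, returning results in input order."""
    workers = max(1, min(workers, MAX_CONCURRENCY))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(synthesize, chunks))


print(chunk_text("One. Two. Three.", limit=10))  # ['One. Two.', 'Three.']
```

The worker cap only bounds concurrency within one process. If several processes share a tenant, their combined in-flight requests still need to stay within the limit.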
Other TTS in our catalog and on our radar
- Qwen3-TTS — also hosted on our GPUs. Pick it when 3-second clones or sub-100 ms first-byte latency matter more than the wider language coverage / built-in watermark Chatterbox gives you.
- Voicebox.sh — Jamie Pine’s local desktop app (MIT). Bundles Qwen3-TTS as its primary engine plus six others; great for laptop-side prototyping. Our hosted Qwen3-TTS is the server-side path.
- Voxtral 4B (Mistral) — strong open-weight TTS, but CC-BY-NC 4.0 forces a separate commercial agreement with Mistral. Not hosted yet; talking to Mistral about terms.
- XTTS v2 — clones a voice from a 6-second sample, but the Coqui Public Model License restricts commercial hosting. Not viable as a public API.
Best for
- Voice cloning APIs — clone a target voice from a 5–10 second sample
- Voice agents and IVR that need natural prosody and emotion
- Audiobook-scale text-to-speech with consistent voice
- Multilingual voice output across 23 languages
Upstream source: github.com/resemble-ai/chatterbox
Request Chatterbox access
Get an API key.
Straight pay-per-use against the published rate. No deposit, no minimums. Tell us what you're building and we'll send your API key and endpoint URL within one working day.