Text-to-speech · TTS+CLONE

Chatterbox

The headline TTS — premium quality, emotion control, zero-shot voice cloning. MIT-licensed.


Chatterbox is the open-source TTS model from Resemble AI. In blind side-by-side listening tests reported by Resemble AI, roughly 63–65% of listeners preferred Chatterbox over ElevenLabs. It is MIT-licensed, so unlike most premium voice models it is free to use commercially.

Chatterbox is our headline TTS model and the foundation of our voice-cloning API.

Voice cloning

The /v1/audio/voice-clones endpoint takes a 5–10 second reference sample (your own voice or any voice you have a license to use), validates it, and returns a voice_id you can then pass to /v1/audio/speech like any other voice.

curl https://api.scalabs.cloud/v1/audio/voice-clones \
  -H "Authorization: Bearer $SCL_KEY" \
  -F "name=Anuj reading book chapters" \
  -F "sample=@./anuj-10sec.wav" \
  -F "consent_signed_by=Anuj Sharma" \
  -F "consent_date=2026-05-22"

# → { "voice_id": "vc_01H...", "status": "ready" }

Then to use it:

curl https://api.scalabs.cloud/v1/audio/speech \
  -H "Authorization: Bearer $SCL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "model": "chatterbox", "voice": "vc_01H...", "input": "Hello world." }' \
  > out.mp3

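For callers who prefer Python over curl, the speech request above can be sketched with the standard library alone. The endpoint, headers, and JSON fields are copied from the curl example; the helper name `build_speech_request` is hypothetical and not part of any SDK. The sketch only builds the request, so it can be inspected before anything is sent.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
SPEECH_URL = "https://api.scalabs.cloud/v1/audio/speech"


def build_speech_request(api_key: str, voice_id: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl example.

    Hypothetical helper for illustration; the JSON fields are the ones
    shown in the curl call (model, voice, input).
    """
    body = json.dumps({"model": "chatterbox", "voice": voice_id, "input": text})
    return urllib.request.Request(
        SPEECH_URL,
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends it; the response body is the audio, which can be written to a file such as out.mp3.
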
The model has built-in PerTh neural watermarking: every output carries an imperceptible signal that identifies it as machine-generated. This keeps generated audio traceable, which lowers the risk that exposing the API enables deepfake abuse.

Voice cloning is gated by a written consent attestation at clone-creation time. We do not let tenants create voices from samples of celebrities, public figures, or anyone whose consent they cannot demonstrate. Clones that violate this are removed; the tenant is warned, then suspended on a repeat violation.
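
Because the endpoint validates the sample and requires a consent attestation, checking both locally first can save a rejected upload. The sketch below assumes the sample is an uncompressed WAV file and that the 5–10 second window and the two consent fields from the curl example are what gets checked; the function name `preflight_clone_request` is hypothetical, and the server's actual validation may be stricter.

```python
import datetime
import wave


def preflight_clone_request(sample, consent_signed_by: str, consent_date: str,
                            min_seconds: float = 5.0, max_seconds: float = 10.0) -> float:
    """Check a WAV sample and consent fields locally; return duration in seconds.

    `sample` is a path string or a binary file object. A failed check raises
    ValueError; a file that is not valid WAV raises wave.Error.
    The 5-10 second window is an assumption taken from the text above.
    """
    if not consent_signed_by.strip():
        raise ValueError("consent_signed_by must not be empty")
    # strptime raises ValueError if the date does not parse as year-month-day.
    datetime.datetime.strptime(consent_date, "%Y-%m-%d")

    with wave.open(sample, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    if not min_seconds <= duration <= max_seconds:
        raise ValueError(
            f"sample is {duration:.1f}s; expected {min_seconds}-{max_seconds}s"
        )
    return duration
```

Samples in other formats (mp3, flac) would need a different duration check; the standard-library `wave` module only reads WAV.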

When to pick it

Voice agents and IVR systems where the voice needs to feel natural
Audiobook and long-form content generation with a consistent narrator
Voice cloning for brand voices (with proper consent)
Anywhere you would otherwise reach for ElevenLabs but want open weights, commercial-friendly licensing, and on-shore hosting

Pricing

EUR 18 per million characters for TTS generation. That is about half the price of Cartesia Sonic 3 ($35/M) and roughly a tenth or less of the ElevenLabs premium tier ($180–$300/M depending on plan).
Voice clone creation: free. 5 active clones per tenant included; more on request.
Voice clone storage: free.
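
To budget a job, the per-character rate above translates directly into a small calculation. This is a minimal sketch assuming billing is strictly proportional to character count; the function name is hypothetical, and how the invoice actually rounds is not stated here.

```python
from decimal import Decimal

# Rate quoted in the pricing section: EUR 18 per million characters.
EUR_PER_MILLION_CHARS = Decimal("18")


def estimate_tts_cost_eur(num_chars: int) -> Decimal:
    """Return the estimated generation cost in EUR, rounded to cents.

    Assumes linear per-character billing; actual invoice rounding may differ.
    """
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    cost = EUR_PER_MILLION_CHARS * num_chars / Decimal(1_000_000)
    return cost.quantize(Decimal("0.01"))
```

For example, a 500,000-character manuscript comes to an estimated EUR 9.00.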

Limits

Per-tenant concurrency limit: 50 generations at a time
Per-output limit: 50,000 characters (chunk longer text yourself)
Output formats: mp3, wav, flac, opus
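
Since text longer than 50,000 characters has to be chunked on the client side, one possible approach is to split at sentence boundaries so that no chunk ends mid-sentence. The sketch below is an illustration only: the function name `chunk_text` is hypothetical, and the sentence detection is a rough punctuation heuristic.

```python
import re

MAX_CHARS = 50_000  # per-output limit from the list above


def chunk_text(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters.

    Breaks at sentence boundaries where possible; a single sentence longer
    than `limit` is hard-split. Whitespace between sentences is collapsed
    to one space, so paragraph breaks are not preserved.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Split after ., !, or ? followed by whitespace (a rough heuristic).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # Hard-split any sentence that cannot fit in a chunk on its own.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own /v1/audio/speech request, keeping the number of requests in flight within the limit of 50 concurrent generations.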

Other TTS in our catalog and on our radar

Qwen3-TTS — also hosted on our GPUs. Pick it when 3-second clones or sub-100 ms first-byte latency matter more than the wider language coverage / built-in watermark Chatterbox gives you.
Voicebox.sh — Jamie Pine’s local desktop app (MIT). Bundles Qwen3-TTS as its primary engine plus six others; great for laptop-side prototyping. Our hosted Qwen3-TTS is the server-side path.
Voxtral 4B (Mistral) — CC-BY-NC 4.0 currently blocks commercial public-API hosting.
XTTS v2 — voice cloning in 6 seconds, but Coqui Public Model License restricts commercial hosting. Not viable as a public API.

Best for

Voice cloning APIs — clone a target voice from a 5–10 second sample
Voice agents and IVR that need natural prosody and emotion
Audiobook-scale text-to-speech with consistent voice
Multilingual voice output across 23 languages

Upstream source: github.com/resemble-ai/chatterbox
