Text-to-speech · TTS+CLONE

Qwen3-TTS

Alibaba's open-weight voice cloning model. 3-second clones, 97 ms latency, Apache 2.0.


Qwen3-TTS is Alibaba’s open-weight TTS family, released under Apache 2.0 in January 2026. It is designed around two specific advantages over the rest of the open-source TTS field: very short clone samples (3 seconds is enough) and very low first-byte latency (97 ms). It is the primary engine behind the Voicebox.sh local desktop app, and we host it server-side for the same reason: when latency or clone speed matters more than language breadth, this is the model to reach for.

Voice cloning

Cloning uses the same /v1/audio/voice-clones endpoint as Chatterbox; the only difference is the model parameter you pass to /v1/audio/speech. Qwen3-TTS accepts reference samples as short as 3 seconds, which is useful when you only have a brief clip to work from. Cloned voices score ~0.79 on fidelity benchmarks, ahead of ElevenLabs (~0.75) and MiniMax (~0.72) on equivalent tests.

curl https://api.scalabs.cloud/v1/audio/speech \
  -H "Authorization: Bearer $SCL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "model": "qwen3-tts", "voice": "vc_01H...", "input": "Hello world." }' \
  > out.mp3
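
For callers who prefer Python over curl, the same request can be sketched with the standard library alone. This is a minimal sketch that mirrors the curl call above; the function names are illustrative, and it assumes the endpoint returns raw audio bytes in the response body (which is what redirecting curl's output to out.mp3 implies).

```python
import json
import os
import urllib.request

API_URL = "https://api.scalabs.cloud/v1/audio/speech"  # taken from the curl example


def build_speech_request(text, voice_id, api_key, model="qwen3-tts"):
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {"model": model, "voice": voice_id, "input": text}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(text, voice_id, out_path="out.mp3"):
    """Send the request and write the response body to out_path.

    Assumes the body is the audio file itself; adjust if the API wraps it.
    """
    req = build_speech_request(text, voice_id, os.environ["SCL_KEY"])
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urllib.request.urlopen(req, timeout=60) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
    return out_path
```

Pass the id of a voice you created through /v1/audio/voice-clones as voice_id. Keeping request construction separate from sending makes the payload easy to inspect or unit-test without touching the network.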

Chatterbox or Qwen3-TTS — how to choose

Need	Pick
Lowest latency, real-time streaming	Qwen3-TTS (97 ms first byte)
Shortest clone sample	Qwen3-TTS (3 seconds)
Widest language coverage	Chatterbox (23 vs 10 languages)
Built-in watermark for abuse-safety	Chatterbox (PerTh)
Audiobook-quality narration	Chatterbox (slight edge on long-form prosody)
Already prototyping with Voicebox.sh locally	Qwen3-TTS (same engine, server-scaled)

In practice we recommend Chatterbox as the default and Qwen3-TTS when the latency or 3-second-clone advantage matters specifically.
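
If the choice has to be made in code, for example per request in a router, the table above can be encoded as a small helper. This is only a sketch: the function and its parameters are hypothetical, "chatterbox" is an assumed model id (only "qwen3-tts" appears in the request example), and hard requirements are checked before preferences.

```python
QWEN = "qwen3-tts"          # model id from the request example
CHATTERBOX = "chatterbox"   # assumed model id, not confirmed on this page

# The ten languages listed under Limits for Qwen3-TTS.
QWEN_LANGUAGES = {
    "chinese", "english", "japanese", "korean", "german",
    "french", "russian", "portuguese", "spanish", "italian",
}


def pick_model(*, realtime=False, short_clone_sample=False,
               needs_watermark=False, language=None):
    """Return a model id following the comparison table.

    Hard requirements (watermark, language coverage) win over
    preferences (latency, short clone sample).
    """
    if needs_watermark:
        return CHATTERBOX  # only Chatterbox has the built-in PerTh watermark
    if language is not None and language.lower() not in QWEN_LANGUAGES:
        # Chatterbox covers more languages; its list is not given here,
        # so the caller still has to confirm the language is supported.
        return CHATTERBOX
    if realtime or short_clone_sample:
        return QWEN
    return CHATTERBOX  # recommended default
```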

Pricing

EUR 5 per million characters. That sits between Chatterbox (EUR 8) and Kokoro (EUR 1): premium quality and low latency at a mid-catalog price.
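
To turn the per-million rate into a budget figure, a rough estimate can be computed from the character count. The sketch below assumes billing is strictly linear in characters, with no minimum charge or rounding, and that Python's len() matches how characters are counted for billing; neither is confirmed here. The dictionary keys are labels for illustration, not confirmed model ids.

```python
from decimal import Decimal

# Prices in EUR per million characters, as quoted above.
PRICE_EUR_PER_MILLION = {
    "qwen3-tts": Decimal("5"),
    "chatterbox": Decimal("8"),
    "kokoro": Decimal("1"),
}


def estimate_cost_eur(texts, model="qwen3-tts"):
    """Estimate the cost of synthesizing all strings in texts."""
    total_chars = sum(len(t) for t in texts)
    return total_chars * PRICE_EUR_PER_MILLION[model] / Decimal(1_000_000)
```

Under these assumptions, a single request at the 50,000-character output limit comes to 50,000 × 5 / 1,000,000 = EUR 0.25 on Qwen3-TTS.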

Limits

Per-tenant rate limit: 50 concurrent generations
Per-output limit: 50,000 characters
Languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
Output formats: mp3, wav, flac, opus
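
Text longer than the 50,000-character limit has to be split client-side before it is sent. The helper below is one possible approach, not part of the API: it cuts each chunk after the last sentence terminator that fits, and falls back to a hard cut when a window contains none. The function name and the choice of terminators are assumptions.

```python
import re

MAX_CHARS = 50_000  # per-output limit listed above

# Sentence terminators (Latin and CJK) plus any trailing whitespace.
_SENTENCE_END = re.compile(r"[.!?。！？]\s*")


def split_for_tts(text, max_chars=MAX_CHARS):
    """Split text into chunks of at most max_chars characters.

    No characters are dropped: "".join(chunks) == text.
    """
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to cut right after the last sentence end in the window.
            ends = [m.end() for m in _SENTENCE_END.finditer(text, start, end)]
            if ends:
                end = ends[-1]
        chunks.append(text[start:end])
        start = end
    return chunks
```

Each chunk can then be sent as its own /v1/audio/speech request. When sending chunks in parallel, keep the number of in-flight requests at or below the 50 concurrent generations allowed per tenant.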

Pairs well with Voicebox.sh

If you’re already using Voicebox.sh — the MIT-licensed local desktop app that bundles Qwen3-TTS as its primary engine — you can keep that local workflow for laptop-side prototyping and point your production code at our hosted API for server-side scale. Same underlying engine; one runs locally on your laptop, the other runs on our GPUs.

Best for

Voice cloning from very short reference samples (3 seconds)
Latency-sensitive voice agents and IVR (sub-100 ms first-byte)
Real-time streaming voice replies
Multilingual voice across Chinese, English, Japanese, Korean + 6 European languages

Upstream source: github.com/QwenLM/Qwen3-TTS
