Qwen3-TTS
Alibaba's open-weight voice cloning model. 3-second clones, 97 ms latency, Apache 2.0.
Qwen3-TTS is Alibaba’s open-weight TTS family — released Apache 2.0 in January 2026 — designed around two specific advantages over the rest of the open-source TTS field: extremely short clone samples (3 seconds is enough) and extremely low first-byte latency (97 ms). It is the primary engine behind the Voicebox.sh local desktop app, and we host it server-side for the same reason: when latency or clone-speed matters more than language breadth, this is what you reach for.
Voice cloning
Cloning uses the same /v1/audio/voice-clones endpoint as Chatterbox; the only
difference is the model parameter you pass to /v1/audio/speech. Qwen3-TTS accepts
samples as short as 3 seconds, which is useful when you only have a brief reference
clip to work from. Cloned voices benchmark at ~0.79 fidelity, ahead of
ElevenLabs (~0.75) and MiniMax (~0.72) on equivalent tests.
```shell
curl https://api.scalabs.cloud/v1/audio/speech \
  -H "Authorization: Bearer $SCL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "model": "qwen3-tts", "voice": "vc_01H...", "input": "Hello world." }' \
  > out.mp3
```
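Inline JSON like the request body above stops being valid as soon as the input text contains a double quote or a backslash. The sketch below builds the same three-field body from arbitrary single-line text. The helper names `json_escape` and `build_speech_payload` and the `VOICE_ID` variable are illustrative and not part of the API, and the escaping deliberately covers only backslashes and double quotes, so text containing newlines or other control characters needs a real JSON encoder instead.

```shell
# Hypothetical helpers (not part of the API): build the JSON body for
# /v1/audio/speech from arbitrary single-line text.

# Escape backslashes first, then double quotes, for use inside a JSON string.
# Assumption: the text contains no newlines or other control characters.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# $1 = cloned voice ID, $2 = text to speak
build_speech_payload() {
  printf '{ "model": "qwen3-tts", "voice": "%s", "input": "%s" }\n' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}

# VOICE_ID is a placeholder for a voice ID you already hold.
VOICE_ID="${VOICE_ID:-your-voice-id}"
build_speech_payload "$VOICE_ID" 'She said "hello".'
```

The generated body can replace the inline JSON in the request, for example `-d "$(build_speech_payload "$VOICE_ID" "$TEXT")"`.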
Chatterbox or Qwen3-TTS — how to choose
| Need | Pick |
|---|---|
| Lowest latency, real-time streaming | Qwen3-TTS (97 ms first byte) |
| Shortest clone sample | Qwen3-TTS (3 seconds) |
| Widest language coverage | Chatterbox (23 vs 10 languages) |
| Built-in watermark for abuse-safety | Chatterbox (PerTh) |
| Audiobook-quality narration | Chatterbox (slight edge on long-form prosody) |
| Already prototyping with Voicebox.sh locally | Qwen3-TTS (same engine, server-scaled) |
In practice, we recommend Chatterbox as the default, and Qwen3-TTS when low latency or 3-second cloning is the deciding factor.
Pricing
EUR 5 per million characters. That sits between Kokoro (EUR 1) and Chatterbox (EUR 8): premium quality and low latency at a mid-catalog price.
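As a worked example of the rate, the sketch below estimates what a given amount of text would cost. It assumes billing is strictly linear in the character count, with no rounding rules or per-request overhead; how characters are actually counted (whitespace, markup) is not stated here. `estimate_cost_eur` is a hypothetical helper name.

```shell
# Estimate the cost in EUR of synthesizing N characters at the published
# rate of EUR 5 per million characters.
# Assumption: cost is linear in the character count, with no rounding rules.
estimate_cost_eur() {
  # LC_ALL=C keeps the decimal separator a dot regardless of locale.
  LC_ALL=C awk -v n="$1" 'BEGIN { printf "%.2f\n", n * 5 / 1000000 }'
}

estimate_cost_eur 250000   # a 250,000-character manuscript → 1.25
```

To price a file rather than a number, pass its character count, for example `estimate_cost_eur "$(wc -m < script.txt)"`.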
Limits
- Per-tenant rate limit: 50 concurrent generations
- Per-output limit: 50,000 characters
- Languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
- Output formats: mp3, wav, flac, opus
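The character and format limits above can be checked on the client before a request is sent, which avoids spending a round trip on a request that would be rejected. This is a minimal sketch: the function names are hypothetical, and it assumes the 50,000-character limit applies to the text submitted for a single output.

```shell
# Hypothetical client-side checks against the limits listed above.
MAX_CHARS=50000

# Succeeds if the text is within the 50,000-character per-output limit.
# Note: ${#1} counts characters in bash under a UTF-8 locale; some shells
# (e.g. dash) count bytes, which overcounts non-ASCII text.
fits_output_limit() {
  [ "${#1}" -le "$MAX_CHARS" ]
}

# Succeeds if the format is one of the listed output formats.
is_supported_format() {
  case "$1" in
    mp3|wav|flac|opus) return 0 ;;
    *) return 1 ;;
  esac
}

text="Hello world."
if fits_output_limit "$text" && is_supported_format "mp3"; then
  echo "ok to send"
else
  echo "split the text or pick a supported format" >&2
fi
```

Text that exceeds the limit has to be split into several requests; splitting at sentence or paragraph boundaries keeps the prosody of each piece intact.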
Pairs well with Voicebox.sh
If you’re already using Voicebox.sh — the MIT-licensed local desktop app that bundles Qwen3-TTS as its primary engine — you can keep that local workflow for laptop-side prototyping and point your production code at our hosted API for server-side scale. Same underlying engine; one runs locally on your laptop, the other runs on our Kathmandu GPUs.
Best for
- Voice cloning from very short reference samples (3 seconds)
- Latency-sensitive voice agents and IVR (sub-100 ms first-byte)
- Real-time streaming voice replies
- Multilingual voice across Chinese, English, Japanese, Korean + 6 European languages
Upstream source: github.com/QwenLM/Qwen3-TTS
Request Qwen3-TTS access
Get an API key.
Straight pay-per-use against the published rate. No deposit, no minimums. Tell us what you're building and we'll send your API key and endpoint URL within one working day.