Text-to-speech · TTS+CLONE

Qwen3-TTS

Alibaba's open-weight voice cloning model. 3-second clones, 97 ms latency, Apache 2.0.

Qwen3-TTS is Alibaba’s open-weight TTS family, released under Apache 2.0 in January 2026. It is designed around two specific advantages over the rest of the open-source TTS field: very short clone samples (3 seconds is enough) and very low first-byte latency (97 ms). It is the primary engine behind the Voicebox.sh local desktop app, and we host it server-side for the same reason: when latency or clone speed matters more than language breadth, this is what you reach for.

Voice cloning

Cloning uses the same /v1/audio/voice-clones endpoint as Chatterbox; the difference is the model parameter you pass to /v1/audio/speech. Qwen3-TTS accepts reference samples as short as 3 seconds, which helps when you only have a brief clip to work from. Cloned voices benchmark at ~0.79 fidelity, ahead of ElevenLabs (~0.75) and MiniMax (~0.72) on equivalent tests.

curl https://api.scalabs.cloud/v1/audio/speech \
  -H "Authorization: Bearer $SCL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "model": "qwen3-tts", "voice": "vc_01H...", "input": "Hello world." }' \
  > out.mp3
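
For callers who prefer Python to curl, the same request can be sketched with the standard library alone. This is a minimal illustration, not an official client: the function names `build_speech_request` and `synthesize` are hypothetical, and it assumes (as the curl redirect to out.mp3 suggests) that the response body is the raw audio.

```python
import json
import os
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://api.scalabs.cloud/v1/audio/speech"


def build_speech_request(text, voice_id, api_key, model="qwen3-tts"):
    """Build (but do not send) the POST request for /v1/audio/speech."""
    payload = {"model": model, "voice": voice_id, "input": text}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(text, voice_id, out_path="out.mp3"):
    """Send the request and write the returned bytes to out_path.

    Assumes the API key is in the SCL_KEY environment variable and that
    the response body is the audio file itself.
    """
    req = build_speech_request(text, voice_id, os.environ["SCL_KEY"])
    with urllib.request.urlopen(req, timeout=60) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
    return out_path
```

Pass the ID of a voice you created through /v1/audio/voice-clones as `voice_id`. Keeping request construction separate from sending makes the payload easy to inspect before any network call is made.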

Chatterbox or Qwen3-TTS — how to choose

Need | Pick
Lowest latency, real-time streaming | Qwen3-TTS (97 ms first byte)
Shortest clone sample | Qwen3-TTS (3 seconds)
Widest language coverage | Chatterbox (23 vs 10 languages)
Built-in watermark for abuse-safety | Chatterbox (PerTh)
Audiobook-quality narration | Chatterbox (slight edge on long-form prosody)
Already prototyping with Voicebox.sh locally | Qwen3-TTS (same engine, server-scaled)

In practice we recommend Chatterbox as the default and Qwen3-TTS when the latency or 3-second-clone advantage matters specifically.
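
If you want to encode that guidance in application code, one possible sketch follows. The helper `pick_tts_model` and the model identifier "chatterbox" are assumptions for illustration (only "qwen3-tts" appears in the request example); the language set is the one listed under Limits.

```python
# Languages Qwen3-TTS supports, per the Limits section of this page.
QWEN3_TTS_LANGUAGES = {
    "chinese", "english", "japanese", "korean", "german",
    "french", "russian", "portuguese", "spanish", "italian",
}


def pick_tts_model(*, low_latency=False, short_clone_sample=False,
                   needs_watermark=False, language=None):
    """Return a model name following the comparison table.

    "chatterbox" is an assumed identifier; check the catalog for the real one.
    """
    # Hard requirements that only Chatterbox meets come first.
    if needs_watermark:
        return "chatterbox"
    if language is not None and language.lower() not in QWEN3_TTS_LANGUAGES:
        # Chatterbox has wider coverage, but the caller should still
        # confirm that the language is actually supported there.
        return "chatterbox"
    # Qwen3-TTS only when one of its specific advantages matters.
    if low_latency or short_clone_sample:
        return "qwen3-tts"
    # Chatterbox is the recommended default.
    return "chatterbox"
```

The ordering is a design choice: requirements that rule Qwen3-TTS out are checked before preferences that favor it, so a latency-sensitive request that also needs a watermark still lands on Chatterbox.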

Pricing

EUR 5 per million characters. It sits between Kokoro (EUR 1) and Chatterbox (EUR 8): premium quality and low latency at a mid-catalog price.
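
To budget a job, the per-character arithmetic can be sketched as below. The helper `estimate_cost_eur` and the dictionary keys "kokoro" and "chatterbox" are hypothetical names, and the sketch assumes billing is strictly proportional to character count, with no minimum charge or rounding rule.

```python
from decimal import Decimal

# List prices from this page, in EUR per million characters.
PRICE_EUR_PER_MILLION_CHARS = {
    "kokoro": Decimal("1"),
    "qwen3-tts": Decimal("5"),
    "chatterbox": Decimal("8"),
}


def estimate_cost_eur(num_chars, model="qwen3-tts"):
    """Estimate the cost of synthesizing num_chars characters.

    Assumes purely proportional billing; actual invoices may differ.
    """
    price = PRICE_EUR_PER_MILLION_CHARS[model]
    return price * Decimal(num_chars) / Decimal(1_000_000)
```

Under these assumptions, a 200,000-character script costs EUR 1 on Qwen3-TTS. `Decimal` is used rather than floats so that small per-request amounts add up without binary rounding drift.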

Limits

  • Per-tenant concurrency limit: 50 concurrent generations
  • Per-output limit: 50,000 characters
  • Languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
  • Output formats: mp3, wav, flac, opus
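
A client can stay inside the character and concurrency limits above by splitting long input and capping parallel calls. The following is a rough sketch under stated assumptions: `split_text` and `run_bounded` are hypothetical helpers, and `fn` stands in for whatever function sends a single chunk to the API.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CHARS_PER_REQUEST = 50_000  # per-output limit from the list above
MAX_CONCURRENT = 50             # per-tenant concurrent generations


def split_text(text, limit=MAX_CHARS_PER_REQUEST):
    """Split text into chunks of at most `limit` characters.

    Cuts at the last space that fits; falls back to a hard cut when a
    chunk contains no space at all.
    """
    chunks = []
    rest = text
    while len(rest) > limit:
        cut = rest.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no usable space: hard cut at the limit
        chunks.append(rest[:cut])
        rest = rest[cut:].lstrip(" ")
    if rest:
        chunks.append(rest)
    return chunks


def run_bounded(fn, items, max_workers=MAX_CONCURRENT):
    """Apply fn to each item with at most max_workers calls in flight.

    Results come back in input order, so audio chunks can be joined
    in sequence afterwards.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, items))
```

Because the concurrency limit is per tenant, several processes sharing one key should divide the 50 slots between them rather than each claiming all of them. Splitting on spaces is the simplest choice; cutting at sentence boundaries would likely give smoother prosody across chunk joins.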

Pairs well with Voicebox.sh

If you’re already using Voicebox.sh, the MIT-licensed local desktop app that bundles Qwen3-TTS as its primary engine, you can keep that workflow for laptop-side prototyping and point your production code at our hosted API for server-side scale. The underlying engine is the same; one copy runs on your laptop, the other on our Kathmandu GPUs.

Best for

  • Voice cloning from very short reference samples (3 seconds)
  • Latency-sensitive voice agents and IVR (sub-100 ms first-byte)
  • Real-time streaming voice replies
  • Multilingual voice across Chinese, English, Japanese, Korean + 6 European languages

Upstream source: github.com/QwenLM/Qwen3-TTS