Text-to-speech · TTS

Kokoro

Small, fast TTS — built for low-latency agent loops.

Kokoro is a small (82M-parameter) TTS model, released under Apache 2.0, that delivers fast first-byte latency for its size. It is the right choice when you want voice output that streams quickly and cheaply, and you can trade some prosody for speed.

When to pick Kokoro

Latency budget is tight (chat-style streaming, IVR call legs)
Volume is high enough that the per-character cost difference matters
You do not need voice cloning or emotion control
English is the dominant language in your workload

When to pick something else

Voice cloning → Chatterbox or Qwen3-TTS
Sub-100 ms first-byte with cloning → Qwen3-TTS
Audiobook-quality narration → Chatterbox
Languages Kokoro does not cover → Piper or Chatterbox

Pricing

EUR 1 per million characters, which is one-eighth of the Chatterbox price.
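To make the pricing concrete, here is a minimal cost sketch in Python. The Kokoro rate comes from this page. The Chatterbox rate is not quoted here and is only inferred from the 8× ratio, so treat it as an assumption. The sketch also assumes strictly linear billing with no rounding, minimum charge, or free tier, and the monthly volume is a made-up example.

```python
from decimal import Decimal

# Rate stated on this page: EUR 1 per million characters.
KOKORO_EUR_PER_MILLION_CHARS = Decimal("1")

# Assumption: inferred from the 8x ratio on this page, not a quoted price.
CHATTERBOX_EUR_PER_MILLION_CHARS = KOKORO_EUR_PER_MILLION_CHARS * 8


def cost_eur(characters: int, eur_per_million: Decimal) -> Decimal:
    """Return the cost in EUR of synthesizing `characters` characters.

    Assumes linear pricing with no rounding or minimum charge.
    """
    if characters < 0:
        raise ValueError("characters must be non-negative")
    return Decimal(characters) * eur_per_million / Decimal(1_000_000)


if __name__ == "__main__":
    monthly_chars = 250_000_000  # hypothetical monthly volume
    kokoro = cost_eur(monthly_chars, KOKORO_EUR_PER_MILLION_CHARS)
    chatterbox = cost_eur(monthly_chars, CHATTERBOX_EUR_PER_MILLION_CHARS)
    print(f"Kokoro: EUR {kokoro}")
    print(f"Chatterbox (implied): EUR {chatterbox}")
    print(f"Difference: EUR {chatterbox - kokoro}")
```

The difference only becomes significant at high volume, which is why the per-character price matters mostly for notification pipelines and busy agents.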

Limits

Same operational limits as Chatterbox: 50 concurrent generations per tenant, 50,000 characters per request, and output as mp3, wav, flac, or opus.
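A client can stay inside these limits by splitting long input and capping in-flight requests. The Python sketch below shows one way to do that. The two constants come from this page; everything else is an assumption. In particular, `synthesize` is a hypothetical async callable standing in for whatever client call you use (this page does not document the API), and the sketch assumes the service counts characters the way Python's `len` does.

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")

MAX_CHARS_PER_REQUEST = 50_000  # per-request limit stated on this page
MAX_CONCURRENT = 50             # per-tenant concurrency limit stated on this page


def split_text(text: str, limit: int = MAX_CHARS_PER_REQUEST) -> List[str]:
    """Split text into chunks of at most `limit` characters.

    Breaks after the last space or newline in each window when one exists,
    otherwise falls back to a hard cut at the limit.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    chunks: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + limit, len(text))
        if end < len(text):
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut + 1  # keep the whitespace with the preceding chunk
        chunks.append(text[start:end])
        start = end
    return chunks


async def synthesize_all(
    chunks: List[str],
    synthesize: Callable[[str], Awaitable[T]],  # hypothetical client call
    max_concurrent: int = MAX_CONCURRENT,
) -> List[T]:
    """Run `synthesize` over all chunks with bounded concurrency, preserving order."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(chunk: str) -> T:
        async with semaphore:
            return await synthesize(chunk)

    return await asyncio.gather(*(run_one(chunk) for chunk in chunks))
```

Because the concurrency limit is per tenant, a deployment with several workers would need to share the budget between them, for example by giving each worker a smaller `max_concurrent`.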

Best for

Chat-style agents that stream voice replies in real time
Low-latency IVR where first-byte time matters more than fidelity
High-volume voice notification pipelines
Edge or embedded deployments where Chatterbox is too heavy

Upstream source: huggingface.co/hexgrad/Kokoro-82M
