Text-to-speech · TTS
Kokoro
Small, fast TTS — built for low-latency agent loops.
Kokoro is a small (82M-parameter), Apache 2.0-licensed TTS model whose main strength is first-byte latency. It is the right choice when you want voice output that streams quickly and cheaply, and you can trade some prosody for speed.
When to pick Kokoro
- Latency budget is tight (chat-style streaming, IVR call legs)
- Volume is high enough that the per-character cost difference matters
- You do not need voice cloning or emotion control
- English is the dominant language in your workload
When to pick something else
- Voice cloning → Chatterbox or Qwen3-TTS
- Sub-100 ms first-byte with cloning → Qwen3-TTS
- Audiobook-quality narration → Chatterbox
- Languages Kokoro does not cover → Piper or Chatterbox
Pricing
EUR 1 per million characters, one eighth of the Chatterbox rate.
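To make the rate concrete, here is a minimal sketch for estimating spend from a character count. It assumes billing is a straight linear function of characters at the published rate; how characters are counted (whitespace, markup, rounding) is not stated on this page, so treat the result as an approximation. The function and constant names are illustrative only.

```python
from decimal import Decimal

# Published rate from this page: EUR 1 per 1,000,000 characters.
KOKORO_EUR_PER_MILLION_CHARS = Decimal("1")


def estimate_cost_eur(
    characters: int,
    rate_per_million: Decimal = KOKORO_EUR_PER_MILLION_CHARS,
) -> Decimal:
    """Estimate cost in EUR for a number of input characters.

    Assumption: linear pricing with no minimums and no rounding rules.
    """
    if characters < 0:
        raise ValueError("characters must be non-negative")
    return Decimal(characters) * rate_per_million / Decimal(1_000_000)


if __name__ == "__main__":
    # 250,000 characters at EUR 1 per million characters.
    print(estimate_cost_eur(250_000))
```

Using `Decimal` rather than `float` keeps small per-request amounts exact when they are summed across a high-volume pipeline.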
Limits
Same operational limits as Chatterbox: 50 concurrent generations per tenant, 50,000 characters per request, and output as mp3, wav, flac, or opus.
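A client can stay inside these limits by splitting long text before sending it and capping in-flight requests. The sketch below is one way to do that, not the service's own client: `split_text` and `synthesize_all` are hypothetical helpers, and `synthesize` stands in for whatever call you make against your endpoint. Note that the concurrency limit is per tenant, so a per-process semaphore only enforces it if a single process is issuing requests.

```python
import asyncio
from typing import Awaitable, Callable, List

MAX_CHARS_PER_REQUEST = 50_000  # per-request limit stated on this page
MAX_CONCURRENT = 50             # per-tenant concurrency limit stated on this page


def split_text(text: str, limit: int = MAX_CHARS_PER_REQUEST) -> List[str]:
    """Split text into chunks of at most `limit` characters.

    Cuts after the last space or newline inside each window when one exists,
    so words are not split; otherwise falls back to a hard cut.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    chunks: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + limit, len(text))
        if end < len(text):
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut + 1  # keep the whitespace with the preceding chunk
        chunks.append(text[start:end])
        start = end
    return chunks


async def synthesize_all(
    chunks: List[str],
    synthesize: Callable[[str], Awaitable[bytes]],  # hypothetical API call
    max_concurrent: int = MAX_CONCURRENT,
) -> List[bytes]:
    """Run `synthesize` over all chunks, never exceeding `max_concurrent` at once.

    Results come back in the same order as `chunks`.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(chunk: str) -> bytes:
        async with semaphore:
            return await synthesize(chunk)

    return await asyncio.gather(*(bounded(c) for c in chunks))
```

Splitting on whitespace is the simplest safe boundary; for better prosody across chunk joins you may prefer to split on sentence or paragraph boundaries instead.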
Best for
- Chat-style agents that stream voice replies in real time
- Low-latency IVR where first-byte time matters more than fidelity
- High-volume voice notification pipelines
- Edge or embedded deployments where Chatterbox is too heavy
Upstream source: huggingface.co/hexgrad/Kokoro-82M
Request Kokoro access
Get an API key.
Straight pay-per-use against the published rate. No deposit, no minimums. Tell us what you're building and we'll send your API key and endpoint URL within one working day.