Text-to-speech · TTS

Kokoro

Small, fast TTS — built for low-latency agent loops.

Kokoro is a small (82M-parameter), Apache 2.0-licensed TTS model that punches above its weight on first-byte latency. It is the right choice when you want voice output that streams quickly and cheaply, and you can trade some prosody for speed.

When to pick Kokoro

  • Latency budget is tight (chat-style streaming, IVR call legs)
  • Volume is high enough that the per-character cost difference matters
  • You do not need voice cloning or emotion control
  • English is the dominant language in your workload

When to pick something else

Pricing

EUR 1 per million characters, 8× cheaper than Chatterbox.
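
As a rough illustration of the pricing above, the sketch below estimates spend for a given character count. The function name and the workload figure are hypothetical, and the Chatterbox rate of EUR 8 per million characters is an assumption inferred from the "8× cheaper" statement, not a quoted price.

```python
KOKORO_EUR_PER_MILLION = 1.0  # rate stated in the Pricing section
# Assumption: inferred from "8x cheaper than Chatterbox", not a published rate.
CHATTERBOX_EUR_PER_MILLION = 8.0


def estimate_cost_eur(characters: int,
                      eur_per_million: float = KOKORO_EUR_PER_MILLION) -> float:
    """Return the estimated cost in EUR for synthesizing `characters` characters."""
    if characters < 0:
        raise ValueError("characters must be non-negative")
    return characters / 1_000_000 * eur_per_million


# Hypothetical workload: 250 million characters per month.
monthly_chars = 250_000_000
kokoro = estimate_cost_eur(monthly_chars)
chatterbox = estimate_cost_eur(monthly_chars, CHATTERBOX_EUR_PER_MILLION)
print(f"Kokoro: EUR {kokoro:.2f}, Chatterbox: EUR {chatterbox:.2f}")
```

Under these assumptions, the hypothetical workload comes to EUR 250 on Kokoro against EUR 2,000 on Chatterbox. This is the kind of gap the "volume is high enough" criterion refers to.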

Limits

Kokoro has the same operational limits as Chatterbox: 50 concurrent generations per tenant, 50,000 characters per request, and output as mp3, wav, flac, or opus.
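
One way to stay inside these limits on the client side is to split long text into chunks of at most 50,000 characters and cap the number of requests in flight at 50. The sketch below is a minimal illustration, not an official client. `split_text` and `synthesize_all` are hypothetical helper names, and `synthesize` stands in for whatever async function your integration uses to call the TTS endpoint.

```python
import asyncio
import re
from typing import Awaitable, Callable, List

MAX_CHARS_PER_REQUEST = 50_000  # per-request limit stated above
MAX_CONCURRENT = 50             # per-tenant concurrency limit stated above


def split_text(text: str, limit: int = MAX_CHARS_PER_REQUEST) -> List[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    # Split on whitespace that follows '.', '!' or '?'; punctuation is kept.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


async def synthesize_all(
    chunks: List[str],
    synthesize: Callable[[str], Awaitable[bytes]],  # hypothetical: your API call
    max_concurrent: int = MAX_CONCURRENT,
) -> List[bytes]:
    """Run `synthesize` over all chunks with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(chunk: str) -> bytes:
        async with semaphore:
            return await synthesize(chunk)

    # gather preserves input order, so audio segments stay in sequence.
    return await asyncio.gather(*(bounded(c) for c in chunks))
```

Splitting at sentence boundaries is a design choice intended to avoid audible cuts in the middle of a sentence. The semaphore only bounds concurrency within one process, so a tenant running several workers would need to divide the budget of 50 between them.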

Best for

  • Chat-style agents that stream voice replies in real time
  • Low-latency IVR where first-byte time matters more than fidelity
  • High-volume voice notification pipelines
  • Edge or embedded deployments where Chatterbox is too heavy

Upstream source: huggingface.co/hexgrad/Kokoro-82M