Speech · Vision · Voice — live today, pay-per-use
The audio and vision APIs that agents reach for.
Whisper and Qwen3-ASR for speech-to-text, GLM-OCR and Mistral OCR for document vision, and four TTS engines led by Chatterbox and Qwen3-TTS — both with zero-shot voice cloning, both beating ElevenLabs on blind tests and fidelity benchmarks. All OpenAI-compatible endpoints, all hosted next to our LLM inference, all priced near the open-router market.
Why this exists
Speech, vision, and voice should be the same provider as your LLM.
Live today
No founding cohort, no waitlist. Pay-per-use against the published rates below. Sign up, get a key, start calling.
OpenAI-compatible
Endpoints under /v1/audio/transcriptions, /v1/audio/speech, /v1/audio/voice-clones, /v1/images/* — drop-in for any OpenAI-compatible client.
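As a rough illustration of what "drop-in" means in practice, the sketch below builds an OpenAI-style text-to-speech request using only the Python standard library. The base URL, API key, model ID, and voice name are placeholders assumed for the example; only the `/v1/audio/speech` path comes from this page, and the body follows the usual OpenAI `model` / `input` / `voice` shape.

```python
import json
import urllib.request

# Hypothetical values: the real base URL, key, model ID, and voice name come
# from your account details, not from this page.
BASE_URL = "https://api.example.com"
API_KEY = "sk-your-key"


def build_speech_request(
    text: str, model: str = "chatterbox", voice: str = "default"
) -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-style /v1/audio/speech path."""
    body = json.dumps({"model": model, "input": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/audio/speech",
        data=body,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_speech_request("Namaste from Kathmandu.")
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request).read()
```

An existing OpenAI-compatible SDK should need little more than a changed base URL and key, assuming the request and response shapes match as described.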
Open-weight models
Every model is open-weight and commercially licensed. We do not silently route to a closed-source backend.
Hosted next to LLMs
Audio and vision calls share the same Kathmandu private network as our LLM inference — one round trip when you chain them.
Catalog
Two STT, two OCR, four TTS, two of them with voice cloning.
Eight models across three modalities. Each gets its own detail page with sample code, language coverage, and limits.
Speech-to-text
Whisper Large v3
Multilingual speech-to-text — 99 languages, including Nepali.
See Whisper Large v3 →
Qwen3-ASR
Apache 2.0 STT with 52-language coverage, the lower-priced alternative to Whisper.
See Qwen3-ASR →
Document vision · OCR
Mistral OCR
Mistral's flagship document understanding — equations, tables, layout, multilingual. Via Mistral partnership.
See Mistral OCR →
Text-to-speech + voice cloning
Qwen3-TTS
Alibaba's open-weight voice cloning model. 3-second clones, 97 ms latency, Apache 2.0.
See Qwen3-TTS →
Chatterbox
The headline TTS — premium quality, emotion control, zero-shot voice cloning. MIT-licensed.
See Chatterbox →
Piper
Open-weight, efficient TTS — multilingual, CPU-friendly, cheapest in the catalog.
See Piper →
Voice cloning API
Clone a voice with a 5–10 second sample, then use it like any other TTS voice.
The /v1/audio/voice-clones endpoint accepts a short reference sample, validates it (no public figures, written consent attestation required), and returns a voice_id you can pass to the standard /v1/audio/speech endpoint.
Built on Chatterbox (MIT, Resemble AI). Every generated output carries an imperceptible PerTh watermark identifying it as machine-generated — abuse posture written into the product, not the policy.
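The two-step flow (create a clone, then reuse its voice_id for speech) might look like the sketch below. The response shape and the sample `vc_demo_123` identifier are assumptions made for illustration; this page only states that the endpoint returns a voice_id, and the `chatterbox` model ID is a placeholder.

```python
import json


def extract_voice_id(clone_response_body: str) -> str:
    """Pull voice_id out of a voice-clones response (response shape assumed)."""
    payload = json.loads(clone_response_body)
    voice_id = payload.get("voice_id")
    if not voice_id:
        raise ValueError("voice-clones response did not include a voice_id")
    return voice_id


def speech_payload(text: str, voice_id: str, model: str = "chatterbox") -> dict:
    """JSON body for /v1/audio/speech that uses the cloned voice."""
    return {"model": model, "input": text, "voice": voice_id}


# Stand-in for what POST /v1/audio/voice-clones might return.
sample_response = '{"voice_id": "vc_demo_123"}'
voice_id = extract_voice_id(sample_response)
payload = speech_payload("Your order has shipped.", voice_id)
print(json.dumps(payload))
```

Treating the cloned voice as just another value of the `voice` field is what lets existing TTS code stay unchanged once the clone exists.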
See Chatterbox + cloning
Pricing
Eight models. Eight honest prices.
Each unit is the natural unit for that modality — STT bills per hour of audio, OCR per page, TTS per million characters. No surprise per-token tail.
| Model | Modality | License | Languages | Pricing | Voice cloning |
|---|---|---|---|---|---|
| Whisper Large v3 | STT | MIT (OpenAI Whisper open release) | 99 | EUR 0.2 / hour of audio | — |
| GLM-OCR | OCR | Apache 2.0 (Zhipu / GLM open release) | — | EUR 0.0015 / page | — |
| Qwen3-ASR | STT | Apache 2.0 | 52 | EUR 0.15 / hour of audio | — |
| Mistral OCR | OCR | Mistral commercial license (proxied via la Plateforme) — OCR-25-03 weights on request | 11+ (multilingual) | EUR 0.0025 / page | — |
| Qwen3-TTS | TTS | Apache 2.0 | 10 | EUR 12 / million characters | Yes |
| Kokoro | TTS | Apache 2.0 | 8 | EUR 3 / million characters | No |
| Chatterbox | TTS | MIT (Resemble AI), built-in PerTh neural watermarking | 23 | EUR 18 / million characters | Yes |
| Piper | TTS | MIT | 30 | EUR 1.5 / million characters | No |
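For budgeting, the per-unit rates in the table can be turned into a small estimator. This is a sketch only: the dictionary keys are illustrative labels rather than confirmed API model IDs, and it assumes audio is billed pro rata by the second, which this page does not state.

```python
from decimal import Decimal

# EUR rates copied from the pricing table above. Keys are illustrative labels.
STT_EUR_PER_HOUR = {
    "whisper-large-v3": Decimal("0.20"),
    "qwen3-asr": Decimal("0.15"),
}
OCR_EUR_PER_PAGE = {
    "glm-ocr": Decimal("0.0015"),
    "mistral-ocr": Decimal("0.0025"),
}
TTS_EUR_PER_MILLION_CHARS = {
    "qwen3-tts": Decimal("12"),
    "kokoro": Decimal("3"),
    "chatterbox": Decimal("18"),
    "piper": Decimal("1.5"),
}


def stt_cost(model: str, audio_seconds: int) -> Decimal:
    """Cost in EUR, assuming pro-rata billing per second of audio."""
    return STT_EUR_PER_HOUR[model] * Decimal(audio_seconds) / Decimal(3600)


def ocr_cost(model: str, pages: int) -> Decimal:
    """Cost in EUR for a number of pages."""
    return OCR_EUR_PER_PAGE[model] * Decimal(pages)


def tts_cost(model: str, characters: int) -> Decimal:
    """Cost in EUR for a number of input characters."""
    return TTS_EUR_PER_MILLION_CHARS[model] * Decimal(characters) / Decimal(1_000_000)


# Example month: 500 hours of audio, 20,000 pages, 2 million characters.
monthly_total = (
    stt_cost("whisper-large-v3", 500 * 3600)
    + ocr_cost("glm-ocr", 20_000)
    + tts_cost("chatterbox", 2_000_000)
)
print(f"Estimated monthly total: EUR {monthly_total:.2f}")
```

`Decimal` is used instead of floats so that small per-page rates do not accumulate rounding error over large volumes.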
Priced near the open-router market, not under it.
We could undercut OpenAI, Deepgram, and Cartesia by 60–80 % on these models (the Kathmandu cost base would let us), but we don't. STT at EUR 0.20/hour sits below both Deepgram batch ($0.26) and OpenAI Whisper ($0.36). Chatterbox at EUR 18/M chars is roughly half of Cartesia Sonic 3 ($35/M) and about a tenth of ElevenLabs premium. Cheap enough to be the easy choice on cost; expensive enough that the hosting economics work and the service stays reliable. The differentiator is jurisdiction, license, and posture, not a discount race.
Utility safety
Audio in, audio out — handled like infrastructure.
Voice cloning is dangerous if shipped without guardrails. We bake consent, watermarking, and abuse circuit breakers into the API itself instead of bolting them onto a terms-of-service page.
Get utility API access
No deposit. No waitlist. Tell us which APIs you need.
The utility APIs are live and pay-per-use. Drop your details, pick the modality you care about most, and we'll send your API key and endpoint URLs within one working day.
- Whisper + Qwen3-ASR (STT), GLM-OCR + Mistral OCR, Chatterbox / Qwen3-TTS / Kokoro / Piper (TTS) — all live today.
- Straight pay-per-use against the rates above — no minimums, no commitment.
- Voice cloning requires a written consent attestation at clone-creation time.
- NPR, EUR, or USD invoicing — your finance team's currency.
Practical questions
Before you point your SDK at our endpoint.
How does pricing compare to OpenAI, ElevenLabs, Deepgram, and Cartesia?
Whisper at EUR 0.20/hour sits below both Deepgram batch ($0.26/hour) and OpenAI Whisper ($0.36/hour). GLM-OCR at EUR 0.0015/page matches Google Vision and AWS Textract basic OCR. Chatterbox at EUR 18/M chars is roughly half of Cartesia Sonic 3 ($35/M) and roughly a tenth of ElevenLabs premium tier ($180–$300/M). Cheap enough to be the obvious choice on cost, expensive enough that the hosting economics work. It is not a discount race.
How does voice cloning consent work?
When you create a voice clone, the API requires a consent attestation: who provided the sample, when consent was given, and an acknowledgement that the voice is yours or licensed. We reject samples that match known public figures (politicians, celebrities, well-known media personalities). Tenants who repeatedly attempt to clone non-consented voices get suspended.
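A client-side pre-check of the attestation can catch obvious omissions before a request is sent. The sketch below models the three items named above (who provided the sample, when consent was given, and the ownership or license acknowledgement); the class and field names are hypothetical, not the API's documented schema.

```python
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class ConsentAttestation:
    # Field names are hypothetical; check the API reference for the real ones.
    speaker_name: str       # who provided the sample
    consent_date: date      # when consent was given
    owns_or_licensed: bool  # acknowledgement that the voice is yours or licensed

    def problems(self) -> list[str]:
        """Return human-readable issues; an empty list means it looks complete."""
        issues = []
        if not self.speaker_name.strip():
            issues.append("speaker_name is empty")
        if self.consent_date > date.today():
            issues.append("consent_date is in the future")
        if not self.owns_or_licensed:
            issues.append("ownership or license acknowledgement is missing")
        return issues

    def to_form_fields(self) -> dict:
        """Serialise for sending alongside the reference sample."""
        fields = asdict(self)
        fields["consent_date"] = self.consent_date.isoformat()
        return fields


attestation = ConsentAttestation("Asha Gurung", date(2024, 5, 1), True)
print(attestation.problems() or "attestation looks complete")
```

This only checks that the attestation is filled in; the public-figure matching and suspension rules described above are enforced server-side.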
Why is there a watermark on Chatterbox output?
Chatterbox includes PerTh neural watermarking from Resemble AI — an imperceptible signal embedded in the audio that identifies it as machine-generated. This is industry best practice for responsible voice synthesis. The watermark does not affect audio quality.
How does this relate to Voicebox.sh?
Voicebox.sh is Jamie Pine's MIT-licensed local desktop app that bundles seven TTS engines, with Qwen3-TTS as its primary one. It is excellent for laptop-side prototyping and voice production. We host the same Qwen3-TTS engine — plus Chatterbox, Kokoro, and Piper — server-side on our Kathmandu GPUs, so the natural workflow is prototype locally in Voicebox.sh, then point your production code at our hosted API for scale.
What about Voxtral or XTTS?
Voxtral (Mistral) is open-weight but CC-BY-NC 4.0 — commercial hosting requires a separate agreement with Mistral. We are talking to them. XTTS v2 sits under the Coqui Public Model License, which similarly blocks commercial public-API hosting. Neither model is in our catalog today; we will add them only when the licensing path is clean.
Can I chain these with the LLM catalog?
Yes, that is the point. Audio in → Whisper transcription → LLM → Chatterbox audio out, all on one Kathmandu round trip. We see this most in voice agents, IVR replacements, voice-note-to-task pipelines, and multilingual content workflows.
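The chain described here reduces to three calls in sequence. The sketch below keeps each stage as an injected function so the shape is clear without assuming any particular SDK; `voice_turn` and the fake stages are illustrative names, and in real use each callable would wrap a request to the transcription, chat, and speech endpoints respectively.

```python
from typing import Callable


def voice_turn(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],
    complete: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """One voice-agent turn: speech-to-text, then LLM, then text-to-speech."""
    transcript = transcribe(audio_in)
    reply_text = complete(transcript)
    return synthesize(reply_text)


# Fake stages so the example runs offline; they only move text around.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_complete(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


audio_out = voice_turn(b"hello", fake_transcribe, fake_complete, fake_synthesize)
print(audio_out)
```

Passing the stages in as callables also makes it straightforward to swap models per stage, for example a cheaper TTS engine for long replies.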
What about real-time / streaming speech?
Whisper supports streaming transcription on the WebSocket endpoint. Chatterbox and Kokoro support streaming TTS over chunked HTTP, with Kokoro recommended for the lowest first-byte latency.
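On the client side, streaming TTS over chunked HTTP mostly means writing audio as it arrives rather than waiting for the full body. The sketch below consumes any iterable of byte chunks and records time to first byte; `consume_audio_stream` is an illustrative helper, and the list of chunks stands in for a real HTTP response.

```python
import io
import time
from typing import BinaryIO, Iterable, Optional, Tuple


def consume_audio_stream(
    chunks: Iterable[bytes], sink: BinaryIO
) -> Tuple[int, Optional[float]]:
    """Write chunks to sink as they arrive.

    Returns (total_bytes, seconds_to_first_byte); the second value is None
    if the stream produced no data.
    """
    started = time.monotonic()
    first_byte_after: Optional[float] = None
    total = 0
    for chunk in chunks:
        if not chunk:
            continue  # skip empty keep-alive chunks
        if first_byte_after is None:
            first_byte_after = time.monotonic() - started
        sink.write(chunk)
        total += len(chunk)
    return total, first_byte_after


# Offline stand-in for a chunked response body.
buffer = io.BytesIO()
total_bytes, first_byte_seconds = consume_audio_stream([b"ab", b"", b"cde"], buffer)
print(total_bytes, "bytes received")
```

With `urllib`, the chunks could come from `iter(lambda: response.read(4096), b"")` on an open response; measuring first-byte time this way is one way to compare engines on your own network path.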
Is there a free tier?
No persistent free tier — utilities are pay-per-use. We can usually arrange a small trial allowance for serious POCs; talk to us when you sign up.