Speech · Vision · Voice — live today, pay-per-use

The audio and vision APIs that agents reach for.

Whisper and Qwen3-ASR for speech-to-text, GLM-OCR and Mistral OCR for document vision, and four TTS engines led by Chatterbox and Qwen3-TTS — both with zero-shot voice cloning, both beating ElevenLabs on blind tests and fidelity benchmarks. All OpenAI-compatible endpoints, all hosted next to our LLM inference, all priced near the open-router market.

OpenAI-compatible API · Voice cloning included · No payload logging · PerTh watermark on TTS
utilities.stack: Kathmandu GPU racks · OpenAI-compatible
Your agent / app: any OpenAI-compatible SDK or HTTP client
Utility gateway: /v1/audio/* · /v1/images/* · /v1/audio/voice-clones
Open-weight models on our GPUs: Whisper · GLM-OCR · Chatterbox · Qwen3-TTS · Kokoro · Piper

Why this exists

Speech, vision, and voice should be the same provider as your LLM.

01

Live today

No founding cohort, no waitlist. Pay-per-use against the published rates below. Sign up, get a key, start calling.

02

OpenAI-compatible

Endpoints under /v1/audio/transcriptions, /v1/audio/speech, /v1/audio/voice-clones, /v1/images/* — drop-in for any OpenAI-compatible client.
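
As a rough illustration of what "drop-in" means here, the sketch below builds (but does not send) a text-to-speech request using only the Python standard library. The host name, the API key, the model id "chatterbox", and the voice name are placeholders, and the JSON field names are assumed to follow the OpenAI audio/speech request shape; check the model detail pages for the real values.

```python
import json
import urllib.request

# Assumptions (not stated on this page): the base URL, the model id,
# the voice name, and the exact JSON field names.
BASE_URL = "https://api.example.com/v1"  # hypothetical host


def build_speech_request(text: str, model: str, voice: str,
                         api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/audio/speech."""
    body = json.dumps({"model": model, "input": text, "voice": voice})
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/speech",
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


req = build_speech_request(
    "Namaste from Kathmandu.", "chatterbox", "default", "sk-placeholder"
)
# Sending is left to the caller, e.g. urllib.request.urlopen(req).read()
```

An OpenAI-compatible SDK pointed at the same base URL should produce an equivalent request, which is the practical meaning of the compatibility claim.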

03

Open-weight models

Every model is open-weight and commercially licensed. We do not silently route to a closed-source backend.

04

Hosted next to LLMs

Audio and vision calls share the same Kathmandu private network as our LLM inference — one round trip when you chain them.

Catalog

Two STT, two OCR, and four TTS models, two of them with voice cloning.

Eight models across three modalities. Each gets its own detail page with sample code, language coverage, and limits.

Speech-to-text

STT

Whisper Large v3

Multilingual speech-to-text — 99 languages, including Nepali.

See Whisper Large v3 →

STT

Qwen3-ASR

Apache 2.0 speech-to-text with 52-language coverage, offered as an alternative to Whisper (which covers 99).

See Qwen3-ASR →

Document vision · OCR

OCR

GLM-OCR

Document vision for receipts, PDFs, screenshots, forms.

See GLM-OCR →

OCR

Mistral OCR

Mistral's flagship document understanding — equations, tables, layout, multilingual. Via Mistral partnership.

See Mistral OCR →

Text-to-speech + voice cloning

TTS+CLONE

Qwen3-TTS

Alibaba's open-weight voice cloning model. 3-second clones, 97 ms latency, Apache 2.0.

See Qwen3-TTS →

TTS

Kokoro

Small, fast TTS — built for low-latency agent loops.

See Kokoro →

TTS+CLONE

Chatterbox

The headline TTS — premium quality, emotion control, zero-shot voice cloning. MIT-licensed.

See Chatterbox →

TTS

Piper

Open-weight, efficient TTS — multilingual, CPU-friendly, cheapest in the catalog.

See Piper →

Voice cloning API

Clone a voice with a 5–10 second sample, then use it like any other TTS voice.

The /v1/audio/voice-clones endpoint accepts a short reference sample, validates it (no public figures, written consent attestation required), and returns a voice_id you can pass to the standard /v1/audio/speech endpoint.
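
The two-step flow described above (create a clone, then pass its voice_id to the speech endpoint) can be sketched as follows. Only the two paths and the voice_id name come from this page; the request field names, the model id, and the injected `post` transport are illustrative assumptions, and a real client would upload the sample bytes rather than a path.

```python
from typing import Any, Callable, Dict, List, Tuple

# A transport is any callable that POSTs a JSON-like body to a path and
# returns the decoded response. Injecting it keeps this sketch offline.
Transport = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def clone_then_speak(post: Transport, sample_path: str,
                     consent: Dict[str, Any], text: str) -> Dict[str, Any]:
    """Create a voice clone, then synthesize text with the returned voice_id."""
    clone = post("/v1/audio/voice-clones", {
        "sample_path": sample_path,  # hypothetical field name
        "consent": consent,          # hypothetical field name
    })
    voice_id = clone["voice_id"]     # voice_id is the name used on this page
    return post("/v1/audio/speech", {
        "model": "chatterbox",       # hypothetical model id
        "input": text,
        "voice": voice_id,
    })


# Offline stand-in for a real HTTP client; it records each call.
calls: List[Tuple[str, Dict[str, Any]]] = []


def fake_post(path: str, body: Dict[str, Any]) -> Dict[str, Any]:
    calls.append((path, body))
    if path.endswith("/voice-clones"):
        return {"voice_id": "voice_demo_123"}
    return {"status": "ok", "voice": body["voice"]}


result = clone_then_speak(fake_post, "sample.wav", {"attested": True}, "Hello")
```

Swapping `fake_post` for a function that performs the real HTTP call leaves the flow unchanged.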

Built on Chatterbox (MIT, Resemble AI). Every generated output carries an imperceptible PerTh watermark identifying it as machine-generated — abuse posture written into the product, not the policy.

See Chatterbox + cloning

Pricing

Eight models. Eight honest prices.

Each unit is the natural unit for that modality — STT bills per hour of audio, OCR per page, TTS per million characters. No surprise per-token tail.

| Model | Modality | License | Languages | Pricing | Voice cloning |
|---|---|---|---|---|---|
| Whisper Large v3 | STT | MIT (OpenAI Whisper open release) | 99 | EUR 0.20 / hour of audio | |
| GLM-OCR | OCR | Apache 2.0 (Zhipu / GLM open release) | | EUR 0.0015 / page | |
| Qwen3-ASR | STT | Apache 2.0 | 52 | EUR 0.15 / hour of audio | |
| Mistral OCR | OCR | Mistral commercial license (proxied via la Plateforme); OCR-25-03 weights on request | 11+ (multilingual) | EUR 0.0025 / page | |
| Qwen3-TTS | TTS | Apache 2.0 | 10 | EUR 12 / million characters | Yes |
| Kokoro | TTS | Apache 2.0 | 8 | EUR 3 / million characters | No |
| Chatterbox | TTS | MIT (Resemble AI), built-in PerTh neural watermarking | 23 | EUR 18 / million characters | Yes |
| Piper | TTS | MIT | 30 | EUR 1.5 / million characters | No |
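
To make the per-unit billing concrete, here is a small cost estimator built from the published rates. It is a sketch for back-of-envelope budgeting only: the dictionary keys are hypothetical identifiers rather than confirmed API model names, and it ignores any rounding or minimum-charge rules the real billing may apply.

```python
from decimal import Decimal

# Rates copied from the pricing table. Each entry is
# (EUR price, number of units that price covers).
RATES_EUR = {
    "whisper-large-v3": (Decimal("0.20"), 1),        # per hour of audio
    "qwen3-asr":        (Decimal("0.15"), 1),        # per hour of audio
    "glm-ocr":          (Decimal("0.0015"), 1),      # per page
    "mistral-ocr":      (Decimal("0.0025"), 1),      # per page
    "qwen3-tts":        (Decimal("12"), 1_000_000),  # per million characters
    "kokoro":           (Decimal("3"), 1_000_000),
    "chatterbox":       (Decimal("18"), 1_000_000),
    "piper":            (Decimal("1.5"), 1_000_000),
}


def estimate_eur(model: str, quantity) -> Decimal:
    """Estimate cost in EUR; quantity is hours, pages, or characters."""
    price, per_units = RATES_EUR[model]
    return price * Decimal(str(quantity)) / Decimal(per_units)


ten_hours_stt = estimate_eur("whisper-large-v3", 10)        # 10 hours of audio
quarter_million_chars = estimate_eur("chatterbox", 250_000)  # 250k characters
thousand_pages = estimate_eur("glm-ocr", 1000)               # 1,000 pages
```

Decimal is used instead of float so that small per-page rates do not accumulate binary rounding noise.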

Priced near the open-router market, not under it.

We could undercut OpenAI, Deepgram, and Cartesia by 60–80% on these models (the Kathmandu cost base would let us), but we don't. STT at EUR 0.20/hour lands slightly below both Deepgram batch ($0.26) and OpenAI Whisper ($0.36). Chatterbox at EUR 18/M chars is a little over half of Cartesia Sonic 3 ($35/M) and about a tenth of ElevenLabs premium. Cheap enough to be the easy choice on cost; expensive enough that the hosting economics work and the service stays reliable. The differentiator is jurisdiction, license, and posture, not a discount race.

Utility safety

Audio in, audio out — handled like infrastructure.

Voice cloning is dangerous if shipped without guardrails. We bake consent, watermarking, and abuse circuit breakers into the API itself instead of bolting them onto a terms-of-service page.

No payload logging: Audio, images, and generated voice are not logged by default. Tenants can opt in to per-call logging for their own debugging.
No training on customer data: We do not use customer audio, images, prompts, or outputs to train our hosted models.
PerTh watermark on TTS: Chatterbox outputs carry an imperceptible watermark identifying them as machine-generated.
Consent-gated voice cloning: Voice clone creation requires a written consent attestation. Celebrity / public-figure clones are blocked at validation.
Tenant-scoped voices: Voice clones are private to your tenant; we do not share them with any other customer.
Abuse circuit breakers: Rate limits, output caps, and pattern detection on generated audio cut runaway or abusive usage.

Practical questions

Before you point your SDK at our endpoint.

How does pricing compare to OpenAI, ElevenLabs, Deepgram, and Cartesia?

Whisper at EUR 0.20/hour lands slightly below both Deepgram batch ($0.26/hour) and OpenAI Whisper ($0.36/hour). GLM-OCR at EUR 0.0015/page matches Google Vision and AWS Textract basic OCR. Chatterbox at EUR 18/M chars is a little over half of Cartesia Sonic 3 ($35/M) and roughly a tenth of ElevenLabs premium tier ($180–$300/M). Cheap enough to be the obvious choice on cost, expensive enough that the hosting economics work. It is not a discount race.
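
Because our rates are quoted in EUR and the competitor figures in USD, these comparisons move with the exchange rate. The sketch below recomputes the ratios under an assumed rate of 1.10 USD per EUR; that rate is an illustrative assumption, not a quoted figure, so substitute a current one before drawing conclusions.

```python
# Assumption: an illustrative exchange rate, not a live or quoted value.
ASSUMED_USD_PER_EUR = 1.10


def price_ratio(our_eur: float, their_usd: float,
                usd_per_eur: float = ASSUMED_USD_PER_EUR) -> float:
    """Return our price as a fraction of a competitor's USD price."""
    return (our_eur * usd_per_eur) / their_usd


# Prices below are the ones quoted in the answer above.
whisper_vs_deepgram = price_ratio(0.20, 0.26)   # per hour of audio
whisper_vs_openai = price_ratio(0.20, 0.36)     # per hour of audio
chatterbox_vs_cartesia = price_ratio(18, 35)    # per million characters
chatterbox_vs_elevenlabs = price_ratio(18, 180)  # low end of the $180-$300 range
```

A ratio below 1.0 means our price is lower than the competitor's at the assumed exchange rate.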

How does voice cloning consent work?

When you create a voice clone, the API requires a consent attestation: who provided the sample, when consent was given, and an acknowledgement that the voice is yours or licensed. We reject samples that match known public figures (politicians, celebrities, well-known media personalities). Tenants who repeatedly attempt to clone non-consented voices get suspended.
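
The three attestation elements named above (who provided the sample, when consent was given, and the rights acknowledgement) could be checked client-side before calling the API. This is a sketch under assumptions: the class and field names are hypothetical, and the server-side validation, including the public-figure check, is not reproduced here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ConsentAttestation:
    """Client-side model of the attestation; field names are hypothetical."""
    provided_by: str             # who provided the sample
    consent_given_at: datetime   # when consent was given (timezone-aware)
    rights_acknowledged: bool    # the voice is yours or licensed

    def validate(self) -> None:
        if not self.provided_by.strip():
            raise ValueError("provided_by must name who supplied the sample")
        if self.consent_given_at.tzinfo is None:
            raise ValueError("consent_given_at must be timezone-aware")
        if self.consent_given_at > datetime.now(timezone.utc):
            raise ValueError("consent_given_at cannot be in the future")
        if not self.rights_acknowledged:
            raise ValueError("rights must be acknowledged before cloning")

    def to_payload(self) -> dict:
        """Validate, then return a JSON-serializable dictionary."""
        self.validate()
        return {
            "provided_by": self.provided_by,
            "consent_given_at": self.consent_given_at.isoformat(),
            "rights_acknowledged": self.rights_acknowledged,
        }


attestation = ConsentAttestation(
    provided_by="Asha Example",
    consent_given_at=datetime(2024, 1, 15, tzinfo=timezone.utc),
    rights_acknowledged=True,
)
payload = attestation.to_payload()
```

Failing fast on the client avoids a round trip for requests that would be rejected anyway, but it does not replace the server's own checks.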

Why is there a watermark on Chatterbox output?

Chatterbox includes PerTh neural watermarking from Resemble AI — an imperceptible signal embedded in the audio that identifies it as machine-generated. This is industry best practice for responsible voice synthesis. The watermark does not affect audio quality.

How does this relate to Voicebox.sh?

Voicebox.sh is Jamie Pine's MIT-licensed local desktop app that bundles seven TTS engines, with Qwen3-TTS as its primary one. It is excellent for laptop-side prototyping and voice production. We host the same Qwen3-TTS engine, plus Chatterbox, Kokoro, and Piper, server-side on our Kathmandu GPUs, so the natural workflow is to prototype locally in Voicebox.sh, then point your production code at our hosted API for scale.

What about Voxtral or XTTS?

Voxtral (Mistral) is open-weight but CC-BY-NC 4.0 — commercial hosting requires a separate agreement with Mistral. We are talking to them. XTTS v2 sits under the Coqui Public Model License, which similarly blocks commercial public-API hosting. Neither model is in our catalog today; we will add them only when the licensing path is clean.

Can I chain these with the LLM catalog?

Yes, that is the point. Audio in → Whisper transcription → LLM → Chatterbox audio out, all on one Kathmandu round trip. We see this most in voice agents, IVR replacements, voice-note-to-task pipelines, and multilingual content workflows.
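
The chain described above can be sketched as three composed steps. The function and parameter names are illustrative, and each step is injected as a plain callable so that the example runs offline; in practice each callable would wrap a call to the transcription, LLM, or speech endpoint.

```python
from typing import Callable


def voice_turn(audio_in: bytes,
               transcribe: Callable[[bytes], str],
               complete: Callable[[str], str],
               synthesize: Callable[[str], bytes]) -> bytes:
    """Run one voice-agent turn: speech -> text -> LLM reply -> speech."""
    transcript = transcribe(audio_in)  # e.g. Whisper via /v1/audio/transcriptions
    reply = complete(transcript)       # e.g. an LLM chat completion
    return synthesize(reply)           # e.g. Chatterbox via /v1/audio/speech


# Offline stand-ins so the flow can be exercised without a network.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_complete(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


audio_out = voice_turn(b"book a table", fake_transcribe,
                       fake_complete, fake_synthesize)
```

Keeping the three steps as separate callables makes it straightforward to swap Whisper for Qwen3-ASR or Chatterbox for Kokoro without touching the pipeline itself.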

What about real-time / streaming speech?

Whisper supports streaming transcription on the WebSocket endpoint. Chatterbox and Kokoro support streaming TTS over chunked HTTP, with Kokoro recommended for the lowest first-byte latency.
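
On the client side, streaming TTS over chunked HTTP amounts to reading the response body in pieces and handing each piece to a player or buffer as it arrives. The sketch below shows that read loop against any file-like object; it is demonstrated with an in-memory stream, and it assumes the real response would come from something like `urllib.request.urlopen`, whose response object decodes chunked transfer encoding behind the same `read(n)` call.

```python
import io
from typing import BinaryIO, Iterator


def iter_audio_chunks(stream: BinaryIO, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield audio bytes from a file-like response as they become available."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes means the stream is finished
            break
        yield chunk


# In-memory stand-in for a streamed HTTP response body.
fake_body = io.BytesIO(b"\x00" * 10_000)
chunks = list(iter_audio_chunks(fake_body))
total_bytes = sum(len(c) for c in chunks)
```

Playback can begin on the first yielded chunk, which is why first-byte latency is the figure that matters for agent loops.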

Is there a free tier?

No persistent free tier — utilities are pay-per-use. We can usually arrange a small trial allowance for serious POCs; talk to us when you sign up.