Speech · Vision · Voice — live today, pay-per-use

The audio and vision APIs that agents reach for.

Whisper and Qwen3-ASR for speech-to-text. GLM-OCR, dots.ocr, and olmOCR-2 for document vision, with native Nepali / Devanagari support. Four TTS engines led by Chatterbox and Qwen3-TTS, both with zero-shot voice cloning; Chatterbox was preferred over ElevenLabs in published blind tests. All OpenAI-compatible endpoints, all hosted next to our LLM inference, all priced near the open-router market.

Get API access · See the catalog

OpenAI-compatible API · Voice cloning included · No payload logging · PerTh watermark on TTS

utilities.stack: Our GPU racks · OpenAI-compatible

Your agent / app: any OpenAI-compatible SDK or HTTP client

Utility gateway: /v1/audio/* · /v1/images/* · /v1/audio/voice-clones

STT · OCR · TTS · voice cloning: nine open-weight models on our GPUs

Why this exists

Speech, vision, and voice should come from the same provider as your LLM.

Live today

No waitlist. Pay-per-use against the published rates below. Sign up, get a key, start calling.

OpenAI-compatible

Endpoints under /v1/audio/transcriptions, /v1/audio/speech, /v1/audio/voice-clones, /v1/images/* — drop-in for any OpenAI-compatible client.
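As a minimal sketch of what "drop-in" could look like using only the Python standard library: the base URL, the API key, the model identifier `chatterbox`, and the voice name `default` are placeholders, and the JSON field names (`model`, `input`, `voice`) follow OpenAI's speech API on the assumption that this gateway mirrors them. The request is built but not sent.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder; use the endpoint URL issued with your key
API_KEY = "YOUR_API_KEY"              # placeholder


def build_speech_request(text, model="chatterbox", voice="default"):
    """Build (but do not send) a POST to /v1/audio/speech.

    Field names follow the OpenAI speech API; the model and voice
    identifiers are assumptions, not confirmed catalog IDs.
    """
    body = json.dumps({"model": model, "input": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/audio/speech",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )


req = build_speech_request("Namaste from the utility gateway.")
print(req.get_method(), req.full_url)
# Sending it would be: audio_bytes = urllib.request.urlopen(req).read()
```

An existing OpenAI SDK client would typically need only its base URL and key changed, provided the gateway accepts the same request shape.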

Open-weight models

Every model is open-weight and commercially licensed. We do not silently route to a closed-source backend.

Hosted next to LLMs

Audio and vision calls share the same private network as our LLM inference — one round trip when you chain them.

Catalog

Two STT, three OCR, four TTS — nine open-weight models.

Nine models across three modalities. Each gets its own detail page with sample code, language coverage, and limits.

Speech-to-text

STT

Whisper Large v3

Multilingual speech-to-text — 99 languages, including Nepali.

See Whisper Large v3 →

STT

Qwen3-ASR

Apache 2.0 STT with 52-language coverage, an alternative to Whisper Large v3.

See Qwen3-ASR →

Document vision · OCR

OCR

GLM-OCR

Document vision for receipts, PDFs, screenshots, forms.

See GLM-OCR →

OCR

dots.ocr

Multilingual document OCR — 100+ languages including Nepali. MIT, 3B.

See dots.ocr →

OCR

olmOCR-2

Premium document OCR — English / academic / legal / handwritten. Apache 2.0, Ai2.

See olmOCR-2 →

Text-to-speech + voice cloning

TTS+CLONE

Chatterbox

The headline TTS — premium quality, emotion control, zero-shot voice cloning. MIT-licensed.

See Chatterbox →

TTS+CLONE

Qwen3-TTS

Alibaba's open-weight voice cloning model. 3-second clones, 97 ms latency, Apache 2.0.

See Qwen3-TTS →

TTS

Kokoro

Small, fast TTS — built for low-latency agent loops.

See Kokoro →

TTS

Piper

Open-weight, efficient TTS — multilingual and CPU-friendly.

See Piper →

Voice cloning API

Clone a voice with a 5–10 second sample, then use it like any other TTS voice.

The /v1/audio/voice-clones endpoint accepts a short reference sample, validates it (no public figures, written consent attestation required), and returns a voice_id you can pass to the standard /v1/audio/speech endpoint.
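The two-step flow described above might look roughly like the sketch below. The wire format is not specified on this page, so everything except the endpoint paths and the `voice_id` field is an assumption: the JSON body, the base64 sample encoding, the consent field names, and the `chatterbox` model identifier are all hypothetical. No network call is made; a stand-in string plays the role of the HTTP response.

```python
import base64
import json


def build_clone_payload(sample: bytes, speaker_name: str, consent_given_on: str) -> dict:
    """Assumed body for POST /v1/audio/voice-clones (field names are hypothetical)."""
    return {
        "sample_b64": base64.b64encode(sample).decode("ascii"),
        "consent": {
            "speaker_name": speaker_name,
            "consent_given_on": consent_given_on,
            "voice_is_mine_or_licensed": True,
        },
    }


def voice_id_from_response(raw: str) -> str:
    """Pull out voice_id, the one field the page says the endpoint returns."""
    return json.loads(raw)["voice_id"]


def build_speech_payload(voice_id: str, text: str) -> dict:
    """Body for POST /v1/audio/speech that uses the cloned voice."""
    return {"model": "chatterbox", "voice": voice_id, "input": text}


clone_body = build_clone_payload(b"\x00\x01fake-wav-bytes", "Asha Example", "2025-01-15")
fake_response = '{"voice_id": "voice_123"}'  # stand-in for the real HTTP response
speech_body = build_speech_payload(voice_id_from_response(fake_response), "Hello in my own voice.")
print(speech_body["voice"])
```

Keeping payload construction separate from transport makes it easy to swap in whichever HTTP client or SDK the application already uses.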

Built on Chatterbox (MIT, Resemble AI). Every generated output carries an imperceptible PerTh watermark identifying it as machine-generated — abuse posture written into the product, not the policy.

See Chatterbox + cloning

Pricing

Nine models, nine published prices.

Each unit is the natural unit for that modality — STT bills per hour of audio, OCR per page, TTS per million characters. No surprise per-token tail.

Model	Modality	License	Languages	Pricing	Voice cloning
Whisper Large v3	STT	MIT (OpenAI Whisper open release)	99	EUR 0.15 / hour of audio	—
Qwen3-ASR	STT	Apache 2.0	52	EUR 0.15 / hour of audio	—
GLM-OCR	OCR	Apache 2.0 (Zhipu / GLM open release)	—	EUR 0.0015 / page	—
dots.ocr	OCR	MIT (rednote-hilab)	100+ (Nepali / Devanagari, Hindi, Bengali, Arabic, Thai, plus most Asian and European scripts)	EUR 0.002 / page	—
olmOCR-2	OCR	Apache 2.0 (Allen Institute for AI)	English-first; competent on European Latin scripts	EUR 0.0025 / page	—
Chatterbox	TTS	MIT (Resemble AI), built-in PerTh neural watermarking	23	EUR 18 / million characters	Yes
Qwen3-TTS	TTS	Apache 2.0	10	EUR 12 / million characters	Yes
Kokoro	TTS	Apache 2.0	8	EUR 3 / million characters	No
Piper	TTS	MIT	30	EUR 2 / million characters	No
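To turn the table into a rough monthly estimate, a small calculator along these lines may help. The rates are copied from the table above; the dictionary keys are made-up labels for illustration, not confirmed API model identifiers, and the sketch ignores any volume tiers.

```python
from decimal import Decimal

# (billing unit, EUR per unit), copied from the pricing table.
RATES = {
    "whisper-large-v3": ("hour", Decimal("0.15")),
    "qwen3-asr": ("hour", Decimal("0.15")),
    "glm-ocr": ("page", Decimal("0.0015")),
    "dots-ocr": ("page", Decimal("0.002")),
    "olmocr-2": ("page", Decimal("0.0025")),
    "chatterbox": ("million_chars", Decimal("18")),
    "qwen3-tts": ("million_chars", Decimal("12")),
    "kokoro": ("million_chars", Decimal("3")),
    "piper": ("million_chars", Decimal("2")),
}


def estimate_cost(model: str, quantity) -> Decimal:
    """Return the EUR cost for `quantity` hours, pages, or characters.

    TTS quantities are given in characters and converted to millions here.
    """
    unit, rate = RATES[model]
    amount = Decimal(str(quantity))
    if unit == "million_chars":
        amount = amount / Decimal(1_000_000)
    return rate * amount


monthly = (
    estimate_cost("whisper-large-v3", 200)     # 200 hours of audio
    + estimate_cost("dots-ocr", 50_000)        # 50,000 pages
    + estimate_cost("chatterbox", 4_000_000)   # 4 million characters
)
print(f"EUR {monthly:.2f}")  # prints "EUR 202.00"
```

`Decimal` is used instead of floats so that small per-page rates add up exactly.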

Priced near the open-router market, not under it.

STT at EUR 0.15/hour sits below Deepgram Nova-3 batch ($0.26/hour) and well under the OpenAI Whisper API ($0.36/hour). TTS at EUR 18/M characters is roughly half of Cartesia Sonic 3 and a fifth of ElevenLabs Multilingual v2 ($100/M PAYG). Cheap enough to be the easy choice on cost; expensive enough that the hosting economics work and the service stays reliable. The differentiator is jurisdiction, license, and posture, not a discount race.

Utility safety

Audio in, audio out — handled like infrastructure.

Voice cloning is dangerous if shipped without guardrails. We bake consent, watermarking, and abuse circuit breakers into the API itself instead of bolting them onto a terms-of-service page.

No payload logging

Audio, images, and generated voice are not logged by default. Tenants can opt in to per-call logging for their own debugging.

No training on customer data

We do not use customer audio, images, prompts, or outputs to train our hosted models.

PerTh watermark on TTS

Chatterbox outputs carry an imperceptible watermark identifying them as machine-generated.

Consent-gated voice cloning

Voice clone creation requires a written consent attestation. Celebrity / public-figure clones are blocked at validation.

Tenant-scoped voices

Voice clones are private to your tenant; we do not share them with any other customer.

Abuse circuit breakers

Rate limits, output caps, and pattern detection on generated audio cut runaway or abusive usage.

Get utility API access

No deposit. No waitlist. Tell us which APIs you need.

The utility APIs are live and pay-per-use. Drop your details, pick the modality you care about most, and we'll send your API key and endpoint URLs within one working day.

Whisper + Qwen3-ASR (STT), GLM-OCR + dots.ocr + olmOCR-2 (OCR including Nepali), Chatterbox / Qwen3-TTS / Kokoro / Piper (TTS) — all live today.
Straight pay-per-use against the rates above — no minimums, no commitment.
Voice cloning requires a written consent attestation at clone-creation time.
NPR, EUR, or USD invoicing — your finance team's currency.

Continue in the ScaLabs Cloud Console

We'll create your account and email you a 6-digit sign-in code. Finish the request inside the console.

Practical questions

Before you point your SDK at our endpoint.

How does pricing compare to OpenAI, ElevenLabs, Deepgram, and Cartesia?

Whisper at EUR 0.15/hour sits below Deepgram Nova-3 batch ($0.26/hour) and well under OpenAI Whisper ($0.36/hour). On OCR, GLM-OCR at EUR 0.0015/page matches the Google Vision / AWS Textract entry rate and drops to EUR 0.0008/page above 500k pages/month; dots.ocr at EUR 0.0020/page covers Devanagari and 100+ languages; olmOCR-2 at EUR 0.0025/page handles academic, legal, and handwritten documents. All three cost well under any LLM-vision call. Chatterbox at EUR 18/M characters is roughly half of Cartesia Sonic 3 ($35/M) and roughly a fifth of ElevenLabs Multilingual v2 ($100/M PAYG). Cheap enough to be the obvious choice on cost, expensive enough that the hosting economics work. It is not a discount race.
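The GLM-OCR volume discount can be sketched as a two-tier calculation. One assumption needs flagging: this answer does not say whether the EUR 0.0008 rate applies to every page once the threshold is crossed or only to pages beyond 500k, and the sketch below assumes the latter (marginal tiering). Confirm the actual billing rule before relying on the numbers.

```python
from decimal import Decimal

GLM_OCR_BASE_RATE = Decimal("0.0015")    # EUR per page, first 500k pages in a month
GLM_OCR_VOLUME_RATE = Decimal("0.0008")  # EUR per page above the threshold
VOLUME_THRESHOLD = 500_000               # pages per month


def glm_ocr_monthly_cost(pages: int) -> Decimal:
    """Monthly GLM-OCR cost in EUR, ASSUMING the lower rate is marginal."""
    base_pages = min(pages, VOLUME_THRESHOLD)
    extra_pages = max(pages - VOLUME_THRESHOLD, 0)
    return GLM_OCR_BASE_RATE * base_pages + GLM_OCR_VOLUME_RATE * extra_pages


print(glm_ocr_monthly_cost(100_000))  # 100k pages, all at the base rate
print(glm_ocr_monthly_cost(600_000))  # 500k at base rate + 100k at volume rate
```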

How does voice cloning consent work?

When you create a voice clone, the API requires a consent attestation: who provided the sample, when consent was given, and an acknowledgement that the voice is yours or licensed. We reject samples that match known public figures (politicians, celebrities, well-known media personalities). Tenants who repeatedly attempt to clone non-consented voices get suspended.
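Because a rejected attestation costs a round trip, it can be worth checking the three pieces of information client-side before calling the API. The sketch below is illustrative only: the class and field names are hypothetical, and the server's real validation rules (including public-figure matching) are not reproduced here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ConsentAttestation:
    """Hypothetical client-side record of the three attested facts."""
    sample_provided_by: str        # who provided the sample
    consent_given_on: date         # when consent was given
    voice_owned_or_licensed: bool  # acknowledgement that the voice is yours or licensed

    def problems(self, today: date) -> list:
        """Return a list of human-readable issues; empty means it looks complete."""
        issues = []
        if not self.sample_provided_by.strip():
            issues.append("sample_provided_by is empty")
        if self.consent_given_on > today:
            issues.append("consent_given_on is in the future")
        if not self.voice_owned_or_licensed:
            issues.append("ownership or licence acknowledgement is missing")
        return issues


today = date(2025, 6, 1)
good = ConsentAttestation("Asha Example", date(2025, 5, 20), True)
bad = ConsentAttestation("  ", date(2025, 7, 1), False)
print(good.problems(today))
print(len(bad.problems(today)))
```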

Why is there a watermark on Chatterbox output?

Chatterbox includes PerTh neural watermarking from Resemble AI — an imperceptible signal embedded in the audio that identifies it as machine-generated. This is industry best practice for responsible voice synthesis. The watermark does not affect audio quality.

How does this relate to Voicebox.sh?

Voicebox.sh is Jamie Pine's MIT-licensed local desktop app that bundles seven TTS engines, with Qwen3-TTS as its primary one. It is excellent for laptop-side prototyping and voice production. We host the same Qwen3-TTS engine — plus Chatterbox, Kokoro, and Piper — server-side on our GPUs, so the natural workflow is prototype locally in Voicebox.sh, then point your production code at our hosted API for scale.

What about Voxtral or XTTS?

Voxtral (Mistral) is open-weight but CC-BY-NC 4.0 — commercial hosting requires a separate agreement with Mistral. XTTS v2 sits under the Coqui Public Model License, which similarly blocks commercial public-API hosting. Neither model is in our catalog today; we will add them only when the licensing path is clean.

Can I chain these with the LLM catalog?

Yes, that is the point. Audio in → Whisper transcription → LLM → Chatterbox audio out, all on one round trip. We see this most in voice agents, IVR replacements, voice-note-to-task pipelines, and multilingual content workflows.
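The chain can be sketched as three injected steps, which keeps the pipeline independent of any particular SDK. The function and parameter names are illustrative; in a real deployment each callable would wrap a call to `/v1/audio/transcriptions`, the LLM endpoint, and `/v1/audio/speech` respectively. Fakes stand in for those calls here so the example runs offline.

```python
from typing import Callable


def run_voice_turn(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],
    complete: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """One voice-agent turn: speech-to-text, then LLM, then text-to-speech."""
    user_text = transcribe(audio_in)
    reply_text = complete(user_text)
    return synthesize(reply_text)


# Offline stand-ins for the three hosted calls.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_complete(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_synthesize(text: str) -> bytes:
    return b"audio:" + text.encode("utf-8")


audio_out = run_voice_turn(b"hello", fake_transcribe, fake_complete, fake_synthesize)
print(audio_out)  # b'audio:You said: hello'
```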

What about real-time / streaming speech?

Whisper supports streaming transcription on the WebSocket endpoint. Chatterbox and Kokoro support streaming TTS over chunked HTTP, with Kokoro recommended for the lowest first-byte latency.
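On the client side, streamed TTS mostly comes down to writing chunks as they arrive and noting when the first one lands. The sketch below is a generic consumer, not the service's SDK: it takes any iterable of byte chunks, so with a real HTTP client the chunks would come from reading the chunked response in blocks. A hard-coded list stands in for the network stream here.

```python
import io
import time
from typing import BinaryIO, Callable, Iterable, Optional, Tuple


def consume_audio_stream(
    chunks: Iterable[bytes],
    sink: BinaryIO,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[int, Optional[float]]:
    """Write chunks to `sink` as they arrive.

    Returns (total bytes written, seconds until the first non-empty chunk).
    The second value is None if the stream produced no audio.
    """
    start = clock()
    first_chunk_after = None
    total = 0
    for chunk in chunks:
        if not chunk:
            continue  # skip keep-alive or empty chunks
        if first_chunk_after is None:
            first_chunk_after = clock() - start
        sink.write(chunk)
        total += len(chunk)
    return total, first_chunk_after


buffer = io.BytesIO()
fake_stream = [b"RIFF", b"", b"data"]  # stand-in for a chunked HTTP response
total_bytes, first_latency = consume_audio_stream(fake_stream, buffer)
print(total_bytes, buffer.getvalue())
```

Measuring time to first chunk this way gives a like-for-like number when comparing engines such as Chatterbox and Kokoro in an agent loop.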

Is there a free tier?

No persistent free tier — utilities are pay-per-use. We can usually arrange a small trial allowance for serious POCs; talk to us when you sign up.