Speech-to-text · STT
Whisper Large v3
Multilingual speech-to-text — 99 languages, including Nepali.
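Whisper Large v3 is the third-generation release of OpenAI’s open Whisper model: robust, multilingual speech recognition. The original Whisper was trained on 680k hours of multilingual and multitask supervised data from the web; per OpenAI’s model card, Large v3 was trained on roughly 1 million hours of weakly labeled audio plus 4 million hours of audio pseudo-labeled with Large v2.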
We host it on Kathmandu GPUs with two endpoint shapes:
- /v1/audio/transcriptions: OpenAI-compatible drop-in. Send a file or a URL; get word-level timestamps, language auto-detect, and an optional diarization track back.
- /v1/audio/translations: translates any of the 99 supported source languages directly into English.
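As a rough illustration of the file-upload path, the sketch below builds a multipart request for either endpoint using only the Python standard library. The base URL and the model identifier are placeholders (the real endpoint URL arrives with your API key), and the `response_format` and `timestamp_granularities[]` fields are taken from the public OpenAI audio API on the assumption that "OpenAI-compatible" covers them. URL input and diarization are not shown because their parameter names are not documented here.

```python
import json
import os
import urllib.request
import uuid

# Placeholder: the real endpoint URL is sent together with your API key.
BASE_URL = "https://api.example.invalid"
# Assumed model identifier; use whatever name is issued with your key.
MODEL_ID = "whisper-large-v3"


def build_multipart(fields, filename, file_bytes):
    """Encode text fields plus one file part as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields:
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f"{value}\r\n").encode("utf-8")
        )
    parts.append(
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         "Content-Type: application/octet-stream\r\n\r\n").encode("utf-8")
        + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_request(api_key, filename, file_bytes, task="transcriptions"):
    """Build (but do not send) a POST to /v1/audio/<task>."""
    fields = [("model", MODEL_ID)]
    if task == "transcriptions":
        # Field names from the OpenAI audio API; assumed to be honoured here.
        fields += [("response_format", "verbose_json"),
                   ("timestamp_granularities[]", "word")]
    body, content_type = build_multipart(fields, filename, file_bytes)
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/{task}",
        data=body,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": content_type},
        method="POST",
    )


def transcribe(api_key, path, task="transcriptions"):
    """Send the file and return the decoded JSON response."""
    # Reads the whole file into memory; fine for a sketch, not for 1 GB uploads.
    with open(path, "rb") as f:
        req = build_request(api_key, os.path.basename(path), f.read(), task)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling `transcribe(key, "note.ogg")` would hit the transcription endpoint, and passing `task="translations"` would target the English-translation endpoint instead.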
When to pick it
- Anything an agent needs to listen to: voice messages, calls, voice notes, meeting recordings, podcast clips
- Multilingual workflows where you cannot pin the input language ahead of time
- Pipelines where you want to chain transcription into a hosted-LLM call in the same private network round trip
Pricing
EUR 0.10 per hour of audio, about a third of the public OpenAI Whisper API rate. Billed in 1-second increments after the first 30 seconds; the first 60 minutes per tenant per month are free as a try-it allowance.
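To make the arithmetic concrete, here is a small estimator. It rests on two assumptions that the pricing text does not spell out: that "after the first 30 seconds" means each request is billed for at least 30 seconds, and that the free 60 minutes are deducted from the month's billed seconds before the rate applies. Treat it as a back-of-envelope sketch, not the billing logic.

```python
import math

RATE_EUR_PER_HOUR = 0.10          # published rate
MIN_BILLED_SECONDS = 30           # assumption: per-request minimum
FREE_SECONDS_PER_MONTH = 60 * 60  # free try-it allowance per tenant


def billed_seconds(duration_s):
    """Round up to whole seconds, with the assumed 30-second floor."""
    return max(MIN_BILLED_SECONDS, math.ceil(duration_s))


def monthly_cost_eur(durations_s):
    """Estimate one tenant's monthly bill for a list of clip durations."""
    total = sum(billed_seconds(d) for d in durations_s)
    chargeable = max(0, total - FREE_SECONDS_PER_MONTH)
    return chargeable * RATE_EUR_PER_HOUR / 3600
```

Under these assumptions, 101 one-hour recordings in a month would leave 100 chargeable hours, or about EUR 10.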
Limits
- Per-tenant rate limit: 60 minutes of audio per minute (60× real-time)
- File size limit: 1 GB per request
- Supported formats: mp3, wav, flac, m4a, ogg, opus, webm
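A client can check these limits before uploading. The sketch below is one way to do that; it assumes "1 GB" means 10^9 bytes and judges the format by file extension only, which is a heuristic (the server sees the actual container). The helper names are illustrative.

```python
import os

MAX_BYTES = 10**9  # assumption: "1 GB" read as 10^9 bytes
SUPPORTED = {"mp3", "wav", "flac", "m4a", "ogg", "opus", "webm"}
MAX_AUDIO_SECONDS_PER_MINUTE = 60 * 60  # 60 minutes of audio per minute


def check_upload(filename, size_bytes):
    """Return a list of problems; an empty list means the file looks OK."""
    problems = []
    ext = os.path.splitext(filename)[1].lower().lstrip(".")
    if ext not in SUPPORTED:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    return problems


def fits_rate_limit(submitted_s_this_minute, next_duration_s):
    """True if one more clip stays within the per-minute audio budget."""
    used = sum(submitted_s_this_minute)
    return used + next_duration_s <= MAX_AUDIO_SECONDS_PER_MINUTE
```

For a real file, `check_upload(path, os.path.getsize(path))` gives the same check against the size on disk.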
Best for
- Voice-note triage for messaging-bot agents
- Call transcription and meeting recap pipelines
- Multilingual content moderation
- Subtitling, dubbing prep, and translation pipelines
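For the subtitling case, word-level timestamps can be grouped into SubRip (SRT) cues. The sketch below assumes the response carries a `words` list shaped like the OpenAI `verbose_json` output (`word`, `start`, and `end` in seconds); the fixed seven-words-per-cue grouping is an arbitrary choice for illustration.

```python
def srt_timestamp(seconds):
    """Format seconds as the HH:MM:SS,mmm form used by SRT."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Group timestamped words into numbered SRT cues."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(w["word"].strip() for w in chunk)
        start = srt_timestamp(chunk[0]["start"])
        end = srt_timestamp(chunk[-1]["end"])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)
```

A production subtitler would more likely break cues on pauses or punctuation than on a fixed word count.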
Upstream source: github.com/openai/whisper
Request Whisper Large v3 access
Get an API key.
Straight pay-per-use against the published rate. No deposit, no minimums. Tell us what you're building and we'll send your API key and endpoint URL within one working day.