Speech-to-text · STT

Whisper Large v3

Multilingual speech-to-text — 99 languages, including Nepali.

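Whisper Large v3 is the third-generation release of OpenAI’s open Whisper model: robust, multilingual speech recognition. The original Whisper release was trained on 680k hours of multilingual and multitask supervised data from the web; OpenAI’s model card for large-v3 reports roughly 1 million hours of weakly labeled audio plus 4 million hours of pseudo-labeled audio.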

We host it on Kathmandu GPUs with two endpoint shapes:

  • /v1/audio/transcriptions: OpenAI-compatible drop-in. Send a file or a URL; get back word-level timestamps, the auto-detected language, and an optional diarization track.
  • /v1/audio/translations: translates any of the 99 supported source languages directly into English.
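
As a rough illustration of the two endpoint shapes, the sketch below builds a request for either path and uploads a local file. The form field names (file, model, language, response_format, timestamp_granularities[]) are those of the OpenAI audio API that the endpoint is described as compatible with. The base URL and model id are placeholders, not values from this page, and the sketch does not cover the URL-input or diarization options because their parameter names are not given here.

```python
from typing import Optional

# Assumptions (not stated on this page): the base URL and the model id.
BASE_URL = "https://api.example.invalid"  # hypothetical host
MODEL_ID = "whisper-large-v3"             # hypothetical model identifier


def build_request(task: str, api_key: str, word_timestamps: bool = False,
                  language: Optional[str] = None) -> dict:
    """Return the URL, headers, and form fields for one audio request."""
    if task not in ("transcriptions", "translations"):
        raise ValueError("task must be 'transcriptions' or 'translations'")
    # Form fields as (name, value) pairs, following the OpenAI audio API.
    data = [("model", MODEL_ID)]
    if task == "transcriptions":
        if language:
            data.append(("language", language))  # omit to rely on auto-detect
        if word_timestamps:
            # In the OpenAI API, word timings require the verbose_json format.
            data.append(("response_format", "verbose_json"))
            data.append(("timestamp_granularities[]", "word"))
    return {
        "url": f"{BASE_URL}/v1/audio/{task}",
        "headers": {"Authorization": f"Bearer {api_key}"},
        "data": data,
    }


def send(request: dict, audio_path: str) -> dict:
    """Upload the audio file as multipart form data and return parsed JSON."""
    import requests  # third-party dependency: pip install requests
    with open(audio_path, "rb") as audio:
        response = requests.post(
            request["url"],
            headers=request["headers"],
            data=request["data"],
            files={"file": audio},
            timeout=600,
        )
    response.raise_for_status()
    return response.json()
```

A call would look like send(build_request("transcriptions", api_key, word_timestamps=True), "voice-note.ogg"). Keeping request construction separate from the upload makes the fields easy to inspect before any audio leaves the machine.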

When to pick it

  • Anything an agent needs to listen to: voice messages, calls, voice notes, meeting recordings, podcast clips
  • Multilingual workflows where you cannot pin the input language ahead of time
  • Pipelines where you want to chain transcription into a hosted-LLM call in the same private network round trip

Pricing

EUR 0.10 per hour of audio, about a third of the public OpenAI Whisper API rate. Usage is billed in 1-second increments after the first 30 seconds. The first 60 minutes per tenant per month are free as a try-it allowance.
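
To make the arithmetic concrete, here is a small estimator. It is a sketch under one reading of the billing rule, not the billing system itself: it assumes each request is charged for at least 30 seconds and rounded up to whole seconds beyond that, and that the 60 free minutes are deducted from billed seconds once per month. The function and constant names are illustrative.

```python
import math
from fractions import Fraction

RATE_EUR_PER_HOUR = Fraction(10, 100)   # EUR 0.10 per hour of audio
MIN_BILLED_SECONDS = 30                 # assumption: 30-second minimum per request
FREE_SECONDS_PER_MONTH = 60 * 60        # first 60 minutes per tenant per month


def billed_seconds(duration_seconds: float) -> int:
    """Round one request up to whole seconds, with the assumed 30 s minimum."""
    return max(MIN_BILLED_SECONDS, math.ceil(duration_seconds))


def monthly_cost_eur(request_durations_seconds) -> float:
    """Estimate one tenant's monthly charge from per-request audio lengths."""
    total = sum(billed_seconds(d) for d in request_durations_seconds)
    chargeable = max(0, total - FREE_SECONDS_PER_MONTH)
    return float(chargeable * RATE_EUR_PER_HOUR / 3600)


# Eleven one-hour recordings: one hour is free, ten are charged.
print(monthly_cost_eur([3600] * 11))  # 1.0
```

Under these assumptions a 5-second voice note counts as 30 billed seconds, so workloads made of many very short clips cost more per hour of audio than the headline rate suggests.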

Limits

  • Per-tenant rate limit: 60 minutes of audio per minute (60× real-time)
  • File size limit: 1 GB per request
  • Supported formats: mp3, wav, flac, m4a, ogg, opus, webm
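
A client can catch format and size violations before uploading. The check below is a minimal sketch: it assumes "1 GB" means 10**9 bytes (the stricter reading) and that the format is judged by file extension, neither of which this page specifies. The rate limit is not modelled, since how the service meters it is not described here.

```python
import os

SUPPORTED_FORMATS = {"mp3", "wav", "flac", "m4a", "ogg", "opus", "webm"}
MAX_FILE_BYTES = 1_000_000_000  # assumption: 1 GB read as 10**9 bytes


def validate_upload(filename: str, size_bytes: int) -> list:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    # Compare the extension case-insensitively, without the leading dot.
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    if extension not in SUPPORTED_FORMATS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    return problems
```

For a file on disk, pass os.path.getsize(path) as size_bytes. Returning every problem at once, rather than raising on the first, lets a batch pipeline log all rejected files in one pass.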

Best for

  • Voice-note triage for messaging-bot agents
  • Call transcription and meeting recap pipelines
  • Multilingual content moderation
  • Subtitling, dubbing prep, and translation pipelines

Upstream source: github.com/openai/whisper