Speech-to-text · STT

Qwen3-ASR

Apache 2.0 STT with 52-language coverage — the wider-language alternative to Whisper.

Qwen3-ASR is Alibaba’s open-weight automatic speech recognition model from the Qwen family — 1.7B parameters, Apache 2.0, covering 52 languages. We host it alongside Whisper Large v3 as the cost-sensitive, attribution-free alternative.

When to pick Qwen3-ASR over Whisper

  • License hygiene matters. Whisper is MIT, which is close to Apache 2.0 in practice, but some procurement teams prefer Apache 2.0 for its explicit patent grant, which MIT lacks. Qwen3-ASR is the Apache option.
  • Cost-sensitive batch transcription. EUR 0.07/hour vs Whisper’s EUR 0.10/hour, a 30% saving. At one million hours of audio, that is EUR 70,000 instead of EUR 100,000.
  • Latency budget is tight. At 1.7B parameters vs Whisper Large v3’s 1.55B, the two models are similar in size; the architectural choices in Qwen3-ASR give it a small first-byte-latency edge in streaming mode.
  • Your workload is heavy in Asian languages. Qwen3-ASR’s training set has stronger coverage of Chinese, Japanese, Korean, Hindi, and other South / East Asian languages than Whisper’s.

When to pick Whisper instead

  • Maximum language coverage. Whisper supports 99 languages, Qwen3-ASR 52. If you need the long tail of low-resource languages, Whisper wins.
  • Mature community tooling. Whisper’s ecosystem (timestamps, diarization, translation-to-English, fine-tuning recipes) is more mature.
  • Reference benchmark numbers. Most academic and product benchmarks reference Whisper; staying on the same model simplifies comparison.
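
The two lists above can be read as a simple routing rule. The sketch below is one possible encoding, not a description of how our platform routes requests: the model identifiers, the Job fields, and the fallback to Whisper when no criterion applies are all assumptions made for illustration. Whether a language is among the 52 that Qwen3-ASR covers is left to the caller, since the full list is not reproduced here.

```python
from dataclasses import dataclass

# Hypothetical model identifiers, used only in this sketch.
QWEN = "qwen3-asr"
WHISPER = "whisper-large-v3"

# Languages the text names as Qwen3-ASR strengths (ISO 639-1 codes).
QWEN_STRONG = {"zh", "ja", "ko", "hi"}


@dataclass
class Job:
    language: str                    # ISO 639-1 code of the audio
    qwen_supported: bool             # caller checks this against the 52-language list
    needs_translation: bool = False  # translation-to-English is a Whisper ecosystem strength
    cost_sensitive: bool = False     # EUR 0.07/hour vs EUR 0.10/hour matters for this job


def pick_model(job: Job) -> str:
    """Return a model identifier following the criteria listed above."""
    # Long-tail languages and translation are the cases where Whisper wins.
    if not job.qwen_supported or job.needs_translation:
        return WHISPER
    # Asian-language workloads and cost-sensitive batches favour Qwen3-ASR.
    if job.language in QWEN_STRONG or job.cost_sensitive:
        return QWEN
    # ASSUMPTION: with no deciding criterion, fall back to the benchmark reference model.
    return WHISPER
```

For example, pick_model(Job("ja", qwen_supported=True)) selects Qwen3-ASR, while a language outside the 52 always falls through to Whisper.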

Pricing

EUR 0.07 per hour of audio. Billed in 1-second increments after the first 30 seconds; the first 60 minutes per tenant per month are free.
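
To estimate a monthly bill from these rules, a small calculation like the following may help. It is a sketch under stated assumptions rather than our billing implementation: it reads "after the first 30 seconds" as a 30-second minimum charge per request, rounds partial seconds up, and deducts the free 60 minutes from the billed total. The function names are illustrative.

```python
import math
from decimal import Decimal

RATE_EUR_PER_HOUR = Decimal("0.07")  # price stated above
MIN_BILLED_SECONDS = 30              # ASSUMPTION: 30-second minimum per request
FREE_SECONDS_PER_MONTH = 60 * 60     # first 60 minutes per tenant per month


def billed_seconds(duration_seconds: float) -> int:
    """Billable seconds for one request.

    ASSUMPTION: partial seconds round up, with a 30-second floor.
    """
    return max(MIN_BILLED_SECONDS, math.ceil(duration_seconds))


def monthly_cost_eur(durations_seconds: list[float]) -> Decimal:
    """Estimated monthly cost in EUR for a list of request durations."""
    total = sum(billed_seconds(d) for d in durations_seconds)
    # ASSUMPTION: the free allowance is deducted from billed seconds.
    chargeable = max(0, total - FREE_SECONDS_PER_MONTH)
    cost = RATE_EUR_PER_HOUR * chargeable / 3600
    return cost.quantize(Decimal("0.01"))
```

Under these assumptions, 101 one-hour files in a month leave 100 chargeable hours, so monthly_cost_eur([3600] * 101) gives EUR 7.00.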

Limits

  • Per-tenant rate limit: 60 minutes of audio per minute (60× real-time)
  • File size limit: 1 GB per request
  • Supported formats: mp3, wav, flac, m4a, ogg, opus, webm
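
A client-side pre-flight check can catch requests that would exceed these limits before any audio is uploaded. The sketch below uses only the Python standard library; the helper names are illustrative, "1 GB" is read conservatively as 10**9 bytes (an assumption), and the format check looks only at the file extension, not the actual container.

```python
from pathlib import Path

MAX_BYTES = 1_000_000_000  # ASSUMPTION: "1 GB" read as 10**9 bytes
SUPPORTED = {".mp3", ".wav", ".flac", ".m4a", ".ogg", ".opus", ".webm"}
AUDIO_MINUTES_PER_MINUTE = 60  # per-tenant rate limit stated above


def check_upload(path: Path) -> list[str]:
    """Return a list of problems; an empty list means the file passes.

    Raises FileNotFoundError if the file does not exist.
    """
    problems = []
    if path.suffix.lower() not in SUPPORTED:
        problems.append(f"unsupported format: {path.suffix or '(none)'}")
    size = path.stat().st_size
    if size > MAX_BYTES:
        problems.append(f"file too large: {size} bytes > {MAX_BYTES}")
    return problems


def min_wall_clock_minutes(total_audio_minutes: float) -> float:
    """Shortest time a batch can take at the 60x real-time rate limit."""
    return total_audio_minutes / AUDIO_MINUTES_PER_MINUTE
```

At the stated rate limit, a backlog of 600 minutes of audio needs at least 10 minutes of wall-clock time, regardless of how many requests run in parallel.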

Why we picked it as the second STT

We surveyed the field — NVIDIA Canary-Qwen 2.5B has slightly higher accuracy but uses CC-BY-4.0 (attribution required per-use). IBM Granite Speech 3.3 is Apache 2.0 but only covers 8 languages well. NVIDIA Parakeet TDT is the throughput champion but again CC-BY-4.0. Qwen3-ASR’s combination of Apache 2.0 + 52 languages + sub-Whisper pricing made it the clean choice for our second STT slot.

Best for

  • Multilingual transcription where Apache 2.0 attribution-free licensing matters
  • Cost-sensitive batch transcription where Whisper is the more expensive choice
  • Latency-sensitive streaming workloads where a small first-byte-latency edge over Whisper Large v3 matters
  • South / East Asian languages where Qwen's training set has strong coverage

Upstream source: github.com/QwenLM/Qwen3-ASR