sunmoon.dev

How to Choose the Right ASR Model for Your Product

Seunghun Lee
Tags: ASR, speech to text, models, transcribe.so

You're building a product that turns speech into text, and you've hit the question that has no clean answer: which automatic speech recognition (ASR) model do you actually ship? There are a dozen credible options, every vendor quotes a word error rate that looks great on their own benchmark, and none of those numbers survive contact with your real audio.

I've spent the last two years living inside this problem. I run transcribe.so, a product whose entire reason to exist is producing accurate transcripts at a price that doesn't bankrupt me, and goodlisten.co, which leans on the same speech stack. Before that I was an engineer at Spotify and Klarna and spent time at a Y Combinator–backed startup, where I learned the expensive way that the "best" model on a leaderboard is rarely the right model for a shipping product.

Here's how I actually choose.

Stop optimizing for WER alone

Word error rate (WER) is the metric everyone leads with, and it's the one that misleads the most. WER is the share of words a model gets wrong on some dataset: substitutions, deletions, and insertions, divided by the number of words in the reference transcript. That dataset is usually clean, read-aloud English recorded in a studio. Your users are not in a studio. They're on a phone in a car, on a Zoom call with three people talking over each other, speaking accented English or switching languages mid-sentence.

A model with a 4% WER on LibriSpeech can post a 15% WER on your messy meeting audio. So WER matters, but only when measured on audio that looks like yours.

The single highest-leverage thing you can do before picking a model: build a 30–60 minute evaluation set from your own real recordings, transcribe each one by hand, and score every candidate model against that. Two days of work that saves you two months of regret.
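
Scoring a candidate against that eval set needs nothing more than a word-level edit distance. The following is a minimal sketch, not a reference implementation: the function names are mine, and the normalization (lowercasing and stripping ASCII punctuation) is an assumption you should adapt to how your product displays text.

```python
import string

# Assumed normalization: remove ASCII punctuation before comparing words.
_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> list[str]:
    """Lowercase, drop ASCII punctuation, split on whitespace."""
    return text.lower().translate(_PUNCT).split()


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance: substitutions + deletions + insertions."""
    prev = list(range(len(hyp) + 1))  # empty reference vs hyp[:j]: j insertions
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)  # ref[:i] vs empty hypothesis: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
            )
        prev = curr
    return prev[-1]


def corpus_wer(pairs: list[tuple[str, str]]) -> float:
    """Pooled WER over (reference, hypothesis) pairs; long clips weigh more."""
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref, hyp = normalize(reference), normalize(hypothesis)
        errors += edit_distance(ref, hyp)
        ref_words += len(ref)
    if ref_words == 0:
        raise ValueError("eval set has no reference words")
    return errors / ref_words
```

Run `corpus_wer` once per candidate model over the same hand-made transcripts. Splitting on whitespace does not work for languages written without spaces between words, such as Mandarin or Japanese; character error rate is the usual substitute there.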

Once you have that eval set, WER becomes one input among five. The other four usually decide the outcome.

The five dimensions that actually matter

When I evaluate an ASR model for a product, I score it across five axes, not one:

  • Accuracy (WER) on my audio — measured against my own eval set, never the vendor's benchmark.
  • Language coverage — which languages, and how gracefully it handles code-switching and accents.
  • Cost per hour — the number that determines whether the unit economics work at scale.
  • Diarization — can it tell speakers apart, and how reliably?
  • Latency — does the use case need real-time, or is batch fine?

The trap is treating these as independent. They aren't. A cheaper model with worse diarization might be the right call if your product never needs speaker labels. A slightly-less-accurate model that runs in real time wins outright if you're building live captions. The "best" model is the one that maximizes the axes your product depends on while staying inside your cost ceiling.

Accuracy is contextual

GPT-4o Transcribe is, in my testing, the strongest general-purpose option for clean-to-moderate English and the major European languages. It's the model I reach for when accuracy is the whole game and the audio isn't pathological. But it's also the one I have to watch on cost.

Language coverage is where most products break

If you serve a global user base, coverage stops being a footnote. Qwen3-ASR-Flash is the one I lean toward for broad multilingual work and Asian languages: it handles Mandarin, Japanese, and Korean noticeably better than the Western-trained models, and it's aggressive on price. When I tested it against my own non-English eval clips, it closed a gap that GPT-4o Transcribe couldn't.

Diarization and latency decide architecture

Voxtral sits in an interesting spot: open-weight, deployable on your own infrastructure, with solid latency characteristics that make it viable for streaming. If your compliance story requires keeping audio on your own servers, or you need to fine-tune, an open model stops being a nice-to-have and becomes the only acceptable answer.

A side-by-side comparison

Here's how the three models I evaluate most often stack up. Treat these as directional — the exact numbers shift with your audio, and you should confirm against your own eval set.

| Dimension | GPT-4o Transcribe | Qwen3-ASR-Flash | Voxtral |
|---|---|---|---|
| Best at | Clean English + major EU languages | Broad multilingual, Asian languages | Self-hosted, streaming, fine-tuning |
| Relative WER (clean) | Excellent | Very good | Good |
| Language coverage | Wide | Widest, strong on CJK | Moderate, growing |
| Cost per hour | Higher | Low | Infra cost only (self-hosted) |
| Diarization | Via pipeline | Via pipeline | Via pipeline |
| Latency | API-bound | API-bound, fast | Lowest (local), streaming-capable |
| Deployment | API only | API only | Open weights, self-host |

None of these is a clean winner. That's the point. GPT-4o Transcribe wins on raw English accuracy, Qwen3-ASR-Flash wins on coverage and price, and Voxtral wins on control and latency. Your product picks the winner, not the leaderboard.

Why I stopped picking one model

The uncomfortable truth I arrived at while building transcribe.so: there is no single best ASR model, and committing to one means accepting its worst case on every job.

A two-minute English voicemail and a 90-minute multilingual panel discussion are not the same problem. Sending both to the same model means either overpaying for the easy job or under-delivering on the hard one. So I stopped choosing one. The platform looks at each piece of audio — language, length, whether speaker labels are needed, how clean the signal is — and routes it to the model that handles that profile best. Cheap, fast model for the easy jobs; the heavyweight only where it earns its cost.

That auto-routing is the actual product decision. The model choice isn't a one-time selection; it's a function evaluated per request.

What routing looks like in practice

  • Short, clean, single-speaker English → cheapest model that clears your accuracy bar. Don't pay for capability you won't use.
  • Multilingual or accented → coverage-first model, even at a small accuracy trade-off.
  • Multi-speaker meetings → whichever pairing gives the cleanest diarization, since speaker errors are more damaging than the occasional wrong word.
  • Real-time captions → latency wins, full stop. A 2% accuracy gain is worthless if it arrives three seconds late.

You can build this routing layer yourself — it's a classifier plus a dispatch table — or you can use a platform like transcribe.so that already does it and absorbs the model churn on your behalf. New ASR models ship every few months; a routing layer means a better model is a config change, not a rewrite.
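
As a rough illustration of "a classifier plus a dispatch table", here is a minimal sketch. Everything in it is hypothetical: the `AudioProfile` fields, the five-minute cutoff, the order in which the rules are checked, and the placeholder model names in `ROUTES`. It is not how transcribe.so routes internally. A real classifier would also have to infer these fields from the audio rather than receive them.

```python
from dataclasses import dataclass

SHORT_CLIP_SECONDS = 300  # hypothetical cutoff for a "short" job


@dataclass(frozen=True)
class AudioProfile:
    language: str                    # detected language code, e.g. "en"
    duration_s: float
    speakers: int
    clean_signal: bool
    accented_or_mixed: bool = False  # accents or code-switching detected
    realtime: bool = False           # caller needs live captions


def classify(p: AudioProfile) -> str:
    """Map a profile to a route key. Rule order is an assumption."""
    if p.realtime:
        return "realtime"            # latency beats every other axis
    if p.speakers > 1:
        return "multi_speaker"       # diarization quality dominates
    if p.language != "en" or p.accented_or_mixed:
        return "multilingual"        # coverage-first
    if p.clean_signal and p.duration_s <= SHORT_CLIP_SECONDS:
        return "short_clean_english"
    return "default"


# Dispatch table: swapping in a newer model is a one-line change here.
ROUTES = {
    "realtime": "lowest-latency-model",
    "multi_speaker": "best-diarization-pairing",
    "multilingual": "coverage-first-model",
    "short_clean_english": "cheapest-passing-model",
    "default": "default-model",
}


def route(p: AudioProfile) -> str:
    """Return the placeholder model name chosen for this request."""
    return ROUTES[classify(p)]
```

For example, `route(AudioProfile("en", 120, 1, True))` returns `"cheapest-passing-model"`. Keeping `classify` separate from `ROUTES` is the design choice that matters: the rules change when your product changes, while the table changes when the model market does.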

How I'd choose, in order

If I were starting fresh on your product tomorrow:

  1. Build the eval set first. 30–60 minutes of your real audio, hand-transcribed. Non-negotiable.
  2. Define your hard constraints. Cost ceiling per hour, latency requirement, data-residency rules. These eliminate options before accuracy even enters the conversation.
  3. Score 2–3 candidates on your eval set across all five axes, not just WER.
  4. Pick a default, then add routing the moment a second use case appears with a different profile.

That sequence has never steered me wrong, and it's the same discipline I carried out of Spotify and Klarna: measure on production-like data, let constraints do the heavy lifting, and don't fall in love with a benchmark.
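
Steps 2 and 3 can be sketched as a filter followed by a weighted ranking. This is one possible way to combine the five axes, not a method the rest of this post depends on. The candidate names, every number, and the weights below are invented placeholders; real values have to come from your own eval set and pricing.

```python
from dataclasses import dataclass

AXES = ("accuracy", "coverage", "cost", "diarization", "latency")


@dataclass
class Candidate:
    name: str
    cost_per_hour: float           # in your billing currency
    latency_s: float               # measured on your own audio
    self_hostable: bool
    axis_scores: dict[str, float]  # 0..1 per axis, higher is better


@dataclass
class Constraints:
    max_cost_per_hour: float
    max_latency_s: float
    require_self_host: bool = False  # e.g. a data-residency rule


def passes(c: Candidate, k: Constraints) -> bool:
    """Hard constraints eliminate a candidate before any scoring."""
    if c.cost_per_hour > k.max_cost_per_hour:
        return False
    if c.latency_s > k.max_latency_s:
        return False
    return c.self_hostable or not k.require_self_host


def rank(candidates, constraints, weights):
    """Drop candidates that fail a constraint, then sort by weighted score."""
    def total(c: Candidate) -> float:
        return sum(weights[a] * c.axis_scores.get(a, 0.0) for a in AXES)

    survivors = [c for c in candidates if passes(c, constraints)]
    return sorted(survivors, key=total, reverse=True)


# Placeholder data, not measurements of any real model.
candidates = [
    Candidate("model-a", 0.40, 2.0, False,
              {"accuracy": 0.9, "coverage": 0.7, "cost": 0.3,
               "diarization": 0.5, "latency": 0.5}),
    Candidate("model-b", 0.10, 1.5, False,
              {"accuracy": 0.8, "coverage": 0.9, "cost": 0.9,
               "diarization": 0.5, "latency": 0.6}),
    Candidate("model-c", 0.05, 8.0, True,
              {"accuracy": 0.7, "coverage": 0.5, "cost": 1.0,
               "diarization": 0.5, "latency": 0.1}),
]
weights = {"accuracy": 0.5, "coverage": 0.1, "cost": 0.2,
           "diarization": 0.1, "latency": 0.1}
ranked = rank(candidates, Constraints(0.50, 3.0), weights)
```

With these placeholder inputs, `model-c` is removed by the latency ceiling before its scores are ever looked at, which is the point of putting constraints ahead of accuracy.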

Frequently Asked Questions

Is a lower WER always better?

No. WER is only meaningful when measured on audio that resembles what your users actually produce. A model with a stellar WER on clean read-aloud speech can fall apart on noisy, multi-speaker, or accented recordings. Always score candidates against your own evaluation set, and weigh WER alongside language coverage, cost, diarization, and latency rather than in isolation.

Should I use a hosted API or self-host an open model like Voxtral?

It depends on your constraints, not your preferences. Hosted APIs like GPT-4o Transcribe and Qwen3-ASR-Flash get you to production fastest with no infrastructure to run. Self-hosting an open-weight model like Voxtral makes sense when you need data residency, the lowest possible latency, or the ability to fine-tune — and you're willing to own the operational cost.

Do I really need to support multiple models?

If your product only ever handles one kind of audio, a single well-chosen model is fine. But the moment you have meaningfully different profiles — short English clips and long multilingual recordings, say — one model forces a compromise on every job. Routing each request to the best-fit model is what keeps accuracy high and cost low at scale.

How do I handle diarization across these models?

None of these models does perfect end-to-end diarization on its own; in practice you pair the transcription model with a diarization step in your pipeline. The quality of that pairing matters more than the base model's raw WER for any multi-speaker use case, because a mislabeled speaker is more confusing to a reader than an occasional wrong word.
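
One common way to do that pairing is to give each transcribed word the speaker whose turn overlaps it most in time. The sketch below assumes two things that vary by provider: that the transcription step returns word-level timestamps, and that the diarization step returns speaker turns with start and end times. The `Word` and `Turn` types are hypothetical stand-ins for whatever your tools actually emit.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds
    end: float


@dataclass(frozen=True)
class Turn:
    speaker: str
    start: float
    end: float


def overlap(word: Word, turn: Turn) -> float:
    """Seconds during which the word and the speaker turn coincide."""
    return max(0.0, min(word.end, turn.end) - max(word.start, turn.start))


def assign_speakers(words: list[Word], turns: list[Turn]) -> list[tuple[str, str]]:
    """Label each word with the speaker of the most-overlapping turn."""
    labeled = []
    for word in words:
        best = max(turns, key=lambda t: overlap(word, t), default=None)
        if best is None or overlap(word, best) == 0.0:
            speaker = "unknown"  # word falls outside every turn
        else:
            speaker = best.speaker
        labeled.append((speaker, word.text))
    return labeled


def to_lines(labeled: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Merge consecutive words from the same speaker into one line."""
    return [
        (speaker, " ".join(text for _, text in group))
        for speaker, group in groupby(labeled, key=lambda item: item[0])
    ]
```

Words that straddle a turn boundary or land in overlapping speech are where this simple rule mislabels speakers, so multi-speaker clips in your eval set are worth checking by eye as well as by WER.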


If you're weighing these trade-offs for a real product and would rather not run the whole evaluation alone, book a call and I'll walk you through it.
