What makes sunmoon.dev different from an agency?

You work directly with me — a SaaS founder who builds and operates his own AI products (transcribe.so, goodlisten.co). There's no account manager, no junior bench, and no hand-off. The person who scopes your build is the person who ships it.

How long does it take to build an AI application?

It depends on complexity. A basic AI agent can ship in 4–6 weeks; more involved solutions with custom features take 3–6 months. You'll get a concrete timeline on the first call before any commitment.

Do you provide ongoing support after launch?

Yes. Every build includes 6 months of email support, and I offer a discounted on-call package for extended maintenance and feature work as you scale.

What technologies do you use?

Industry-standard, mostly open tooling to avoid vendor lock-in: Next.js and Node for the app layer, Postgres/Supabase for data, and the right AI models picked per task (OpenAI, Qwen, Mistral, and others). Exact stack is chosen around your requirements.

How do you handle data privacy and security?

Encryption in transit and at rest, scoped access, and compliance-aware design. Where it matters, solutions can be deployed on your own infrastructure so you keep full control of your data.

How is pricing structured?

Transparent and upfront. Consulting starts at $200/hour; AI agent and full-stack SaaS builds start at $10,000, with final cost depending on scope. You get a detailed quote after the first call.

How to Choose the Right ASR Model for Your Product

You're building a product that turns speech into text, and you've hit the question that has no clean answer: which ASR model do you actually ship? There are a dozen credible options, every vendor quotes a word error rate that looks great in their own benchmark, and none of those numbers survive contact with your real audio.

I've spent the last two years living inside this problem. I run transcribe.so, a product whose entire reason to exist is producing accurate transcripts at a price that doesn't bankrupt me, and goodlisten.co, which leans on the same speech stack. Before that I was an engineer at Spotify and Klarna and spent time at a Y Combinator–backed startup, where I learned the expensive way that the "best" model on a leaderboard is rarely the right model for a shipping product.

Here's how I choose.

Stop optimizing for WER alone

Word error rate is the metric everyone leads with, and it's the one that misleads the most. WER is the share of words a model got wrong (substitutions, deletions, and insertions, divided by the number of words in the reference transcript) on some dataset, usually clean, read-aloud English recorded in a studio. Your users are not in a studio. They're on a phone in a car, on a Zoom call with three people talking over each other, speaking accented English or switching languages mid-sentence.

A model with a 4% WER on LibriSpeech can post a 15% WER on your messy meeting audio. So WER matters, but only when measured on audio that looks like yours.

The single highest-leverage thing you can do before picking a model: build a 30–60 minute evaluation set from your own real recordings, transcribe each one by hand, and score every candidate model against that. Two days of work that saves you two months of regret.
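Scoring a candidate against that eval set can be sketched in a few lines. This is a minimal illustration, not a production scorer: the function names are mine, and the normalization step (lowercase plus whitespace split) is an assumption that real evaluations usually extend with punctuation and number handling.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance: substitutions + deletions + insertions."""
    # prev[j]: distance between the reference prefix processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1]


def normalize(text):
    # Assumption: lowercase + whitespace split only.
    return text.lower().split()


def corpus_wer(pairs):
    """WER over an eval set: total word errors / total reference words."""
    errors = 0
    ref_len = 0
    for reference, hypothesis in pairs:
        ref_words = normalize(reference)
        errors += edit_distance(ref_words, normalize(hypothesis))
        ref_len += len(ref_words)
    if ref_len == 0:
        raise ValueError("eval set has no reference words")
    return errors / ref_len
```

Feed it a list of (hand transcript, model output) pairs per candidate. Summing errors over the whole set, rather than averaging per-clip WER, keeps a handful of very short clips from dominating the result.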

Once you have that eval set, WER becomes one input among five. The other four usually decide the outcome.

The five axes I score

Every candidate model gets scored across five axes:

Accuracy (WER) on my audio — measured against my own eval set, never the vendor's benchmark.
Language coverage — which languages, and how gracefully it handles code-switching and accents.
Cost per hour — the number that determines whether the unit economics work at scale.
Diarization — can it tell speakers apart, and how reliably?
Latency — does the use case need real-time, or is batch fine?

The trap is treating these as independent. They aren't. A cheaper model with worse diarization might be the right call if your product never needs speaker labels. A slightly less accurate model that runs in real time wins outright if you're building live captions. The "best" model is the one that maximizes the axes your product depends on while staying inside your cost ceiling.
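One way to make that trade-off explicit is a weighted score with the cost ceiling applied first. The sketch below is illustrative only: the 0-to-1 axis scores, the weights, and the `Candidate` type are assumptions you would replace with numbers from your own evaluation.

```python
from dataclasses import dataclass

AXES = ("accuracy", "coverage", "cost", "diarization", "latency")


@dataclass(frozen=True)
class Candidate:
    name: str
    scores: dict          # axis -> score in [0, 1], higher is better
    cost_per_hour: float  # in your billing currency


def weighted_score(candidate, weights):
    """Weighted average over the five axes; a missing weight counts as 0."""
    total = sum(weights.get(axis, 0.0) for axis in AXES)
    if total <= 0:
        raise ValueError("at least one axis needs a positive weight")
    return sum(weights.get(axis, 0.0) * candidate.scores[axis] for axis in AXES) / total


def pick(candidates, weights, cost_ceiling):
    """Drop anything over the cost ceiling, then take the best weighted score."""
    affordable = [c for c in candidates if c.cost_per_hour <= cost_ceiling]
    if not affordable:
        return None
    return max(affordable, key=lambda c: weighted_score(c, weights))
```

A live-captions product might weight latency at 4 and diarization at 0, while a meeting-notes product would do roughly the opposite. The same candidates can rank differently under each set of weights.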

Accuracy is contextual

GPT-4o Transcribe is, in my testing, the strongest general-purpose option for clean-to-moderate English and the major European languages. It's the model I reach for when accuracy is the whole game and the audio isn't pathological. But it's also the one I have to watch on cost.

Language coverage is where most products break

If you serve a global user base, coverage stops being a footnote. Qwen3-ASR-Flash is the one I lean toward for broad multilingual work and Asian languages: it handles Mandarin, Japanese, and Korean noticeably better than the Western-trained models, and it's aggressive on price. Korean is the one language I can judge natively, and on my Korean eval clips it closed a gap GPT-4o Transcribe couldn't.

Diarization and latency decide architecture

Voxtral sits in an interesting spot: open-weight, deployable on your own infrastructure, with solid latency characteristics that make it viable for streaming. If your compliance story requires keeping audio on your own servers, or you need to fine-tune, an open model stops being a nice-to-have and becomes the only acceptable answer.

A side-by-side comparison

Here's how the three models I evaluate most often stack up. Treat these as directional; the exact ranking shifts with your audio.

Dimension	GPT-4o Transcribe	Qwen3-ASR-Flash	Voxtral
Best at	Clean English + major EU languages	Broad multilingual, Asian languages	Self-hosted, streaming, fine-tuning
Accuracy on clean audio (relative WER)	Excellent	Very good	Good
Language coverage	Wide	Widest, strong on CJK	Moderate, growing
Cost per hour	Higher	Low	Infra cost only (self-hosted)
Diarization	Via pipeline	Via pipeline	Via pipeline
Latency	API-bound	API-bound, fast	Lowest (local), streaming-capable
Deployment	API only	API only	Open weights, self-host

There's no clean winner, which is the point. GPT-4o takes raw English accuracy, Qwen takes coverage and price, Voxtral takes control and latency. The tiebreaker is your product, not a leaderboard.

Why I stopped picking one model

Running transcribe.so eventually forced a conclusion I resisted for a while: there is no single best ASR model, and committing to one means accepting its worst case on every job.

A two-minute English voicemail and a 90-minute multilingual panel discussion are not the same problem. Sending both to the same model means either overpaying for the easy job or under-delivering on the hard one. So I stopped choosing one. The platform looks at each piece of audio — language, length, whether speaker labels are needed, how clean the signal is — and routes it to the model that handles that profile best. Cheap, fast model for the easy jobs; the heavyweight only where it earns its cost.

That routing is the real product decision. "Which model?" stopped being a question I answer once and became one the system answers per request.
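The "overpaying for the easy job" point is plain arithmetic. The workload split and prices below are hypothetical placeholders, not real vendor pricing or figures from transcribe.so; swap in your own hours and rates.

```python
def monthly_cost_cents(hours_by_profile, cents_per_hour, route):
    """Sum hours x price for the model each audio profile is routed to."""
    return sum(
        hours * cents_per_hour[route[profile]]
        for profile, hours in hours_by_profile.items()
    )


# Hypothetical workload and prices, for illustration only.
hours_by_profile = {"short_clean_english": 800, "long_multilingual": 200}
cents_per_hour = {"cheap_model": 10, "heavyweight_model": 40}

# Option 1: everything goes to the heavyweight model.
single_model = {profile: "heavyweight_model" for profile in hours_by_profile}

# Option 2: only the hard profile goes to the heavyweight model.
routed = {
    "short_clean_english": "cheap_model",
    "long_multilingual": "heavyweight_model",
}

print(monthly_cost_cents(hours_by_profile, cents_per_hour, single_model))  # 40000
print(monthly_cost_cents(hours_by_profile, cents_per_hour, routed))        # 16000
```

With these made-up numbers, routing cuts the bill by more than half without touching the model used on the hard jobs. The size of the saving depends entirely on your own mix of easy and hard audio.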

What routing looks like in practice

Short, clean, single-speaker English → cheapest model that clears your accuracy bar. Don't pay for capability you won't use.
Multilingual or accented → coverage-first model, even at a small accuracy trade-off.
Multi-speaker meetings → whichever pairing gives the cleanest diarization; readers forgive a wrong word faster than a wrong speaker.
Real-time captions → latency wins, full stop. A 2% accuracy gain is worthless if it arrives three seconds late.

You can build this routing layer yourself — it's a classifier plus a dispatch table — or you can use a platform like transcribe.so that already does it and absorbs the model churn on your behalf. New ASR models ship every few months; a routing layer means a better model is a config change, not a rewrite.
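A "classifier plus a dispatch table" can be as small as the sketch below. Everything in it is an assumption for illustration: the `AudioJob` fields, the ten-minute threshold, the first-match rule order, and the placeholder model names. It is not the routing logic transcribe.so runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioJob:
    language: str              # language code detected upstream, e.g. "en"
    duration_s: float
    needs_speaker_labels: bool
    realtime: bool
    clean: bool                # result of some upstream signal-quality check


def classify(job):
    """Map a job to a profile. Rules are checked in order; first match wins."""
    if job.realtime:
        return "realtime"
    if job.needs_speaker_labels:
        return "multi_speaker"
    if job.language != "en":
        return "multilingual"
    if job.clean and job.duration_s <= 600:  # hypothetical threshold
        return "short_clean_english"
    return "default"


# Dispatch table: swapping in a new model is an edit here, not a rewrite.
ROUTES = {
    "realtime": "low-latency-model",
    "multi_speaker": "accurate-model-plus-diarizer",
    "multilingual": "coverage-first-model",
    "short_clean_english": "cheapest-model-that-clears-the-bar",
    "default": "general-purpose-model",
}


def route(job):
    return ROUTES[classify(job)]
```

Keeping the classifier and the table separate is the design choice that matters: the classifier encodes what your product cares about, and the table encodes which model currently serves each need best.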

How I'd choose, in order

If I were starting fresh on your product tomorrow:

1. Build the eval set first. 30–60 minutes of your real audio, hand-transcribed. Non-negotiable.
2. Define your hard constraints. Cost ceiling per hour, latency requirement, data-residency rules. These eliminate options before accuracy even enters the conversation.
3. Score 2–3 candidates on your eval set across all five axes, not just WER.
4. Pick a default, then add routing the moment a second use case appears with a different profile.
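The hard-constraint step can be written down as a filter that also records why each option was dropped. This is a sketch under assumptions: the `Model` and `Constraints` fields are hypothetical, and data residency is simplified to "must be self-hostable".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraints:
    max_cost_per_hour: float
    needs_streaming: bool
    needs_self_hosting: bool  # simplified stand-in for a data-residency rule


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_hour: float
    supports_streaming: bool
    self_hostable: bool


def eliminate(models, constraints):
    """Return (survivors, rejected); rejected maps model name -> list of reasons."""
    survivors = []
    rejected = {}
    for model in models:
        reasons = []
        if model.cost_per_hour > constraints.max_cost_per_hour:
            reasons.append("over cost ceiling")
        if constraints.needs_streaming and not model.supports_streaming:
            reasons.append("no streaming")
        if constraints.needs_self_hosting and not model.self_hostable:
            reasons.append("cannot self-host")
        if reasons:
            rejected[model.name] = reasons
        else:
            survivors.append(model)
    return survivors, rejected
```

Only the survivors go on to be scored on the eval set. Keeping the rejection reasons around is useful later, when a constraint loosens and you want to know which models become viable again.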

It's the same discipline I carried out of Spotify and Klarna: measure on production-like data, and let your constraints do most of the eliminating before taste gets a vote.

Frequently Asked Questions

Is a lower WER always better?

No. WER is only meaningful when measured on audio that resembles what your users actually produce. A model with a stellar WER on clean read-aloud speech can fall apart on noisy, multi-speaker, or accented recordings. Always score candidates against your own evaluation set, and weigh WER alongside cost, latency, and diarization rather than in isolation.

Should I use a hosted API or self-host an open model like Voxtral?

It depends on your constraints, not your preferences. Hosted APIs like GPT-4o Transcribe and Qwen3-ASR-Flash get you to production fastest with no infrastructure to run. Self-hosting an open-weight model like Voxtral makes sense when you need data residency, the lowest possible latency, or the ability to fine-tune — and you're willing to own the operational cost.

Do I really need to support multiple models?

If your product only ever handles one kind of audio, a single well-chosen model is fine. But the moment you have meaningfully different profiles — short English clips and long multilingual recordings, say — one model forces a compromise on every job. Routing each request to the best-fit model is what keeps accuracy high and cost low at scale.

How do I handle diarization across these models?

None of these models does perfect end-to-end diarization on its own; in practice you pair the transcription model with a diarization step in your pipeline. The quality of that pairing matters more than the base model's raw WER for any multi-speaker use case, because a mislabeled speaker is more confusing to a reader than an occasional wrong word.
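One common way to do that pairing is to label each transcript segment with the speaker whose turn overlaps it most in time. The sketch below assumes both steps emit timestamps in seconds; the `Segment` and `Turn` types are hypothetical and do not correspond to any specific vendor's output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:  # output of the transcription step
    start: float
    end: float
    text: str


@dataclass(frozen=True)
class Turn:     # output of the diarization step
    start: float
    end: float
    speaker: str


def overlap_seconds(segment, turn):
    """Length of the time interval shared by a segment and a speaker turn."""
    return max(0.0, min(segment.end, turn.end) - max(segment.start, turn.start))


def label_segments(segments, turns):
    """Assign each segment the speaker with the largest time overlap."""
    labeled = []
    for segment in segments:
        best_speaker, best_overlap = "unknown", 0.0
        for turn in turns:
            shared = overlap_seconds(segment, turn)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labeled.append((best_speaker, segment.text))
    return labeled
```

Segments that overlap no turn at all stay labeled "unknown" rather than being guessed, which fits the point above: a visibly missing label is less misleading than a wrong one.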

If you're weighing these trade-offs for a real product and would rather not run the whole evaluation alone, book a call and I'll walk you through it.
