sunmoon.dev

How I Built transcribe.so: Lessons From Shipping a Multi-Model ASR Platform

Seunghun Lee
case study, transcribe.so, ASR, AI agents

If you're building an AI product on top of speech recognition, you've probably hit the wall I hit: a single model that looks amazing in the demo falls apart on a two-hour podcast with three accents, crosstalk, and a domain vocabulary it has never seen. The transcript is 92% right, which sounds great until you realize the 8% is every proper noun your user actually cares about.

I built transcribe.so to live in that gap. It's a transcription and audio-intelligence platform I run solo, and the architecture decisions I made are the kind nobody tells you about until you've shipped to real users and watched the support inbox. This is the case study I wish I'd read before I started.

Why I route across multiple models instead of betting on one

The instinct when you start is to pick "the best" ASR model and ship it. That instinct is wrong, and it took me a few painful months to internalize why.

There is no single best model. There's a best model for a given input. Audio is not uniform — a clean studio interview, a noisy field recording, a Korean-English code-switched call, and a lecture full of chemistry terms are four different problems. A model that wins on one loses badly on another. When I benchmarked candidates against my own corpus, the ranking reshuffled every time the audio profile changed.

So transcribe.so routes. Each job gets classified — language, audio quality, domain signals, length — and dispatched to the model most likely to win on that profile. Here's roughly how I think about the lineup:

| Model family | Where it wins | Where I avoid it |
|---|---|---|
| OpenAI (Whisper-class) | Clean English, robust punctuation, broad coverage | Long-form cost, some non-English accents |
| Qwen | Strong multilingual + CJK, code-switching | Occasional formatting drift |
| Mistral | Fast, cheap, good European languages | Heavy domain jargon |

The point isn't the exact cells — those shift as new checkpoints ship. The point is that routing turns model churn into an advantage instead of a migration project. When a better model lands, I add it as a route, run it against the corpus, and let the classifier promote it where it actually wins. I never do a big-bang swap.
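One way to picture "let the classifier promote it where it actually wins" is a routing table rebuilt from corpus scores. The sketch below is illustrative only: the profile names, route names, and every number in `CORPUS_WER` are placeholders invented for the example, not measurements from transcribe.so, and the real classifier is not described in this post.

```python
# Hypothetical corpus results: CORPUS_WER[profile][model] = word error rate.
# PLACEHOLDER numbers invented for illustration, not real benchmark results.
CORPUS_WER = {
    "clean_en":     {"whisper_class": 0.04, "qwen": 0.06, "mistral": 0.05},
    "ko_en_switch": {"whisper_class": 0.15, "qwen": 0.09, "mistral": 0.18},
    "clean_de":     {"whisper_class": 0.07, "qwen": 0.08, "mistral": 0.06},
}


def build_routing_table(scores: dict[str, dict[str, float]]) -> dict[str, str]:
    """For each audio profile, pick the model with the lowest error rate."""
    return {
        profile: min(by_model, key=by_model.get)
        for profile, by_model in scores.items()
    }


def route(profile: str, table: dict[str, str], default: str = "whisper_class") -> str:
    """Look up the route for a profile, falling back to a default model."""
    return table.get(profile, default)


table = build_routing_table(CORPUS_WER)
print(route("ko_en_switch", table))   # qwen
print(route("unknown_profile", table))  # whisper_class (fallback)
```

Under this shape, adding a model means adding one more column of scores and rebuilding the table. It only takes over the profiles where its score is actually lowest.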

The single most valuable asset I have isn't a model. It's a golden corpus of real, messy audio with verified transcripts. Models are rented; the eval is owned.

That corpus is how I make routing decisions without guessing. Every pipeline change runs against it before it ships. If accuracy drops on any audio profile, the change doesn't go out. I learned that discipline the hard way — early on I shipped a "better" model that quietly regressed CJK accuracy, and I only found out from a user.
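The rule "if accuracy drops on any audio profile, the change doesn't go out" can be sketched as a word error rate (WER) check per profile. This is a minimal illustration under assumptions: the post does not say which metric the corpus uses, the function names are hypothetical, and whitespace-split WER is a poor fit for CJK text, where a character-level rate is the usual choice.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def regressed_profiles(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> list[str]:
    """Profiles where the candidate is worse than baseline, or was not scored."""
    return sorted(
        profile
        for profile, base in baseline.items()
        if candidate.get(profile, float("inf")) > base + tolerance
    )


def can_ship(baseline: dict[str, float], candidate: dict[str, float]) -> bool:
    """A change ships only if no profile regressed."""
    return not regressed_profiles(baseline, candidate)
```

A gate like this would have caught the CJK regression described above before release: an overall average can improve while one profile gets worse, which is why the check is per profile rather than global.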

How cited Q&A actually works

The feature people assume is simple — "ask a question about your audio" — is the one with the most failure modes. The naive version stuffs the transcript into a model and asks. It hallucinates timestamps, invents quotes, and confidently answers questions the audio never addressed.

The version I shipped treats citations as a hard constraint, not a nice-to-have:

  • Segment-level grounding. The transcript is chunked into timestamped segments before any question is asked. Every answer must point back to specific segments.
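  • Retrieval before generation. A question retrieves candidate segments first. The model only sees what's relevant, which kills most hallucination and keeps per-question generation cost roughly flat as audio length grows, because the prompt holds a fixed number of segments rather than the whole transcript.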
  • Verifiable citations. Each claim links to a timestamp the user can click and hear. If the model can't ground a claim, it says so instead of inventing one.
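The three constraints above can be sketched in a few lines. This is a rough illustration, not the shipped implementation: the `Segment` type and function names are hypothetical, and the token-overlap scoring is a deliberately simple stand-in for whatever retriever a real system would use.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    seg_id: int
    start_s: float  # where the segment begins in the audio, in seconds
    end_s: float
    text: str


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def retrieve(question: str, segments: list[Segment], k: int = 3) -> list[Segment]:
    """Return up to k segments sharing words with the question, best first."""
    q = _tokens(question)
    scored = [(len(q & _tokens(s.text)), s) for s in segments]
    hits = [(score, s) for score, s in scored if score > 0]
    hits.sort(key=lambda pair: (-pair[0], pair[1].start_s))
    return [s for _, s in hits[:k]]


def ungrounded_claims(
    claims: list[tuple[str, list[int]]], retrieved: list[Segment]
) -> list[str]:
    """Claims with no citation, or citing a segment the model was never shown."""
    allowed = {s.seg_id for s in retrieved}
    return [
        text for text, cited in claims
        if not cited or not set(cited) <= allowed
    ]
```

In this sketch only the retrieved segments reach the model. Any claim returned by `ungrounded_claims` would be dropped or replaced with an explicit "not found in the audio" answer, and an empty retrieval result means the question is refused rather than answered.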

This is the same instinct behind goodlisten.co, my other product, where surfacing the right moment in long audio matters more than summarizing it into mush. Long-form audio is mostly filler around a few load-bearing moments. The job of the product is to find those moments and prove they exist — and "prove" is the operative word. An AI answer you can't verify is a liability, not a feature.

Scaling long-audio pipelines without a team

A 90-minute file is not one transcription job. It's a distributed system. Here's the shape of the pipeline that took me the longest to get right:

Chunk, don't gulp

I split audio on silence boundaries, not fixed time windows, so I never cut a word in half. Each chunk transcribes independently and in parallel. This is what makes long files feel fast — a two-hour file isn't 24x slower than a five-minute one, because the chunks fan out.
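A minimal sketch of cutting on silence rather than on a fixed clock, assuming the audio has already been reduced to one loudness value (for example RMS) per short frame. The function names, the threshold, and the minimum silence length are illustrative assumptions, not values taken from transcribe.so.

```python
def silence_cut_points(
    frame_rms: list[float],
    frame_s: float = 0.02,
    threshold: float = 0.01,
    min_silence_s: float = 0.3,
) -> list[float]:
    """Return cut times in seconds, one at the midpoint of each long silent run.

    Silence at the very start or very end of the file produces no cut.
    """
    min_frames = max(1, round(min_silence_s / frame_s))
    cuts: list[float] = []
    run_start = None  # index of the first frame in the current silent run
    for i, level in enumerate(frame_rms):
        if level < threshold:
            if run_start is None:
                run_start = i
            continue
        # Loud frame: close any silent run that just ended before frame i.
        if run_start is not None and run_start > 0 and i - run_start >= min_frames:
            cuts.append((run_start + i) / 2 * frame_s)
        run_start = None
    return cuts


def chunk_spans(cuts: list[float], total_s: float) -> list[tuple[float, float]]:
    """Turn cut times into (start, end) spans covering the whole file."""
    edges = [0.0, *cuts, total_s]
    return list(zip(edges[:-1], edges[1:]))
```

Cutting at the midpoint of a pause leaves a little silence on both sides of each chunk, so no word straddles a boundary. A production version would also need a maximum chunk length for recordings with no long pauses, which this sketch leaves out.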

Aggregate carefully

The hard part is the seams. Chunk boundaries create overlap and duplicated words, and naive concatenation produces stutters ("the the", repeated half-sentences). I run an aggregation step that de-duplicates across boundaries and reconciles speaker labels so that the same voice keeps the same label from one chunk to the next. Most of my accuracy bugs over the past year lived in this seam, not in the models.
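The de-duplication half of that aggregation step can be sketched as a suffix/prefix match at the seam. This is a simplified illustration with a hypothetical function name: it compares words exactly (ignoring case) and does not attempt speaker-label reconciliation.

```python
def merge_at_seam(left: list[str], right: list[str], max_overlap: int = 12) -> list[str]:
    """Join two chunk transcripts, dropping words repeated across the boundary.

    Finds the longest run of words (up to max_overlap) that ends `left` and
    also starts `right`, and keeps only one copy of it.
    """
    limit = min(max_overlap, len(left), len(right))
    for n in range(limit, 0, -1):
        tail = [w.lower() for w in left[-n:]]
        head = [w.lower() for w in right[:n]]
        if tail == head:
            return left + right[n:]
    return left + right  # nothing repeated at the seam


left = "so we shipped the new model".split()
right = "the new model on friday".split()
print(" ".join(merge_at_seam(left, right)))
# so we shipped the new model on friday
```

Exact matching is the weak point of this sketch. Two models rarely transcribe an overlap identically, and a speaker who really did say "no no" across a boundary would lose a word. Matching on word timestamps instead of text alone is one way to avoid both problems.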

Make the queue boring

Long jobs fail in the middle — a model times out, a chunk errors, a node dies. The pipeline has to resume from the failed chunk, not restart the whole file. Idempotent chunk jobs and a durable queue are unglamorous and absolutely essential. A user who uploaded a two-hour recording will not forgive you for losing it at minute 80.
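A small sketch of "resume from the failed chunk": each chunk result is stored under a stable key, so a rerun skips finished work. The `done` dict stands in for a durable store such as a database table, and `run_file` and its parameters are hypothetical names rather than the real pipeline's API.

```python
from typing import Callable


def run_file(
    file_id: str,
    chunks: list[str],
    transcribe: Callable[[str], str],
    done: dict[tuple[str, int], str],
) -> list[str]:
    """Transcribe each chunk at most once and return the texts in order.

    If `transcribe` raises, results already written to `done` are kept, so
    calling run_file again continues from the chunk that failed.
    """
    for index, chunk in enumerate(chunks):
        key = (file_id, index)  # stable key makes the chunk job idempotent
        if key in done:
            continue  # finished in an earlier attempt
        done[key] = transcribe(chunk)
    return [done[(file_id, i)] for i in range(len(chunks))]
```

The key is derived from the file and the chunk position, not from the attempt, so a retry that overlaps an earlier attempt writes the same slot instead of creating a duplicate.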

My time at Spotify and Klarna shaped this more than any tutorial did. Between them, they run audio and payments pipelines where a dropped job isn't a retry, it's a real-world consequence. I carried that "assume failure, design for resume" reflex straight into transcribe.so. (To be clear: those were my employers, not partners of this studio.) Earlier, at a Y Combinator–backed startup, I learned the opposite lesson just as deeply: ship before it's perfect, because the corpus of real failures is worth more than any amount of pre-launch polish.

What I'd tell anyone building an AI product solo

A few things I believe more strongly now than when I started:

  • Own your eval, rent your models. The model layer is a commodity that changes monthly. Your evaluation corpus is the moat. Build it first.
  • Abstract the model boundary on day one. If swapping a provider touches more than one file, you've coupled too tightly. Routing is impossible without this.
  • Cost is a feature. Routing cheap models to easy audio and expensive ones to hard audio isn't penny-pinching — it's what lets a solo founder run real margins without a funding round.
  • Make failure observable. I can tell you which audio profile regressed, on which model, in which pipeline stage, within minutes. That observability is the only reason one person can operate this.
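The "abstract the model boundary" point might look like the sketch below: one small interface, one registry, and callers that only ever name a route. Everything here is hypothetical (`Transcriber`, `REGISTRY`, and the fake provider), and real provider adapters would wrap each vendor's own SDK behind the same method.

```python
from typing import Protocol


class Transcriber(Protocol):
    """The only surface the rest of the pipeline is allowed to depend on."""

    name: str

    def transcribe(self, audio: bytes, language: str) -> str: ...


class FakeTranscriber:
    """Stand-in provider for tests; a real adapter would call a vendor API."""

    def __init__(self, name: str) -> None:
        self.name = name

    def transcribe(self, audio: bytes, language: str) -> str:
        return f"[{self.name}:{language}] {len(audio)} bytes"


REGISTRY: dict[str, Transcriber] = {}


def register(provider: Transcriber) -> None:
    """Adding or swapping a provider touches only this registration."""
    REGISTRY[provider.name] = provider


def transcribe_with(route: str, audio: bytes, language: str) -> str:
    """Callers pass a route name and never import a provider directly."""
    return REGISTRY[route].transcribe(audio, language)


register(FakeTranscriber("fake_a"))
print(transcribe_with("fake_a", b"abc", "en"))  # [fake_a:en] 3 bytes
```

With this shape, swapping a provider means writing one adapter class and changing one `register` call, which keeps the change inside a single file.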

You don't need a team to ship something real. You need a tight loop: a corpus, a router, a resumable pipeline, and the discipline to never ship a change that regresses the corpus.

Frequently Asked Questions

Why not just use one ASR model to keep it simple?

Because there is no single best model across all audio — the winner changes with language, noise, and domain. One model means accepting the worst case on every profile it's weak at. Routing lets each job go to the model most likely to win, and it turns the constant churn of new models into an upgrade path instead of a migration.

How does multi-model routing affect cost?

It lowers it, often dramatically. Easy audio goes to fast, cheap models; only hard audio pays for the expensive ones. For a solo operator, this is the difference between healthy margins and burning cash, and it's a big reason I can run transcribe.so without outside funding.

How do you keep cited Q&A from hallucinating?

I treat citations as a hard constraint. Answers are grounded in timestamped transcript segments retrieved before generation, and every claim links back to a moment the user can click and verify. If the model can't ground a claim in the audio, it says so rather than inventing one.

Can a single person really operate a pipeline like this?

Yes, but only with the right foundations: an owned evaluation corpus, a clean model-routing boundary, a resumable chunked pipeline, and strong observability. Those four things are what let me ship, debug, and scale transcribe.so alone without it becoming a second full-time job.

If you're building something in this space and want a second set of eyes from someone who's shipped it solo, book a call.
