
Building a Speech-to-Text Product: The Architecture That Survived Production

Seunghun Lee
Tags: ASR, speech to text, architecture, case study

Your team has been asked to add transcription to the product. You wire up an ASR API, test it on a two-minute clip, ship it behind a flag — and then a user uploads a three-hour board meeting recorded on a phone in a conference room, the request times out, the retry double-bills you, and the transcript that finally arrives has no idea who said what or when.

I've been through that exact arc. I run transcribe.so, a production speech-to-text product, and before that I spent years as a senior engineer at Spotify and Klarna and worked with a Y Combinator–backed startup. This post is the architecture that survived contact with real users: chunking, durable pipelines, multi-model routing, timestamps, exports, and the cost math nobody publishes.

The naive version fails in a predictable order

Almost every team builds the same first version: an HTTP endpoint that accepts an upload, calls an ASR API synchronously, and returns the text. It falls apart in stages:

  • First the timeout. Long audio takes minutes to transcribe. HTTP connections, load balancers, and serverless functions all give up before the model does.
  • Then the retry storm. The client retries the timed-out request, you transcribe the same file twice, and you pay twice.
  • Then the memory blowup. A 2 GB WAV file does not fit in the memory of a typical serverless function. Someone will upload one.
  • Then the accuracy complaints. One model, one language, one audio condition — the moment usage diversifies, your single hardcoded model is wrong for half your traffic.

Every fix below exists because one of these happened to me in production.

Chunking: the decision everything else depends on

Long audio has to be split. The question is where you split it, because the chunk boundary is where errors concentrate: a word cut in half is a word transcribed wrong on both sides.

What works in practice:

  • Normalize first. Transcode everything to a single intermediate format (16 kHz mono is the usual target) before chunking. ffmpeg does this in a streaming pass; never load the whole file into memory.
  • Split on silence, not on time. Fixed 30-second windows slice through words. Use voice-activity detection to find silence gaps and place boundaries there, with a hard maximum chunk length as a fallback for people who never stop talking.
  • Overlap as insurance. A few seconds of overlap between adjacent chunks lets the aggregation step reconcile boundary words instead of losing them. You de-duplicate by aligning the overlapping word timestamps.
  • Keep offsets from the start. Every chunk carries its absolute start time in the original file. Without this, word-level timestamps and subtitle export downstream are unrecoverable.
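The boundary rules above can be sketched as a small planning function. This is a minimal illustration, not the transcribe.so implementation: it assumes a VAD pass has already produced a sorted list of silence gaps, and the names (`Chunk`, `plan_chunks`) and default values (`target`, `max_len`, `min_len`, `overlap`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Chunk:
    index: int
    start: float  # absolute start in the original file, seconds
    end: float    # absolute end in the original file, seconds


def plan_chunks(
    duration: float,
    silences: List[Tuple[float, float]],
    target: float = 30.0,   # preferred chunk length (assumed value)
    max_len: float = 60.0,  # hard maximum when no silence is found
    min_len: float = 5.0,   # avoid tiny chunks
    overlap: float = 2.0,   # seconds shared with the previous chunk
) -> List[Chunk]:
    """Cut at silence midpoints near `target`, falling back to `max_len`."""
    cut_points = [(s + e) / 2 for s, e in silences]
    chunks: List[Chunk] = []
    cut = 0.0
    while cut < duration:
        hard_end = min(cut + max_len, duration)
        if hard_end >= duration:
            end = duration  # the remainder fits in one chunk
        else:
            candidates = [c for c in cut_points if cut + min_len < c <= hard_end]
            # Prefer the silence closest to the target length; otherwise hard cut.
            end = min(candidates, key=lambda c: abs(c - (cut + target))) if candidates else hard_end
        # Every chunk after the first starts `overlap` seconds before the cut.
        chunks.append(Chunk(len(chunks), max(0.0, cut - overlap), end))
        cut = end
    return chunks
```

Each `Chunk` carries absolute times, so the offset needed for timestamps later is never lost. The overlap mostly matters for hard cuts, where a word may straddle the boundary.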

The aggregator that stitches chunk transcripts back together is the hardest code in the system. Budget real engineering time for it; it is not a string concat.
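To show why this is more than concatenation, here is a deliberately simplified sketch of the stitching step. It assumes each chunk reports words with chunk-relative times, and it resolves each overlap with a single seam at the overlap midpoint rather than aligning the overlapping words; the names `Word` and `merge_chunks` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def merge_chunks(chunks: List[Tuple[float, float, List[Word]]]) -> List[Word]:
    """Stitch chunk transcripts into one absolute-time word list.

    chunks: (offset, abs_end, words) tuples sorted by offset, where word
    times are relative to the chunk and offset/abs_end are absolute seconds.
    """
    merged: List[Word] = []
    for i, (offset, abs_end, words) in enumerate(chunks):
        # A chunk owns the span between the seam with its left neighbour
        # and the seam with its right neighbour (overlap midpoints).
        lo = float("-inf") if i == 0 else (offset + chunks[i - 1][1]) / 2
        hi = float("inf") if i == len(chunks) - 1 else (chunks[i + 1][0] + abs_end) / 2
        for w in words:
            start, end = w.start + offset, w.end + offset  # to absolute time
            if lo <= (start + end) / 2 < hi:
                merged.append(Word(w.text, start, end))
    return merged
```

The midpoint rule keeps each overlapped word exactly once only when both chunks agree on roughly where it sits. If the two transcriptions disagree about timing near the seam, a word can still be dropped or duplicated, which is the case a real aggregator has to handle by aligning the overlapping words.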

Durable jobs, not requests

Transcription is a pipeline. On transcribe.so the shape is: upload (resumable, direct to object storage) → transcode → chunk → transcribe each chunk in parallel → aggregate → enrich (speakers, punctuation, summary) → export. Each arrow is a queue.

The properties that matter more than your choice of queue technology:

  • Idempotency keys on every step. A chunk transcribed twice must produce one result and one charge. Key on content hash plus step name.
  • Per-step retries with backoff, because ASR providers have bad minutes, and a transient 500 on chunk 41 of 60 should not restart the whole job.
  • A dead-letter path with the audio still addressable, so a failed job can be replayed after a fix instead of asking the user to re-upload.
  • Job state in a database, progress events to the client. Users will wait several minutes for a long file if they can see chunk-level progress. They will not wait thirty seconds in front of a spinner.
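The first two properties can be sketched together. This is a minimal, in-process illustration under stated assumptions: `store` is a plain dict standing in for a database table with a unique key, `TransientError` is a hypothetical exception a step raises for retryable failures, and the retry counts and delays are placeholders.

```python
import hashlib
import random
import time


class TransientError(Exception):
    """Raised by a step for failures worth retrying (e.g. a provider 500)."""


def idempotency_key(content: bytes, step: str) -> str:
    # Content hash plus step name: same bytes + same step -> same key.
    return f"{step}:{hashlib.sha256(content).hexdigest()}"


def run_step(key, fn, store, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run fn at most once per key, retrying transient failures with backoff."""
    if key in store:
        return store[key]  # already done: no second call, no second charge
    for attempt in range(1, max_attempts + 1):
        try:
            result = fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # exhausted: let the caller route the job to dead-letter
            # Exponential backoff with jitter.
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay))
            continue
        store[key] = result
        return result
```

In a real pipeline the check-and-write on `store` would need to be atomic (for example a unique constraint on the key), otherwise two workers can both pass the `key in store` check and transcribe the same chunk.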

The ASR model is maybe 20% of the system. The other 80% is a media pipeline with strong opinions about failure.

This is where my years at Spotify and Klarna show up in the design. Both drilled in the same lesson: at scale, the happy path is a rounding error. The system is its failure modes.

Multi-model routing: one model is never right

I learned this the slow way. Different ASR models win on different inputs — one is best on clean English speech, another on accented or multilingual audio, another on noisy field recordings, and self-hosted Whisper variants win when cost per minute matters more than peak accuracy.

So transcribe.so routes per job rather than hardcoding a provider. The routing signal is cheap to compute: declared or detected language, audio SNR from the transcode pass, duration, and the user's accuracy-versus-cost preference. A few practical notes:

  • Normalize provider outputs into one internal schema immediately — words, start/end times, confidence, speaker label. Every downstream feature reads the schema, never the provider response. This is what makes adding a new model a one-week job instead of a rewrite.
  • Keep a verification harness. I replay a golden corpus of real-world audio (accents, noise, jargon, crosstalk) on every routing or prompt change. Word error rate (WER) on a benchmark you didn't curate tells you almost nothing about your traffic.
  • Expect to swap models. The ASR leaderboard reshuffles every few months. If swapping the default model for a language is a config change, you ride the improvements. If it's a refactor, you fall behind.
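A rough sketch of what routing-as-config plus a normalized schema can look like. Everything named here is hypothetical: the model identifiers, the 15 dB SNR threshold, the rule order, and the provider payload shape in `normalize_example_provider` are invented for illustration and do not describe any real provider's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Word:
    """Internal schema every downstream feature reads."""
    text: str
    start: float       # seconds
    end: float         # seconds
    confidence: float
    speaker: Optional[str] = None


@dataclass(frozen=True)
class JobSignals:
    language: str      # declared or detected
    snr_db: float      # measured during the transcode pass
    duration_s: float  # carried as a signal; no rule below uses it
    prefer: str        # "accuracy" or "cost"


# First matching rule wins. Swapping a model is an edit to this table.
ROUTES = [
    (lambda s: s.prefer == "cost", "selfhosted-whisper"),
    (lambda s: s.snr_db < 15.0, "noise-robust-model"),
    (lambda s: s.language != "en", "multilingual-model"),
]
DEFAULT_MODEL = "clean-english-model"


def choose_model(signals: JobSignals) -> str:
    for matches, model in ROUTES:
        if matches(signals):
            return model
    return DEFAULT_MODEL


def normalize_example_provider(payload: dict) -> List[Word]:
    """Map one made-up provider response shape (milliseconds) to the schema."""
    return [
        Word(
            text=item["w"],
            start=item["t0"] / 1000,
            end=item["t1"] / 1000,
            confidence=item.get("p", 0.0),
            speaker=item.get("spk"),
        )
        for item in payload["tokens"]
    ]
```

Each provider gets its own small normalizer like the one above, and nothing past that function ever sees the provider's field names.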

The same multi-engine discipline carries over to my other product, goodlisten.co, which runs podcast audio through transcription before doing discovery and chaptering on top — different product, same pipeline bones. Building the abstraction once paid for itself the second time.

Word-level timestamps and subtitle export

Word timestamps are the substrate for almost every feature users pay for: click-to-seek players, highlight clips, quote extraction, and subtitles.

The chunking offsets you preserved earlier come due here: a word's absolute time is its chunk-relative time plus the chunk's offset, reconciled across overlaps. Get that wrong by even 200 ms and subtitles are visibly out of sync with the audio.

For subtitle export (SRT/VTT), the model gives you words; the format needs cues. The re-segmentation rules that produce subtitles people can actually read:

  • Roughly 42 characters per line, max two lines per cue
  • Cue duration between ~1 and 7 seconds
  • Break cues at punctuation and speaker changes, never mid-clause if avoidable
  • Snap cue boundaries to word timestamps, not arbitrary times

This is fiddly, boring code. It is also the difference between an export feature users trust and one they run through a fixer tool afterwards.
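For a sense of the shape of that code, here is a minimal sketch of word-to-cue re-segmentation with SRT output. It assumes words already carry absolute times and an optional speaker label; the names are hypothetical, it breaks only at sentence-ending punctuation (not clause boundaries), and it does not enforce the minimum cue duration.

```python
import textwrap
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # absolute seconds
    end: float    # absolute seconds
    speaker: Optional[str] = None


def format_ts(seconds: float) -> str:
    """SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"


def cue_text(words: List[Word]) -> str:
    return " ".join(w.text for w in words)


def build_cues(words, line_chars=42, max_lines=2, max_dur=7.0) -> List[List[Word]]:
    cues, current = [], []
    for w in words:
        if current:
            candidate = cue_text(current + [w])
            too_long = len(textwrap.wrap(candidate, line_chars)) > max_lines
            too_slow = w.end - current[0].start > max_dur
            new_speaker = w.speaker != current[-1].speaker
            sentence_end = current[-1].text.endswith((".", "?", "!"))
            if too_long or too_slow or new_speaker or sentence_end:
                cues.append(current)
                current = []
        current.append(w)
    if current:
        cues.append(current)
    return cues


def to_srt(cues: List[List[Word]], line_chars: int = 42) -> str:
    blocks = []
    for i, cue in enumerate(cues, start=1):
        # Cue boundaries snap to the first and last word's timestamps.
        timing = f"{format_ts(cue[0].start)} --> {format_ts(cue[-1].end)}"
        lines = "\n".join(textwrap.wrap(cue_text(cue), line_chars))
        blocks.append(f"{i}\n{timing}\n{lines}")
    return "\n\n".join(blocks) + "\n"
```

WebVTT output differs mainly in the `WEBVTT` header and a `.` instead of `,` before the milliseconds, so the cue-building step can be shared between both exports.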

Cost per minute: the math you should do before building

Cost is where speech-to-text products quietly die, so here is the honest comparison I wish someone had written down for me:

| Approach | Typical cost per audio-minute | Accuracy ceiling | Ops burden | When it wins |
|---|---|---|---|---|
| Premium API (Deepgram/AssemblyAI-class) | $0.005–$0.015 | Highest, with diarization built in | Near zero | Default for most products |
| Big-cloud ASR (GCP/AWS/Azure) | $0.016–$0.024 | Good | Low | Enterprise procurement constraints |
| Self-hosted Whisper on GPU | $0.001–$0.006 at high utilization | Good, no managed diarization | High | Sustained volume, >50k min/month |
| On-device / open small models | ~$0 marginal | Lowest | Medium | Privacy-first or offline products |

Three things the table hides:

  • Utilization decides self-hosting. A GPU transcribing 8 hours a day is cheap per minute; the same GPU at 5% utilization is more expensive than any API. Most products under ~50k minutes a month should not self-host.
  • Your real unit cost includes everything around the model — transcoding compute, storage, egress, retries, and the enrichment LLM calls. My all-in cost runs roughly 1.5–2x the raw ASR line item.
  • Price per minute against value per minute. A user transcribing a client interview gets hours of value from a $0.05 file. Don't let the small absolute numbers push you into per-minute pricing wars; package by outcome.
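The utilization point is easy to check with arithmetic. The sketch below assumes a simple model where a GPU has a fixed hourly price and, while busy, processes audio at some multiple of real time; the function names and any numbers you plug in (GPU price, realtime factor) are illustrative assumptions, not measurements.

```python
def self_hosted_cost_per_min(gpu_hourly_usd: float, realtime_factor: float, utilization: float) -> float:
    """Raw GPU cost per audio-minute.

    realtime_factor: audio minutes transcribed per wall-clock minute while busy.
    utilization: fraction of paid GPU time spent transcribing, in (0, 1].
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    audio_minutes_per_paid_hour = 60 * realtime_factor * utilization
    return gpu_hourly_usd / audio_minutes_per_paid_hour


def breakeven_utilization(gpu_hourly_usd: float, realtime_factor: float, api_price_per_min: float) -> float:
    """Utilization at which self-hosting matches the API price per minute."""
    return gpu_hourly_usd / (60 * realtime_factor * api_price_per_min)


def all_in_cost(raw_asr_cost: float, overhead_multiplier: float = 1.75) -> float:
    """Apply the 1.5-2x overhead range from above (1.75 is the midpoint)."""
    return raw_asr_cost * overhead_multiplier
```

With an assumed $1.00/hour GPU running at 20x real time, 8 hours a day of work (utilization 1/3) comes out to about $0.0025 per audio-minute, while 5% utilization comes out to about $0.017, which is above the premium API range in the table. The model leaves out ops effort entirely, so the real break-even sits higher than `breakeven_utilization` suggests.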

If you're adding transcription to an existing product

The condensed playbook:

  1. Start with a managed API and async jobs from day one. Skip the synchronous version entirely — you'll throw it away within a month.
  2. Build the chunker, the aggregator, and the normalized transcript schema before any UI. They're the foundation everything sits on.
  3. Make model choice a routing decision, even if you launch with one model.
  4. Preserve word timestamps end to end, even if your v1 only shows plain text.
  5. Instrument cost per job from the first transcription, not after the first scary invoice.

That ordering front-loads the parts that are expensive to retrofit and defers everything that isn't.

Frequently Asked Questions

Should I self-host Whisper or use a managed ASR API?

Use a managed API until your sustained volume passes roughly 50,000 audio-minutes per month and you have someone who can own GPU infrastructure. Below that, API pricing beats your effective self-hosted cost once you account for idle GPU time and ops effort. Above it, self-hosting Whisper-class models can cut the raw transcription line item by 60–80%.

How do I get accurate word-level timestamps on long audio?

Preserve each chunk's absolute offset in the original file, request word timings from the model, and reconcile overlapping words at chunk boundaries during aggregation. The most common bug is dropping or double-counting boundary words, which silently shifts every timestamp after the seam. A golden test file with known word positions catches this in CI.

What's the hardest part of building a speech-to-text product?

Not the model — the pipeline. Resumable uploads, silence-aware chunking, idempotent retries, and the aggregator that stitches chunks back together consume most of the engineering time. The model call itself is a few dozen lines; running it reliably on three-hour files from real users is the product.

How long does it take to ship a production-grade transcription feature?

With a managed ASR API and an existing job-queue setup, a small team can ship a solid async version in three to six weeks. Multi-model routing, speaker diarization, and subtitle export typically add another month. The timeline doubles when teams start with a synchronous architecture and have to rebuild — which is why I recommend skipping it.


If you're adding voice or transcription to your product and want to skip the failure modes I paid for, book a call and I'll walk through your architecture with you.
