What makes sunmoon.dev different from an agency?

You work directly with me — a SaaS founder who builds and operates his own AI products (transcribe.so, goodlisten.co). There's no account manager, no junior bench, and no hand-off. The person who scopes your build is the person who ships it.

How long does it take to build an AI application?

It depends on complexity. A basic AI agent can ship in 4–6 weeks; more involved solutions with custom features take 3–6 months. You'll get a concrete timeline on the first call before any commitment.

Do you provide ongoing support after launch?

Yes. Every build includes 6 months of email support, and I offer a discounted on-call package for extended maintenance and feature work as you scale.

What technologies do you use?

Industry-standard, mostly open tooling to avoid vendor lock-in: Next.js and Node for the app layer, Postgres/Supabase for data, and the right AI models picked per task (OpenAI, Qwen, Mistral, and others). Exact stack is chosen around your requirements.

How do you handle data privacy and security?

Encryption in transit and at rest, scoped access, and compliance-aware design. Where it matters, solutions can be deployed on your own infrastructure so you keep full control of your data.

How is pricing structured?

Transparent and upfront. Consulting starts at $200/hour; AI agent and full-stack SaaS builds start at $10,000, with final cost depending on scope. You get a detailed quote after the first call.

How I Built transcribe.so: Lessons From Shipping a Multi-Model ASR Platform

If you're building an AI product on top of speech recognition, you've probably hit the wall I hit: a single model that looks amazing in the demo falls apart on a two-hour podcast with three accents, crosstalk, and a domain vocabulary it has never seen. The transcript is 92% right, which sounds great until you realize the 8% is every proper noun your user actually cares about.

I built transcribe.so to live in that gap. It's a transcription and audio-intelligence platform I run solo, and the architecture decisions I made are the kind nobody tells you about until you've shipped to real users and watched the support inbox. This is the case study I wish I'd read before I started.

Why I route across multiple models instead of betting on one

The instinct when you start is to pick "the best" ASR model and ship it. That instinct is wrong, and it took me a few painful months to internalize why.

There's no best model, only a best model for a given input. Audio is not uniform: a clean studio interview, a noisy field recording, a Korean-English code-switched call, and a lecture full of chemistry terms are four different problems. A model that wins on one loses badly on another. When I benchmarked candidates against my own corpus, the ranking reshuffled every time the audio profile changed.

So transcribe.so routes. Each job gets classified — language, audio quality, domain signals, length — and dispatched to the model most likely to win on that profile. Here's roughly how I think about the lineup:

Model family	Where it wins	Where I avoid it
OpenAI (Whisper-class)	Clean English, robust punctuation, broad coverage	Long-form cost, some non-English accents
Qwen	Strong multilingual + CJK, code-switching	Occasional formatting drift
Mistral	Fast, cheap, good European languages	Heavy domain jargon

The point isn't the exact cells — those shift as new checkpoints ship. The point is that routing turns model churn into an advantage instead of a migration project. When a better model lands, I add it as a route, run it against the corpus, and let the classifier promote it where it actually wins. I never do a big-bang swap.
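To make the routing idea concrete, here is a minimal sketch in Python. It is an illustration, not the production router: the `JobProfile` fields, the rule order, and the model labels are all assumptions chosen to loosely mirror the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobProfile:
    """Hypothetical output of the classification step."""
    language: str        # dominant language code, e.g. "en", "ko", "de"
    noisy: bool          # result of an audio-quality check
    code_switched: bool  # more than one language in the same recording
    minutes: float       # duration of the recording


# Hypothetical route table: (predicate, model family). First match wins.
ROUTES = [
    (lambda p: p.code_switched or p.language in {"zh", "ja", "ko"}, "qwen"),
    (lambda p: p.language in {"de", "fr", "es", "it"} and not p.noisy, "mistral"),
    (lambda p: p.language == "en", "openai"),
]
DEFAULT_MODEL = "openai"


def route(profile: JobProfile) -> str:
    """Return the model family for a job, falling back to the default."""
    for matches, model in ROUTES:
        if matches(profile):
            return model
    return DEFAULT_MODEL
```

With this shape, trying a new model means adding one entry to `ROUTES` and re-running the evaluation, rather than swapping the whole pipeline.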

The single most valuable asset I have isn't a model. It's a golden corpus of real, messy audio with verified transcripts. Models are rented; the eval is owned.

That corpus is how I make routing decisions without guessing. Every pipeline change runs against it before it ships. If accuracy drops on any audio profile, the change doesn't go out. I learned that discipline the hard way — early on I shipped a "better" model that quietly regressed CJK accuracy, and I only found out from a user.
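The gate itself can be small. The sketch below assumes accuracy is tracked as word error rate (WER) per audio profile; the profile names, the `tolerance` parameter, and the choice of WER as the metric are assumptions for illustration, not a description of the real pipeline.

```python
from __future__ import annotations


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def regressed_profiles(baseline: dict[str, float],
                       candidate: dict[str, float],
                       tolerance: float = 0.0) -> list[str]:
    """Profiles whose candidate WER exceeds the baseline by more than `tolerance`.

    A profile missing from `candidate` counts as a regression (WER 1.0).
    """
    return sorted(p for p, base in baseline.items()
                  if candidate.get(p, 1.0) > base + tolerance)
```

A change ships only when `regressed_profiles(...)` comes back empty. Checking per profile rather than on the overall average is what would have caught a CJK-only regression.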

How cited Q&A actually works

The feature people assume is simple — "ask a question about your audio" — is the one with the most failure modes. The naive version stuffs the transcript into a model and asks. It hallucinates timestamps, invents quotes, and confidently answers questions the audio never addressed.

The version I shipped makes citations a hard constraint:

Segment-level grounding. The transcript is chunked into timestamped segments before any question is asked. Every answer must point back to specific segments.
Retrieval before generation. A question retrieves candidate segments first. The model only sees what's relevant, which kills most hallucination and keeps cost flat as audio length grows.
Verifiable citations. Each claim links to a timestamp the user can click and hear. If the model can't ground a claim, it says so instead of inventing one.
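Those three rules can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the `Segment` shape is hypothetical, retrieval here is plain word overlap standing in for whatever retriever a real system would use, and the generation step is left out entirely.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    id: int
    start_s: float  # segment start, in seconds
    end_s: float    # segment end, in seconds
    text: str


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def retrieve(question: str, segments: list[Segment], k: int = 3) -> list[Segment]:
    """Return up to k segments sharing words with the question, best match first."""
    q = _tokens(question)
    scored = [(len(q & _tokens(s.text)), s) for s in segments]
    scored = [(score, s) for score, s in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1].start_s))
    return [s for _, s in scored[:k]]


def ground(cited_ids: list[int], retrieved: list[Segment]) -> list[Segment]:
    """Keep only citations that point at segments the model was actually shown.

    An empty result means the answer could not be grounded, and the caller
    should say so rather than return the answer.
    """
    allowed = {s.id: s for s in retrieved}
    return [allowed[i] for i in cited_ids if i in allowed]
```

The design choice that matters is that `ground` runs after generation and can only pass through segment ids from the retrieved set, so a made-up timestamp has nothing to attach to.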

This is the same instinct behind goodlisten.co, my other product, where surfacing the right moment in long audio matters more than summarizing it into mush. Long-form audio is mostly filler around a few load-bearing moments. The job of the product is to find those moments and prove they exist. An AI answer you can't verify is a liability, not a feature.

Scaling long-audio pipelines without a team

Once a file passes the hour mark, you're not running a transcription job anymore — you're running a distributed system. Here's the shape of the pipeline that took me the longest to get right:

Chunk, don't gulp

I split audio on silence boundaries, not fixed time windows, so I never cut a word in half. Each chunk transcribes independently and in parallel. This is what makes long files feel fast — a two-hour file isn't 24x slower than a five-minute one, because the chunks fan out.
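As a rough sketch of silence-based splitting, assume the audio has already been reduced to one energy value per frame (for example, RMS over 20 ms windows). The function name, `threshold`, and `min_silence_frames` are illustrative assumptions; a real splitter would also cap the maximum chunk length.

```python
from __future__ import annotations


def split_on_silence(energies: list[float],
                     threshold: float,
                     min_silence_frames: int) -> list[tuple[int, int]]:
    """Return [start, end) frame ranges, cutting in the middle of each long silence."""
    cuts = []
    run_start = None  # index where the current silent run began, if any
    for i, energy in enumerate(energies):
        if energy < threshold:
            if run_start is None:
                run_start = i
        else:
            long_enough = run_start is not None and i - run_start >= min_silence_frames
            # Skip leading silence (run_start == 0): there is nothing before it to split off.
            if long_enough and run_start > 0:
                cuts.append((run_start + i) // 2)
            run_start = None
    bounds = [0, *cuts, len(energies)]
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if b > a]
```

Cutting at the midpoint of a pause leaves a little silence on both sides of the seam, which gives each chunk some padding around its first and last word.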

Aggregate carefully

The hard part is the seams. Chunk boundaries create overlap and duplicated words, and naive concatenation produces stutters ("the the", repeated half-sentences). I run an aggregation step that de-duplicates across boundaries and reconciles speaker labels so a speaker stays the same person across chunk boundaries. Most of my accuracy bugs over the past year lived in this seam, not in the models.
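The simplest form of seam de-duplication looks like the sketch below. It assumes adjacent chunks overlap slightly and that the repeated words match exactly (ignoring case); `merge_chunks` and `max_overlap` are hypothetical names. Real ASR output often disagrees with itself at the seam, so a production version would need fuzzy alignment, and speaker-label reconciliation is not covered here.

```python
from __future__ import annotations


def merge_chunks(left: list[str], right: list[str], max_overlap: int = 20) -> list[str]:
    """Join two word lists, dropping the longest suffix of `left`
    that is repeated as a prefix of `right`."""
    limit = min(max_overlap, len(left), len(right))
    for n in range(limit, 0, -1):  # try the longest overlap first
        tail = [w.lower() for w in left[-n:]]
        head = [w.lower() for w in right[:n]]
        if tail == head:
            return left + right[n:]
    return left + right  # no repeated words at the seam
```

One caveat: if the chunks do not actually overlap, a genuine repetition at the boundary (a speaker saying "no, no") would be collapsed, so this only makes sense when the splitter is known to produce overlapping chunks.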

Make the queue boring

Long jobs fail in the middle — a model times out, a chunk errors, a node dies. The pipeline has to resume from the failed chunk, not restart the whole file. Idempotent chunk jobs and a durable queue are unglamorous and essential. A user who uploaded a two-hour recording will not forgive you for losing it at minute 80.
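The resume behavior can be sketched as follows. This is a minimal illustration: `done` is a plain dict standing in for a durable store (a database table or a queue's result backend), and `run_job` and the chunk ids are hypothetical.

```python
from __future__ import annotations

from typing import Callable


def run_job(chunk_ids: list[str],
            transcribe: Callable[[str], str],
            done: dict[str, str]) -> dict[str, str]:
    """Transcribe every chunk not already in `done`.

    Safe to call again after a failure: finished chunks are skipped,
    so a retry resumes from the chunk that failed.
    """
    for chunk_id in chunk_ids:
        if chunk_id in done:  # idempotent: finished work is never redone
            continue
        # If this raises, results for earlier chunks are already saved in `done`.
        done[chunk_id] = transcribe(chunk_id)
    return done
```

Because each result is saved as soon as its chunk finishes, a timeout at minute 80 costs one chunk of rework instead of the whole file.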

My years at Spotify and Klarna shaped this more than any tutorial — at one, a dropped job meant a listener staring at a spinner; at the other, it meant money in the wrong place. I carried that assume-failure, design-for-resume reflex straight into transcribe.so. A stint with a YC-backed startup taught me the opposite lesson just as well: ship before it's perfect, because a corpus of real failures is worth more than any amount of pre-launch polish.

What I'd tell anyone building an AI product solo

A few things I believe more strongly now than when I started:

Own your eval, rent your models. The model layer is a commodity that changes monthly. Your evaluation corpus is the moat. Build it first.
Abstract the model boundary on day one. If swapping a provider touches more than one file, you've coupled too tightly. Routing is impossible without this.
Cost is a feature. Routing cheap models to easy audio and expensive ones to hard audio is what lets a solo founder run real margins without a funding round.
Make failure observable. I can tell you which audio profile regressed, on which model, in which pipeline stage, within minutes. That observability is the only reason one person can operate this.
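To illustrate the model-boundary and cost points above, here is one possible shape for the abstraction. Everything in it is an assumption for the sake of the example: the `Transcriber` protocol, the registry, the provider names, and the per-minute prices are invented, and `FakeTranscriber` stands in for a real vendor wrapper.

```python
from __future__ import annotations

from typing import Protocol


class Transcriber(Protocol):
    """The only interface the rest of the pipeline is allowed to see."""
    name: str
    cost_per_minute: float

    def transcribe(self, audio_path: str) -> str: ...


class FakeTranscriber:
    """Stand-in provider; a real one would wrap a vendor SDK or HTTP call."""

    def __init__(self, name: str, cost_per_minute: float) -> None:
        self.name = name
        self.cost_per_minute = cost_per_minute

    def transcribe(self, audio_path: str) -> str:
        return f"[{self.name}] transcript of {audio_path}"


REGISTRY: dict[str, Transcriber] = {}


def register(transcriber: Transcriber) -> None:
    """Adding or swapping a provider touches only this registration."""
    REGISTRY[transcriber.name] = transcriber


def cheapest(candidates: list[str]) -> Transcriber:
    """Among models judged good enough for a profile, pick the cheapest."""
    return min((REGISTRY[n] for n in candidates), key=lambda t: t.cost_per_minute)
```

The router decides which models are acceptable for a profile, and `cheapest` settles the tie on price, which is how easy audio ends up on the inexpensive model.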

You don't need a team to ship something real. You need a tight loop: a corpus, a router, a resumable pipeline, and the discipline to never ship a change that regresses the corpus.

Frequently Asked Questions

Why not just use one ASR model to keep it simple?

Because there is no single best model across all audio — the winner changes with language, noise, and domain. One model means accepting the worst case on every profile it's weak at. Routing sends each job to the model most likely to win on that profile, and it means a new model release is a candidate route to test, not a migration to plan.

How does multi-model routing affect cost?

It lowers it, often dramatically. Easy audio goes to fast, cheap models; only hard audio pays for the expensive ones. For a solo operator, this is the difference between healthy margins and burning cash, and it's a big reason I can run transcribe.so without outside funding.

How do you keep cited Q&A from hallucinating?

Citations are a hard constraint, not a formatting choice. Answers are grounded in timestamped transcript segments retrieved before generation, and every claim links back to a moment the user can click and verify. When a claim can't be grounded in the audio, the answer admits it — no quote gets invented to fill the gap.

Can a single person really operate a pipeline like this?

Yes, but only with the right foundations: an owned evaluation corpus, a clean model-routing boundary, a resumable chunked pipeline, and strong observability. Those four things are what let me ship, debug, and scale transcribe.so alone without it becoming a second full-time job.

If you're building something in this space and want a second set of eyes from someone who's shipped it solo, book a call.
