sunmoon.dev

Shipping an AI MVP in 6 Weeks: The Plan I Actually Use

Seunghun Lee
Tags: AI MVP, process, founders, AI agents

You have an AI product idea, a rough budget, and a window that's closing. Every week you spend "exploring the space" is a week a competitor spends shipping. And the standard advice — "just build an MVP" — doesn't tell you what to do on Monday morning.

I've shipped two AI products of my own, transcribe.so and goodlisten.co, plus client builds on top of that. The six-week plan below is the schedule I actually run, scar tissue included. It maps to the Scope → Build → Ship → Support process I use for client work, but here I'll give you the week-by-week version so you can run it yourself.

One framing rule before the calendar: an AI MVP is a workflow with a model inside it, not a model with a UI around it. Most six-week plans die in week one because the team scopes around the model ("we'll fine-tune X") instead of the workflow ("a podcast host pastes a URL and gets quotable highlights"). The workflow framing makes every later decision easier.

Week 1: Scope to one workflow

The single highest-leverage decision is what you don't build. In week one I force the product down to exactly one workflow with a named user, a clear input, and a clear output.

For transcribe.so, the first version was: a user uploads an audio file, gets an accurate transcript with speaker labels, and can ask questions against it. Not "an AI platform for audio intelligence." One upload, one transcript, one Q&A box.

What I produce by Friday of week one:

  • One sentence that names the user, the input, the output, and the painful alternative they're escaping. If any of the four is vague, the scoping isn't done.
  • A written cut list. Everything I'm explicitly not building: accounts beyond basic auth, billing, teams, admin dashboards, the second model, the second file format. The cut list matters more than the spec.
  • The model risk, named. Every AI product has one capability that decides whether it works at all — for transcribe.so it was transcription accuracy on real-world audio; for goodlisten.co it was whether an LLM could surface genuinely interesting podcast moments rather than generic summaries. Week one's job is to name that risk so week two can attack it first.

If you can't say what the user pastes in and what they get back, you don't have an MVP scope. You have a theme.

I also pick the boring stack in week one and never revisit it during the build. For me that's Next.js, Postgres, a queue, and hosted model APIs. My time at Spotify and Klarna taught me what infrastructure looks like at scale — and the honest lesson is that almost none of it belongs in an MVP. You earn complexity; you don't start with it.

Week 2: Build the thin slice — model risk first

Week two is one path through the system, end to end, ugly. Input goes in the front, the model does its job in the middle, output comes out the other side. No settings page, no polished retry logic, no design system.

The order matters: I build the riskiest model step first, with real inputs, before any surrounding product. For transcribe.so that meant running actual messy audio — accented speech, crosstalk, bad microphones — through candidate ASR models in the first days of week two, not week five. If the model can't do the core job, I want to know while there's still time to change approach: different model, different prompt architecture, or a narrower promise.

What "thin slice" means concretely:

  • One happy path, clickable in a browser, deployed to a real URL by Friday.
  • Real model calls, not mocked responses. Mocked LLM output is how teams discover in week five that the product doesn't work.
  • Hardcode what you can: one language, one file size limit, one user.

The discipline here is refusing to widen. Every "while we're at it" in week two costs you a day in week five.

Week 3: Real data, real inputs, no demo theater

This is the week most AI MVPs quietly fail, because teams keep testing on the three clean examples they wrote themselves. Curated demo inputs are a lie you tell yourself.

In week three I collect 30–100 real inputs — podcast episodes, customer documents, recordings with background noise — and run all of them through the slice. With goodlisten.co, generic prompts on clean test episodes produced highlights that looked fine and read like nothing. Only real episodes, in volume, exposed that the prompt needed to hunt for disagreement and specificity, not "key takeaways."

Two things come out of this week:

  1. A failure inventory. Every input that produced a bad output, categorized: model limitation, prompt issue, missing preprocessing, or out of scope. Out-of-scope failures go on the cut list with pride.
  2. A frozen golden set. Twenty to fifty representative inputs with what "good" looks like for each. This becomes the eval suite in week four. Building it now, while you're staring at real failures, takes hours. Building it later from memory takes days and captures the wrong cases.
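Both artifacts fit in a few lines of code. The following is a minimal sketch in Python, not the structure used on transcribe.so or goodlisten.co. The four categories come from the list above; the class names, field names, and the JSON Lines format are assumptions made for illustration.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass
from enum import Enum


class FailureCategory(str, Enum):
    # The four buckets named in the failure inventory above.
    MODEL_LIMITATION = "model_limitation"
    PROMPT_ISSUE = "prompt_issue"
    MISSING_PREPROCESSING = "missing_preprocessing"
    OUT_OF_SCOPE = "out_of_scope"


@dataclass
class FailureRecord:
    input_id: str              # hypothetical identifier for one real input
    category: FailureCategory
    note: str                  # one line on what went wrong


@dataclass
class GoldenCase:
    input_id: str
    expected: str              # short description of what "good" looks like


def summarize(failures):
    """Count failures per category so the largest bucket is obvious."""
    return Counter(f.category.value for f in failures)


def to_jsonl(cases):
    """Serialize the golden set as JSON Lines, one case per line."""
    return "\n".join(json.dumps(asdict(case)) for case in cases)


# Usage with made-up records:
failures = [
    FailureRecord("ep-01", FailureCategory.PROMPT_ISSUE, "generic takeaways"),
    FailureRecord("ep-02", FailureCategory.PROMPT_ISSUE, "missed the disagreement"),
    FailureRecord("ep-03", FailureCategory.OUT_OF_SCOPE, "video-only upload"),
]
print(summarize(failures).most_common())
print(to_jsonl([GoldenCase("ep-01", "surfaces the host/guest disagreement")]))
```

Writing the golden set to a plain file and committing it is one simple way to keep it frozen: any change to a case then shows up in version control.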

Week 4: Evals before polish

Here's the trap: by week four the product roughly works, and every instinct says to make it pretty. Resist. Polish on top of an unmeasured model is paint on a house nobody inspected.

An eval doesn't need a framework. Mine are usually a script that runs the golden set through the pipeline and scores the outputs: exact checks where possible (did the transcript preserve these known phrases? did the output parse as valid JSON?), an LLM-as-judge with a tight rubric where it's fuzzier, and a manual review pass that spot-checks the judge itself. It takes about a day to build, and it changes everything downstream:

Without evals                       | With evals
------------------------------------+----------------------------------------------
"The new prompt feels better"       | "The new prompt: 71% → 84% on the golden set"
Every model update is a gamble      | Model updates are a re-run and a diff
Regressions found by users          | Regressions found in CI
Prompt debates settled by seniority | Prompt debates settled by numbers
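As a rough illustration of the exact-check half of such a script, here is a minimal Python sketch. It is not the script I run: the case fields (`id`, `input`, `must_contain`, `expect_json`) are hypothetical, and the lambda at the bottom only stands in for the real pipeline so the example runs on its own. The LLM-as-judge step is left out.

```python
import json


def contains_phrases(output, phrases):
    """Exact check: every known phrase appears in the output (case-insensitive)."""
    lowered = output.lower()
    return all(phrase.lower() in lowered for phrase in phrases)


def is_valid_json(output):
    """Exact check: the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except (json.JSONDecodeError, TypeError):
        return False


def check_case(case, output):
    """Apply whichever exact checks the golden case asks for."""
    if case.get("expect_json") and not is_valid_json(output):
        return False
    return contains_phrases(output, case.get("must_contain", []))


def run_evals(golden_set, pipeline):
    """Run each golden case through the pipeline; report per-case results and pass rate."""
    if not golden_set:
        raise ValueError("golden set is empty")
    results = {case["id"]: check_case(case, pipeline(case["input"])) for case in golden_set}
    passed = sum(results.values())
    return {"pass_rate": passed / len(golden_set), "results": results}


# Usage: in a real run, `pipeline` would call the deployed slice, not a stub.
golden = [
    {"id": "a", "input": "clip-1", "must_contain": ["hello world"]},
    {"id": "b", "input": "clip-2", "must_contain": ["missing phrase"]},
]
report = run_evals(golden, lambda text: "Hello World, transcript of " + text)
print(f"{report['pass_rate']:.0%} passed", report["results"])
```

Keeping per-case results, not just the aggregate percentage, is what makes it possible to see which cases changed between two runs.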

On transcribe.so, evals are what let me run multiple ASR engines and route between them with confidence — I can verify engines against each other instead of guessing. That capability only exists because the measurement came before the polish.

Second half of week four: now polish, in strict order — error states first (what the user sees when the model fails or times out, because it will), then loading and progress for long-running jobs, then visuals. An AI MVP that handles failure gracefully feels more trustworthy than a beautiful one that hangs.
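To make "error states first" concrete, here is one possible shape for it in Python: wrap the model call so that a timeout or an exception becomes an explicit state the UI can render. This is a sketch under assumptions, not code from either product; the function name, the status strings, and the messages are all invented for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_error_state(model_fn, model_input, timeout_s):
    """Call the model and always return a state the UI can show."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(model_fn, model_input)
    try:
        return {"status": "ok", "output": future.result(timeout=timeout_s)}
    except FutureTimeout:
        # The worker thread keeps running in the background; a real
        # integration would also set a timeout on the API client itself.
        return {"status": "timeout", "message": "This is taking longer than expected. Please try again."}
    except Exception:
        return {"status": "error", "message": "Something went wrong on our side. Please try again."}
    finally:
        executor.shutdown(wait=False)  # do not block the caller on a slow job


# Usage with stand-in functions instead of a real model call:
print(call_with_error_state(lambda s: s.upper(), "hi", timeout_s=1.0))
print(call_with_error_state(lambda s: time.sleep(0.3) or s, "hi", timeout_s=0.05))
```

The point of the design is that the caller never sees a raw exception or an indefinite hang: there are exactly three states, and each one has something to display.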

Week 5: Ship to 10 real users

Not a waitlist. Not a launch tweet. Ten actual humans with the problem, using the product on their own data while you watch.

I recruit them directly — DMs, communities where the pain lives, one or two warm intros. At a Y Combinator–backed startup I worked with, I saw up close how much signal a handful of motivated early users generates compared to a thousand anonymous signups: ten users who care will find every sharp edge in 48 hours.

The week-five checklist:

  • A real domain, auth, and just enough billing scaffolding to test willingness to pay — even a "request access" gate teaches you something.
  • Basic analytics and logging on the model pipeline: every input, output, latency, and failure, queryable. When a user says "it gave me something weird yesterday," you need to find that exact run.
  • A direct line to you — email or a shared channel. No support portal. You are support in week five.
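For the logging item, "queryable" can be as simple as one table. The sketch below uses Python's built-in sqlite3 module to keep it self-contained; the stack described earlier uses Postgres, and the table name, columns, and function names here are assumptions for illustration only.

```python
import sqlite3
import time


def open_run_log(path=":memory:"):
    """Open (or create) the run log. One row per model call."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS runs (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               user_id TEXT NOT NULL,
               started_at REAL NOT NULL,   -- Unix timestamp
               latency_ms REAL NOT NULL,
               input TEXT NOT NULL,
               output TEXT,                -- NULL when the call failed
               error TEXT                  -- NULL when the call succeeded
           )"""
    )
    return conn


def logged_call(conn, user_id, model_input, model_fn):
    """Run model_fn and record input, output, latency, and any failure."""
    started_at = time.time()
    t0 = time.perf_counter()
    output, error = None, None
    try:
        output = model_fn(model_input)
        return output
    except Exception as exc:
        error = repr(exc)
        raise  # the caller still sees the failure; the log keeps a copy
    finally:
        latency_ms = (time.perf_counter() - t0) * 1000
        conn.execute(
            "INSERT INTO runs (user_id, started_at, latency_ms, input, output, error) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (user_id, started_at, latency_ms, model_input, output, error),
        )
        conn.commit()


def runs_for_user(conn, user_id):
    """Find a user's runs in order, e.g. to trace 'something weird yesterday'."""
    return conn.execute(
        "SELECT input, output, error, latency_ms FROM runs WHERE user_id = ? ORDER BY id",
        (user_id,),
    ).fetchall()


# Usage with a stand-in for the model call:
conn = open_run_log()
logged_call(conn, "user-1", "hello", str.upper)
print(runs_for_user(conn, "user-1"))
```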

Then the only metric that matters: do they come back without being prompted? Praise is noise. A second session on day three is signal.
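One way to read that metric off session logs is sketched below in Python. It assumes sessions are available as (user, start time) pairs and defines "came back" as activity on at least two distinct calendar days; both are my assumptions for the example. Whether a return was unprompted is not something the logs can show, so that part stays a judgment call.

```python
from collections import defaultdict
from datetime import datetime


def returning_users(sessions):
    """Return the users who were active on two or more distinct calendar days."""
    days_by_user = defaultdict(set)
    for user_id, started_at in sessions:
        days_by_user[user_id].add(started_at.date())
    return {user for user, days in days_by_user.items() if len(days) >= 2}


# Usage with made-up sessions:
sessions = [
    ("ana", datetime(2024, 1, 1, 9, 0)),
    ("ana", datetime(2024, 1, 3, 20, 15)),  # second session on day three
    ("bo", datetime(2024, 1, 1, 9, 30)),
    ("bo", datetime(2024, 1, 1, 18, 0)),    # same day, so not counted as a return
]
print(returning_users(sessions))
```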

Week 6: Iterate on signal, then decide

Week six is triage. With ten users you'll have a pile of feedback, and most of it is a trap — each user pulls toward their own edge case. I sort everything into three buckets:

  • Blocks the core workflow → fix this week.
  • Expands the workflow → backlog, post-MVP.
  • A different product wearing a costume → cut list. Politely.

The eval suite earns its keep here: I can fix week-six issues quickly because every change gets re-run against the golden set. Velocity in week six is a direct payout from discipline in week four.
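The re-run only pays off if the before and after results are compared case by case. A minimal Python sketch of that comparison follows, assuming each run is stored as a mapping from golden-case id to pass/fail; that storage format and the function name are hypothetical.

```python
def diff_runs(before, after):
    """Compare two eval runs; each maps golden-case id -> True (pass) / False (fail).

    A case missing from `after` counts as a regression, on the assumption
    that the golden set is frozen and every case should be re-run.
    """
    regressions = sorted(cid for cid, ok in before.items() if ok and not after.get(cid, False))
    fixes = sorted(cid for cid, ok in after.items() if ok and not before.get(cid, False))
    return regressions, fixes


# Usage with made-up results:
before = {"a": True, "b": True, "c": False}
after = {"a": True, "b": False, "c": True}
regressions, fixes = diff_runs(before, after)
print("regressions:", regressions, "fixes:", fixes)
```

A change that fixes one case and breaks another can leave the overall pass rate unchanged, which is why the diff looks at individual cases and not only at the percentage.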

End the week with an honest written verdict: which users returned, what they did, what broke, and whether the core promise held. Sometimes the verdict is "this works, pour fuel on it." Sometimes it's "the workflow is wrong but the capability is real, re-scope." Both beat six more months of building in the dark.

Why six weeks and not twelve

It's not arbitrary. Six weeks is long enough to build something real and short enough to forbid speculative architecture. The constraint does the prioritization for you: no microservices for zero users, no fine-tuning before prompting is exhausted, no admin panel before there's anything to administer.

The whole plan compresses into one sentence: scope to one workflow, attack model risk first, feed it real data early, measure before you polish, ship to ten humans, and let their behavior — not your roadmap — pick what happens next.

Frequently Asked Questions

Can I really ship an AI MVP in six weeks with one engineer?

Yes, if the scope honors week one. Both transcribe.so and goodlisten.co started as solo builds on boring infrastructure with hosted model APIs. The timeline fails when scope creeps — multiple workflows, custom training, premature platform architecture — not because one senior engineer is too few.

Should I fine-tune a model for my MVP?

Almost never in the first six weeks. Prompting plus retrieval over your own data covers most MVP use cases, and fine-tuning locks in cost and iteration drag before you've validated the workflow. Fine-tune after your evals prove prompting has plateaued on cases users actually hit.

What if my golden-set evals look bad in week four?

That's the system working — you found out in week four instead of in front of customers. Diagnose whether failures are prompt-level, model-choice-level, or scope-level, and fix in that order. If the model fundamentally can't do the job, narrow the promise; a smaller workflow that works beats a broad one that embarrasses you.

Do you build MVPs like this for clients?

Yes. This week-by-week plan is the engine behind the Scope → Build → Ship → Support process on the sunmoon.dev landing page. I take on a small number of builds at a time so each one gets the same attention my own products get — having operated transcribe.so and goodlisten.co in production, I've already paid for the mistakes so you don't have to.

If you've got an AI product that needs to exist in six weeks rather than six months, book a call and we'll scope week one together.
