sunmoon.dev
All writing

From Prototype to Production: Shipping LLM Features That Don't Break

Seunghun Lee
LLMAI engineeringproductionRAG

The demo worked. You typed a prompt, the model returned something brilliant, the room nodded, and someone said "ship it." Then you tried to put it in front of real users, with real inputs, at real volume — and it started lying, timing out, costing more than your hosting bill, and failing silently in ways your tests never caught.

I've shipped a lot of LLM features into products I personally operate and get paged for. The gap between a prototype and a production feature isn't model quality. It's everything around the model. This is the engineering work most teams skip, and it's exactly why their AI features feel flaky while a few feel solid.

Why the demo lies to you

A prototype is one happy-path call. Production is the long tail: the empty input, the 90-minute audio file, the prompt-injection attempt, the API outage at 2am, the user who pastes a novel into a text box sized for a sentence.

When I built the transcription and summarization pipeline behind transcribe.so, the model call was maybe 10% of the actual work. The other 90% was making the system behave predictably when the model — or the network, or the input — didn't.

A production LLM feature isn't a smart model with a prompt. It's a boring, well-instrumented system that happens to have a model in the middle.

Here's how I think about closing that gap, in the order that actually matters.

1. Evals: stop testing by vibes

You cannot improve what you can't measure, and "I tried a few prompts and it looked good" is not measurement. The single highest-leverage thing you can build is an eval set.

It doesn't need to be fancy. Start with:

  • 20–50 real inputs that represent your actual traffic, including the ugly ones.
  • A graded expectation per input — exact match where possible, a rubric or an LLM-as-judge where it's fuzzy.
  • A one-command run that scores a prompt or model change against the whole set.

The first time you change a prompt to fix one case and your eval shows you broke three others, you'll understand why this is non-negotiable. On goodlisten.co, my eval set is what lets me swap models or rewrite a prompt without crossing my fingers — I get a regression number, not a hunch.

A useful rule: every production bug becomes a new eval case. That's how the suite compounds in value instead of going stale.

2. Guardrails: assume the input is hostile

Treat model output the way you'd treat user input — never trust it raw. Two directions to guard:

On the way in: validate and bound the input. Cap length, strip or escape control characters, and don't blindly concatenate user text into a system prompt. Prompt injection is real, and the fix is structural — keep untrusted content in clearly delimited data fields, not in instruction positions.

On the way out: never let raw model output flow straight into something that executes or persists. If you ask for JSON, validate it against a schema and reject what doesn't parse. If the output drives an action, gate it. I learned the discipline of treating every external response as untrusted at Klarna, where a malformed payment response isn't a bug, it's an incident — the same instinct applies to model output.

3. Fallbacks: the model will fail, plan for it

Provider APIs go down. Requests time out. Rate limits hit at the worst moment. A production feature degrades gracefully instead of throwing a 500.

My fallback ladder, roughly:

  1. Retry with backoff on transient errors — but with a tight timeout so you don't stack latency.
  2. Failover to a second provider or model for the same task. Keeping prompts portable across at least two providers is cheap insurance.
  3. Degrade to a simpler result — a smaller model, a cached answer, or a non-AI path — rather than a dead end.
  4. Fail honestly when all else fails: a clear message and a retry option, never a spinner that never resolves.

In the transcribe.so pipeline, a single chunk failing doesn't sink the whole job — it retries, and if it still won't process, the job completes with that segment flagged rather than nuking an hour of someone's work.

4. Cost control: the bill is a feature

LLM costs scale with usage in a way that can quietly destroy your margins. I treat cost as a first-class design constraint, not an afterthought.

Lever What it does When I reach for it
Right-sized model Use the cheapest model that passes evals Default for every task
Prompt caching Reuse static context across calls Long system prompts, RAG context
Output capping Bound max_tokens to what you need Always
Pre-filtering Cheap heuristic before the expensive call High-volume, low-signal inputs
Batching / async Move non-urgent work off the hot path Background enrichment

The biggest wins are usually the boring ones: don't send a frontier model where a small one passes your evals, and don't pay to re-process context you already sent. Prompt caching alone cut my per-job cost meaningfully once the pipeline matured.

5. Observability: you can't debug what you can't see

When a user says "the summary was wrong," you need to reconstruct exactly what happened. That means logging, for every model call:

  • The full resolved prompt (or a hash plus the variable parts).
  • Model, version, and parameters.
  • Token counts in and out, latency, and cost.
  • The raw response, before any parsing.
  • A trace ID that ties the call to the user request.

Coming from Spotify, where every backend service was instrumented to the hilt, going back to a black-box LLM call felt like flying blind. The fix is the same as any distributed system: structured logs, traces, and dashboards. The difference is you also want to sample and review actual outputs over time, because quality drifts in ways latency graphs won't show.

RAG-specific gotchas

If your feature retrieves context, half your "the model is dumb" problems are actually retrieval problems. Log what was retrieved, not just what was generated. The usual culprits:

  • The right chunk existed but ranked below the cutoff.
  • Chunking split the answer across two pieces, so neither was sufficient.
  • Stale embeddings after a content update.

You'll waste days blaming the model for what is really a retrieval bug, unless you can see the retrieved context next to the answer.

The order I actually ship in

For a new LLM feature, I don't build all of this on day one. I sequence it:

  1. Eval set first — even 20 cases. It defines "working."
  2. Happy path — get one good answer end to end.
  3. Guardrails — validate input and output before real users touch it.
  4. Observability — logging and traces before, not after, launch.
  5. Fallbacks and cost control — harden once it's carrying load.

The teams that struggle do these in reverse: ship the happy path, then bolt on safety after the incidents. I've watched a Y Combinator–backed startup I worked with burn a launch window doing exactly that. The order is the lesson.

Frequently Asked Questions

How big does my eval set need to be to be useful?

Smaller than you think. Twenty to fifty representative cases will catch most regressions and is enough to compare prompts or models with confidence. The trick is that the cases must reflect real traffic — including the weird inputs — and every production bug should become a new case so the set grows with your understanding of the problem.

Do I really need a second LLM provider for fallbacks?

Not on day one, but you should keep your prompts portable so you can add one fast. A single-provider outage is a question of when, not if, and the difference between "we degraded gracefully" and "we were down" is often just having tested a second path. Even falling back to a smaller model from the same provider buys you resilience.

What's the cheapest way to cut LLM costs without hurting quality?

Right-size the model and cache aggressively. Most teams default to a frontier model for tasks a much cheaper one passes — your eval set tells you when that's safe. After that, prompt caching for repeated context and capping output tokens usually deliver the next biggest savings with zero quality cost.

Why does my RAG feature give wrong answers even with a good model?

Almost always retrieval, not generation. If the right context never makes it into the prompt — bad chunking, a low rank, stale embeddings — even the best model can't recover. Log the retrieved context alongside every answer, and you'll find most "dumb model" complaints are actually retrieval bugs you can fix.


If you're staring at an LLM prototype that demos well but you don't trust in front of real users, that's exactly the gap I help teams close — book a call and we'll map out what production-ready looks like for your feature.

Have something that needs shipping?

I'm Seunghun Lee — I design, build, and ship production AI agents and full-stack SaaS. Tell me what you're building.