What makes sunmoon.dev different from an agency?

You work directly with me — a SaaS founder who builds and operates his own AI products (transcribe.so, goodlisten.co). There's no account manager, no junior bench, and no hand-off. The person who scopes your build is the person who ships it.

How long does it take to build an AI application?

It depends on complexity. A basic AI agent can ship in 4–6 weeks; more involved solutions with custom features take 3–6 months. You'll get a concrete timeline on the first call before any commitment.

Do you provide ongoing support after launch?

Yes. Every build includes 6 months of email support, and I offer a discounted on-call package for extended maintenance and feature work as you scale.

What technologies do you use?

Industry-standard, mostly open tooling to avoid vendor lock-in: Next.js and Node for the app layer, Postgres/Supabase for data, and the right AI models picked per task (OpenAI, Qwen, Mistral, and others). Exact stack is chosen around your requirements.

How do you handle data privacy and security?

Encryption in transit and at rest, scoped access, and compliance-aware design. Where it matters, solutions can be deployed on your own infrastructure so you keep full control of your data.

Transparent and upfront. Consulting starts at $200/hour; AI agent and full-stack SaaS builds start at $10,000, with final cost depending on scope. You get a detailed quote after the first call.

From Prototype to Production: Shipping LLM Features That Don't Break

The demo worked. You typed a prompt, the model returned something brilliant, the room nodded, and someone said "ship it." Then you tried to put it in front of real users, with real inputs, at real volume — and it started lying, timing out, costing more than your hosting bill, and failing silently in ways your tests never caught.

I've shipped a lot of LLM features into products I personally operate and get paged for. What separates a prototype from a production feature is rarely the model — it's everything around it. That surrounding work is what most teams skip, and it's why their AI features feel flaky.

Why the demo lies to you

A prototype is one happy-path call. Production is the long tail: the empty input, the 90-minute audio file, the prompt-injection attempt, the API outage at 2am, the user who pastes a novel into a text box sized for a sentence.

When I built the transcription and summarization pipeline behind transcribe.so, the model call was maybe 10% of the actual work. The other 90% was making the system behave predictably when the model — or the network, or the input — didn't.

A production LLM feature isn't a smart model with a prompt. It's a boring, well-instrumented system that happens to have a model in the middle.

Here's how I close that gap, in the order that matters.

1. Evals: stop testing by vibes

"I tried a few prompts and it looked good" is not a test plan. The single highest-leverage thing you can build is an eval set.

It doesn't need to be fancy. Start with:

20–50 real inputs that represent your actual traffic, including the ugly ones.
A graded expectation per input — exact match where possible, a rubric or an LLM-as-judge where it's fuzzy.
A one-command run that scores a prompt or model change against the whole set.

The first time you change a prompt to fix one case and your eval shows you broke three others, you'll understand why this is non-negotiable. On goodlisten.co, my eval set is what lets me swap models or rewrite a prompt without crossing my fingers — I get a regression number, not a hunch.

A useful rule: every production bug becomes a new eval case. That's how the suite compounds in value instead of going stale.

2. Guardrails: assume the input is hostile

Treat model output the way you'd treat user input — never trust it raw. Two directions to guard:

On the way in: validate and bound the input. Cap length, strip or escape control characters, and don't blindly concatenate user text into a system prompt. Prompt injection is real, and the fix is structural — keep untrusted content in clearly delimited data fields, not in instruction positions.

On the way out: never let raw model output flow straight into something that executes or persists. If you ask for JSON, validate it against a schema and reject what doesn't parse. If the output drives an action, gate it. I learned the discipline of treating every external response as untrusted at Klarna, where a malformed payment response counts as an incident, not a bug — the same instinct applies to model output.

3. Fallbacks: the model will fail, plan for it

Provider APIs go down. Requests time out. Rate limits hit at the worst moment. A production feature degrades gracefully instead of throwing a 500.

My fallback ladder, roughly:

Retry with backoff on transient errors — but with a tight timeout so you don't stack latency.
Failover to a second provider or model for the same task. Keeping prompts portable across at least two providers is cheap insurance.
Degrade to a simpler result — a smaller model, a cached answer, or a non-AI path — rather than a dead end.
Fail honestly when all else fails: a clear message and a retry option, never a spinner that never resolves.

In the transcribe.so pipeline, a single chunk failing doesn't sink the whole job — it retries, and if it still won't process, the job completes with that segment flagged rather than nuking an hour of someone's work.

4. Cost control: the bill is a feature

LLM costs scale with usage in a way that can quietly destroy your margins. I treat cost as a design constraint, same as latency.

Lever	What it does	When I reach for it
Right-sized model	Use the cheapest model that passes evals	Default for every task
Prompt caching	Reuse static context across calls	Long system prompts, RAG context
Output capping	Bound `max_tokens` to what you need	Always
Pre-filtering	Cheap heuristic before the expensive call	High-volume, low-signal inputs
Batching / async	Move non-urgent work off the hot path	Background enrichment

The biggest wins are usually the boring ones: don't send a frontier model where a small one passes your evals, and don't pay to re-process context you already sent. Prompt caching alone cut my per-job cost meaningfully once the pipeline matured.

5. Observability: you can't debug what you can't see

When a user says "the summary was wrong," you need to reconstruct exactly what happened. That means logging, for every model call:

The full resolved prompt (or a hash plus the variable parts).
Model, version, and parameters.
Token counts in and out, latency, and cost.
The raw response, before any parsing.
A trace ID that ties the call to the user request.

Coming from Spotify, where every backend service was instrumented to the hilt, going back to a black-box LLM call felt like flying blind. The fix is the same as any distributed system: structured logs, traces, and dashboards. The difference is you also want to sample and review actual outputs over time, because quality drifts in ways latency graphs won't show.

RAG-specific gotchas

If your feature retrieves context, half your "the model is dumb" problems are retrieval problems. Log what was retrieved, not just what was generated. The usual culprits:

The right chunk existed but ranked below the cutoff.
Chunking split the answer across two pieces, so neither was sufficient.
Stale embeddings after a content update.

If you can't see the retrieved context next to the answer, you'll spend days tuning prompts for a problem the prompt can't fix.

The order I actually ship in

For a new LLM feature, I don't build all of this on day one. I sequence it:

Eval set first — even 20 cases. It defines "working."
Happy path — get one good answer end to end.
Guardrails — validate input and output before real users touch it.
Observability — logging and traces before, not after, launch.
Fallbacks and cost control — harden once it's carrying load.

The teams that struggle do these in reverse: ship the happy path, then bolt on safety after the incidents. I've watched a Y Combinator–backed startup I worked with burn a launch window doing exactly that. The order is the lesson.

Frequently Asked Questions

How big does my eval set need to be to be useful?

Smaller than you think. Twenty to fifty representative cases will catch most regressions and is enough to compare prompts or models with confidence. The trick is that the cases must reflect real traffic — including the weird inputs — and every production bug should become a new case so the set grows with your understanding of the problem.

Do I really need a second LLM provider for fallbacks?

Not on day one, but you should keep your prompts portable so you can add one fast. A single-provider outage is a question of when, not if, and the difference between "we degraded gracefully" and "we were down" is often just having tested a second path. Even falling back to a smaller model from the same provider buys you resilience.

What's the cheapest way to cut LLM costs without hurting quality?

Right-size the model and cache aggressively. Most teams default to a frontier model for tasks a much cheaper one passes — your eval set tells you when that's safe. After that, prompt caching for repeated context and capping output tokens usually deliver the next biggest savings with zero quality cost.

Why does my RAG feature give wrong answers even with a good model?

Almost always retrieval, not generation. If the right context never makes it into the prompt — bad chunking, a low rank, stale embeddings — even the best model can't recover. Log the retrieved context alongside every answer, and you'll find most wrong answers trace back to retrieval bugs you can fix.

If you're staring at an LLM prototype that demos well but you don't trust in front of real users, that's exactly the gap I help teams close — book a call and we'll map out what production-ready looks like for your feature.

Have something that needs shipping?