sunmoon.dev

What Production AI Agents Actually Cost to Run (Not Just Build)

Seunghun Lee
Tags: AI agents, pricing, LLM, production

Your agent demo worked. The build estimate came in, the prototype impressed everyone, and someone signed off on shipping it. Then three months later the finance person asks why the "AI line item" is four times what the proposal said, and nobody has a good answer — because the proposal only priced the build, and an agent's real cost lives in the running.

I operate two AI products solo — transcribe.so, a multi-engine transcription service, and goodlisten.co, an AI podcast discovery and creator studio. Both have LLM-driven pipelines in production, both bill real customers, and both taught me the same lesson the hard way: build cost is a one-time number you can estimate; run cost is a recurring number you have to design for. Here's where the money goes, based on my own invoices.

The cost iceberg: what the proposal misses

Most agent proposals price the visible tenth: engineering time plus a back-of-napkin inference estimate ("roughly $0.01 per request times expected volume"). Below the waterline sits everything that makes the agent stay good:

  • Retries and fallbacks. Production LLM calls fail — rate limits, timeouts, malformed JSON, content filters. A 4–8% retry rate is normal on a healthy pipeline. Every retry is a full-price inference call that produces nothing new for the user.
  • Evals. If you change a prompt and don't run it against a regression set, you're shipping blind. My eval runs for the transcription enrichment pipeline cost real inference money every time I touch a prompt — and I touch prompts weekly.
  • Monitoring and observability. Traces, token accounting, latency percentiles per model per route. The tooling is either a SaaS subscription or your own time, and your time is the most expensive line on the sheet.
  • Model upgrades. Providers deprecate models on roughly 6–12 month cycles. Each migration is a mini-project: re-run evals, adjust prompts, fix the three behaviors that silently changed.
  • Prompt drift. The agent's inputs change shape over time — users get weirder, upstream data formats shift, edge cases accumulate. Prompts that scored 95% at launch quietly degrade. Catching that costs eval runs; fixing it costs engineering time.

When I was at Klarna and Spotify, infrastructure had whole teams dedicated to exactly this — keeping a system as good as the day it shipped. An AI agent needs the same kind of upkeep; most budgets pretend it doesn't exist.

How the bill breaks down

Exact figures depend on volume and models, so here's the shape of the math instead. For a moderate-volume agent — say 50,000 task executions a month, each involving a few LLM calls:

| Cost line | Typical share of monthly run cost | What drives it |
| --- | --- | --- |
| Primary inference | 45–60% | Tokens per task × volume × model price |
| Retries and fallbacks | 5–10% | Failure rate, timeout policy, validation strictness |
| Evals and regression runs | 5–15% | How often you change prompts/models; eval set size |
| Monitoring/observability | 5–10% | SaaS tooling or self-hosted + your time |
| Model migrations (amortized) | 10–15% | Provider deprecation cycles, re-tuning effort |
| Human review / spot checks | 5–15% | Quality bar, compliance needs |

The headline: primary inference is often barely half the bill. My rule after running transcribe.so through two model migrations and a provider pricing change: take your honest inference estimate and double it for year-one run cost. I've yet to regret that multiplier.
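The arithmetic behind that rule fits in a few lines. This is a minimal sketch, not my actual billing model: the token counts, the per-million-token prices, and the function names are all placeholders chosen for illustration.

```python
def monthly_inference_cost(
    tasks_per_month: int,
    calls_per_task: int,
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Primary inference only: tokens per task x volume x model price."""
    cost_per_call = (
        input_tokens_per_call * input_price_per_mtok
        + output_tokens_per_call * output_price_per_mtok
    ) / 1_000_000
    return tasks_per_month * calls_per_task * cost_per_call


def year_one_run_cost(monthly_inference: float, multiplier: float = 2.0) -> float:
    """Apply the 'double your honest inference estimate' rule across 12 months."""
    return monthly_inference * multiplier * 12


# Hypothetical agent: 50,000 tasks/month, 3 calls each, 2,000 input and
# 500 output tokens per call, at $1 and $4 per million tokens (made-up prices).
inference = monthly_inference_cost(50_000, 3, 2_000, 500, 1.00, 4.00)
print(f"inference estimate: ${inference:,.0f}/month")
print(f"year-one run budget: ${year_one_run_cost(inference):,.0f}")
```

With these placeholder inputs the inference estimate comes to about $600 a month, and the doubled year-one budget to about $14,400. The point is the shape of the calculation: the multiplier stands in for every line in the table that is not primary inference.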

The most expensive agent is not the one with the biggest model. It's the one nobody instrumented, running a frontier model on tasks a small model could do, retrying silently, with no one noticing until the invoice lands.

A concrete example from my own stack

On goodlisten.co, episode processing chains several LLM steps: chaptering, highlight extraction, description generation. My first version ran everything on a frontier model because that's what the prototype used. The output was great; the unit economics were not — long podcast episodes are token-heavy, and at frontier pricing each episode cost more to process than a user would plausibly generate in revenue for months.

The fix was routing, not cutting the AI. The chaptering pass moved to a smaller, cheaper model (it's a segmentation task, and small models do it fine), and the frontier model kept only the steps where quality is the product. Per-episode cost dropped roughly 70% with no measurable quality loss on my eval set. That one routing decision was worth more than any infrastructure optimization I did that quarter.
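To see why moving a single step can move the whole bill, here is a rough sketch of the per-episode math. The token counts and prices are invented for illustration and are not goodlisten.co's real figures. The assumption doing the work is that chaptering reads the full transcript and therefore dominates token volume.

```python
# Hypothetical per-million-token prices; real prices vary by provider and change often.
PRICE_PER_MTOK = {"frontier": 10.00, "small": 0.50}

# Hypothetical tokens processed per episode by each step.
STEP_TOKENS = {"chaptering": 60_000, "highlights": 20_000, "description": 5_000}


def episode_cost(routing: dict) -> float:
    """Sum the cost of every step, given a step -> model-tier mapping."""
    return sum(
        STEP_TOKENS[step] * PRICE_PER_MTOK[tier] / 1_000_000
        for step, tier in routing.items()
    )


all_frontier = {step: "frontier" for step in STEP_TOKENS}
routed = {**all_frontier, "chaptering": "small"}  # move only the segmentation step

before = episode_cost(all_frontier)
after = episode_cost(routed)
savings = 1 - after / before
print(f"before=${before:.2f} after=${after:.2f} savings={savings:.0%}")
```

With these made-up inputs, rerouting one step removes about two thirds of the per-episode cost, because that step processes most of the tokens. A step that touches only a summary would save far less, so measure token volume per step before deciding what to reroute.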

Designing for cost from day one

You can't bolt cost-efficiency onto an agent after launch without re-architecting. These are the decisions to make before you ship.

Right-size the model per step, not per product

Decompose the agent's work and ask of each step: what's the cheapest model that passes the eval? Classification, extraction, routing, and formatting steps almost never need a frontier model. Reserve the expensive model for synthesis and judgment, the steps users actually notice. In my pipelines the split usually lands around 80% of calls on small models and 20% on large, so most of the call volume is billed at small-model prices.
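The "cheapest model that passes the eval" rule can be written down directly. This is a sketch under assumptions: the model names, prices, and scores are hypothetical, and `eval_score` stands in for whatever actually runs your regression set against a model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Model:
    name: str              # hypothetical identifier, not a real model name
    price_per_mtok: float  # hypothetical blended price per million tokens


def cheapest_passing(
    models: Iterable[Model],
    eval_score: Callable[[Model], float],
    threshold: float,
) -> Optional[Model]:
    """Return the cheapest model whose eval score meets the threshold, else None."""
    for model in sorted(models, key=lambda m: m.price_per_mtok):
        if eval_score(model) >= threshold:
            return model
    return None


MODELS = [Model("large", 10.0), Model("small", 0.5), Model("medium", 3.0)]

# Stand-in scores; in practice each lookup would be a real eval run for that step.
SCORES = {
    "classification": {"small": 0.97, "medium": 0.98, "large": 0.98},
    "synthesis": {"small": 0.71, "medium": 0.88, "large": 0.96},
}

for step, scores in SCORES.items():
    choice = cheapest_passing(MODELS, lambda m: scores[m.name], threshold=0.95)
    print(step, "->", choice.name if choice else "no model passes")
```

With the stand-in scores, classification lands on the small model and synthesis on the large one. Walking the models in price order also means you can stop running evals as soon as one passes, which keeps the selection itself cheap.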

Cache aggressively, at two layers

  • Prompt caching. If your system prompt and tool definitions are long (for agents they usually are), provider-side prompt caching cuts the cost of the repeated prefix substantially. This is nearly free money: it just requires structuring prompts so the stable part comes first.
  • Result caching. Many agent tasks are idempotent. Same input document, same question, same answer. A content-hash cache in front of the pipeline means you never pay twice for identical work. On transcribe.so, deduplicating repeated processing of identical inputs was one of the highest-ROI changes I shipped.
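The result-caching layer described above can be sketched as a content-hash lookup in front of the pipeline. This is an illustration, not the transcribe.so implementation: the class name, the in-memory dict (a stand-in for a database or key-value store), and the choice to include a prompt version in the key are all assumptions.

```python
import hashlib
from typing import Callable, Dict


class ResultCache:
    """Content-addressed cache: identical input + task + prompt version is paid for once."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}  # stand-in for a persistent store
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(document: bytes, task: str, prompt_version: str) -> str:
        digest = hashlib.sha256()
        for part in (task.encode(), prompt_version.encode(), document):
            # Length-prefix each part so different splits cannot collide.
            digest.update(len(part).to_bytes(8, "big"))
            digest.update(part)
        return digest.hexdigest()

    def get_or_compute(
        self,
        document: bytes,
        task: str,
        prompt_version: str,
        compute: Callable[[bytes], str],
    ) -> str:
        cache_key = self.key(document, task, prompt_version)
        if cache_key in self._store:
            self.hits += 1
            return self._store[cache_key]
        self.misses += 1
        result = compute(document)  # the expensive LLM pipeline call
        self._store[cache_key] = result
        return result
```

Keying on the prompt version as well as the content means a prompt change invalidates old results instead of silently serving them. The hit and miss counters give you the number that justifies the cache on the next invoice.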

Make retries bounded and observable

Unbounded retry loops are how a $200 day becomes a $2,000 day. Every retry path needs a cap, an exponential backoff, and a metric. If you can't answer "what fraction of yesterday's spend was retries?", you don't have cost observability — you have a bill.
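A minimal sketch of a retry wrapper with all three properties: a cap, exponential backoff, and a metric. The names are hypothetical, the `Counter` stands in for a real metrics client, and the set of retryable exceptions is an assumption you would replace with your provider SDK's rate-limit and timeout errors.

```python
import random
import time
from collections import Counter
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")
metrics: Counter = Counter()  # stand-in for a real metrics client


class RetriesExhausted(Exception):
    """Raised when a call fails on every allowed attempt."""


def call_with_retries(
    fn: Callable[[], T],
    *,
    route: str,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    retryable: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        metrics[f"{route}.attempts"] += 1
        if attempt > 1:
            metrics[f"{route}.retries"] += 1
        try:
            return fn()
        except retryable as exc:
            if attempt == max_attempts:
                metrics[f"{route}.exhausted"] += 1
                raise RetriesExhausted(f"{route}: gave up after {attempt} attempts") from exc
            # Exponential backoff (0.5s, 1s, 2s, ...) capped at max_delay, with jitter.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))


def retry_fraction(route: str) -> float:
    """Share of attempts on this route that were retries."""
    attempts = metrics[f"{route}.attempts"]
    return metrics[f"{route}.retries"] / attempts if attempts else 0.0
```

Counting attempts is only a proxy for spend. Weighting each attempt by its token usage would get closer to "what fraction of yesterday's spend was retries?", but even the unweighted ratio per route is enough to notice a retry storm before the invoice does.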

Budget evals as a fixed line item, not an afterthought

Build the regression set during development, when you have the failure cases fresh. Then accept that every prompt change costs an eval run, and price that into how often you change prompts. A 200-case eval set on a mid-tier model costs single-digit dollars per run — trivially cheap insurance against shipping a regression to paying customers.
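As a rough check on the "single-digit dollars" claim, and a sketch of what a regression gate might look like: the token counts, prices, case format, and function names below are all assumptions for illustration.

```python
from typing import Callable, Dict, List


def eval_run_cost(
    n_cases: int,
    input_tokens: int,
    output_tokens: int,
    in_price_per_mtok: float,
    out_price_per_mtok: float,
) -> float:
    """Inference cost of running the whole eval set once."""
    per_case = input_tokens * in_price_per_mtok + output_tokens * out_price_per_mtok
    return n_cases * per_case / 1_000_000


def regression_gate(
    cases: List[Dict],
    predict: Callable[[str], str],
    baseline_pass_rate: float,
    tolerance: float = 0.01,
) -> Dict:
    """Run every case and flag a regression if the pass rate drops below baseline."""
    passed = sum(1 for case in cases if case["check"](predict(case["input"])))
    pass_rate = passed / len(cases)
    return {"pass_rate": pass_rate, "regressed": pass_rate < baseline_pass_rate - tolerance}


# Hypothetical: 200 cases, 3,000 input and 800 output tokens each,
# at $3 and $15 per million tokens (made-up mid-tier prices).
run_cost = eval_run_cost(200, 3_000, 800, 3.00, 15.00)
print(f"one eval run: ${run_cost:.2f}, weekly for a year: ${run_cost * 52:.2f}")

# Toy cases; `predict` would normally call the prompt and model under test.
cases = [
    {"input": "a", "check": lambda out: out == "A"},
    {"input": "b", "check": lambda out: out == "B"},
]
print(regression_gate(cases, str.upper, baseline_pass_rate=1.0))
```

With these placeholder numbers one run costs about $4.20, and running it weekly for a year costs a little over $200. That is the fixed line item: small, predictable, and easy to put in a budget.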

Plan the migration before you need it

Pin model versions explicitly. Keep prompts in version control with the eval scores they achieved. When the deprecation email arrives — and it will — a migration becomes a re-run of your eval suite against the new model rather than an archaeology project. I've lived through platform migrations at Spotify scale and at solo-founder scale; the discipline is identical, only the headcount differs.
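One way to make that discipline concrete is to record, for every step, the pinned model, the prompt revision, and the eval score that pairing achieved. This is a sketch, not my actual setup; the record layout, the model identifiers, and the commit hashes are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PromptRelease:
    step: str
    model: str             # pinned snapshot id, never a floating "latest" alias
    prompt_commit: str     # version-control revision of the prompt file
    eval_pass_rate: float  # score this exact prompt + model pair achieved


# All identifiers below are hypothetical.
RELEASES = [
    PromptRelease("chaptering", "small-model-2024-06-01", "a1b2c3d", 0.96),
    PromptRelease("highlights", "frontier-model-2024-05-15", "d4e5f6a", 0.93),
]


def migration_report(
    releases: List[PromptRelease],
    candidate_scores: Dict[str, float],
    tolerance: float = 0.01,
) -> List[str]:
    """List steps whose score on the candidate model falls below the recorded baseline."""
    regressions = []
    for release in releases:
        new_score = candidate_scores[release.step]
        if new_score < release.eval_pass_rate - tolerance:
            regressions.append(
                f"{release.step}: {release.eval_pass_rate:.2f} -> {new_score:.2f}"
            )
    return regressions


# Scores from re-running the eval suite against the replacement model (made up).
print(migration_report(RELEASES, {"chaptering": 0.96, "highlights": 0.88}))
# prints ['highlights: 0.93 -> 0.88']
```

When the deprecation email arrives, the report narrows the migration to the steps that actually regressed, which is usually a much shorter list than "everything".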

What this means if you're buying, not building

If you're hiring an agency or contractor to build an agent, the build quote is the smaller number. Ask the questions the proposal probably doesn't answer:

  • What's the projected monthly inference cost at my expected volume — including retries?
  • Which steps run on which models, and why?
  • What does the eval suite look like, and who runs it after handoff?
  • What happens when the underlying model is deprecated?

A builder who can't answer those is pricing a demo, not a product. Back when I did partner consulting work, I saw these handoffs from both sides — the painful ones all shared the same root cause: nobody owned the run cost.

Frequently Asked Questions

How much should I budget for running an AI agent versus building it?

As a rough planning rule, expect year-one run cost to land between 50% and 150% of the build cost for a moderate-volume agent, and expect a similar bill every year after that. The wide range reflects volume and model choice. Whatever your inference estimate is, doubling it gets you much closer to reality than adding a 10% buffer.

Can't I just use a cheaper model everywhere to cut costs?

Not everywhere — but almost certainly in more places than you currently do. The right move is per-step routing: run each pipeline step on the cheapest model that passes your eval set. In my experience the majority of agent steps (classification, extraction, formatting) pass on small models, and the savings routinely reach 50–70% with no quality loss users can detect.

Do I really need evals for a simple agent?

Yes, and the simpler the agent, the cheaper the eval suite — so there's no excuse. Without evals, every prompt tweak and every model upgrade is an uncontrolled experiment on your customers. A small regression set built from real failure cases costs a few dollars per run and is the only thing standing between you and silent quality drift.

What's the most common cost surprise in production agents?

Retries and silent reprocessing: failed calls that retry at full price, duplicate work on identical inputs, and unbounded loops in tool-use chains. None of these show up in a prototype because prototypes run a handful of times under supervision. Bounded retries, result caching, and per-route spend metrics catch all three.

If you're planning an agent and want the run-cost math done honestly before you commit a budget — or you have one in production whose bill stopped making sense — book a call and I'll walk through it with you.
