How to Vet a Freelance AI Engineer (Questions That Actually Filter)
You posted a job for an AI engineer and got forty applications. Every one of them says "LLM expert." Every one of them has a portfolio full of chatbots and a GitHub with three forked repos. You're about to pay someone $80–$200 an hour to build something your business will depend on, and you have no reliable way to tell who can actually do it.
Since 2023, the cost of producing an impressive AI demo has dropped to roughly zero. Anyone can wire GPT to a chat window in an afternoon. The cost of operating an AI feature in production — where real users do unreasonable things, models return garbage, and the bill arrives monthly — has not dropped at all. The entire vetting problem reduces to one question: is this person a builder or a demo-maker?
I sit on both sides of this. I hire contractors for my own products, and I'm the freelance AI engineer founders try to vet. After years as a senior engineer at Spotify and Klarna, time with a Y Combinator–backed startup, and a stretch as a Supabase Expert Partner, I now run my own AI products full-time. The questions below are the ones I'd want a client to ask me — because they're the ones demo-makers can't fake.
Why portfolios don't work anymore
A portfolio shows you that something existed for the duration of a screen recording. It doesn't show you:
- Whether the thing handled a malformed input without crashing
- Whether it still worked the week after the model provider shipped a silent update
- Whether anyone other than the builder ever used it
- What it cost to run at any volume above "me, testing it"
Demo-makers optimize for the recording. Builders optimize for month three. The questions that filter are the ones about month three.
The four questions that actually filter
1. "Show me something you operate in production"
Not "built." Operate. Present tense. Something with real users, a real bill, and a real on-call story.
This single question eliminates most of the field. A demo-maker will show you a Loom of a prototype. A builder will show you a live URL, tell you the monthly active user count (even if small), and, as the real tell, volunteer the ugly parts unprompted: "the ASR provider rate-limits us during US business hours, so we built a queue," or "we had to add a fallback because the primary model went down twice in March."
When someone asks me this, I point at transcribe.so, my multi-model transcription product, and goodlisten.co, my AI podcast discovery and creator studio. Both are live, both have the kinds of problems that real paying traffic creates, and both have taught me things no client project ever did: when it's your own infra bill and your own churn, you can't hand the problem back.
Follow-ups that deepen the signal:
- "What broke last month?" (Builders have an immediate answer. Demo-makers say "nothing.")
- "What does it cost to run?" (If they don't know their own unit economics, they won't know yours.)
- "How many users?" (Small honest numbers beat vague big ones.)
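The "what does it cost to run?" follow-up has a concrete shape. The sketch below is illustrative only: the `Pricing` class, the function names, and every number in it are assumptions, not any provider's actual prices. It shows the arithmetic a builder should be able to do from memory: tokens in, tokens out, price per million tokens, multiplied by volume.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per one million tokens. Values passed in are placeholders."""
    input_per_million: float
    output_per_million: float


def request_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost in USD of a single model call."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


def monthly_cost(users: int, requests_per_user: int,
                 avg_input_tokens: int, avg_output_tokens: int,
                 pricing: Pricing) -> float:
    """Projected monthly model bill for a given usage profile."""
    per_request = request_cost(avg_input_tokens, avg_output_tokens, pricing)
    return users * requests_per_user * per_request


if __name__ == "__main__":
    # Hypothetical prices and usage, purely for illustration.
    example = Pricing(input_per_million=1.0, output_per_million=2.0)
    print(monthly_cost(users=100, requests_per_user=10,
                       avg_input_tokens=1000, avg_output_tokens=500,
                       pricing=example))
```

A candidate who operates something real can usually fill in these inputs for their own product without looking anything up.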
2. "How do you evaluate whether the AI is actually working?"
This is the most technical question on the list and the one that most cleanly separates 2023-era prompt tinkerers from engineers.
A weak answer: "I test it with a bunch of prompts and check the outputs look good." That's vibes. Vibes don't survive a model version bump.
A strong answer mentions some concrete subset of:
- A golden dataset — a fixed set of inputs with known-good outputs, run on every change
- Regression evals — so a prompt tweak that improves case A can't silently break case B
- Task-specific metrics — word error rate for transcription, citation accuracy for RAG, exact-match for extraction — not just "the LLM judge liked it"
- Eval-before-upgrade discipline — new model versions get run against the eval suite before they touch production
For transcribe.so I keep a golden audio corpus that runs against the pipeline after any change to chunking, aggregation, or prompts. It has caught regressions that looked fine in spot checks. Anyone who has operated an AI product for more than a few months has independently invented some version of this — and anyone who hasn't will look at you blankly.
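To make the golden-dataset idea concrete, here is a minimal sketch of a regression check built on word error rate. It is not the harness behind transcribe.so. `GoldenCase`, `run_regression`, the `transcribe` callable, the baseline dictionary, and the tolerance value are all hypothetical names and choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # Dynamic-programming Levenshtein distance, computed over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    audio_path: str  # hypothetical pointer to the fixed input
    expected: str    # known-good transcript


def run_regression(cases: List[GoldenCase],
                   transcribe: Callable[[str], str],
                   baseline: Dict[str, float],
                   tolerance: float = 0.02) -> Dict[str, float]:
    """Return the cases whose WER is worse than the recorded baseline."""
    regressions = {}
    for case in cases:
        wer = word_error_rate(case.expected, transcribe(case.audio_path))
        # Cases with no recorded baseline are held to a baseline of 0.0.
        if wer > baseline.get(case.case_id, 0.0) + tolerance:
            regressions[case.case_id] = wer
    return regressions
```

Run on every change to prompts, chunking, or model version, a non-empty result blocks the change until someone looks at the failing cases. The same structure works for other task-specific metrics by swapping out `word_error_rate`.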
3. "What happens when the model is wrong?"
Models are wrong. Not occasionally — routinely, at some percentage, forever. The engineering is not in preventing that; it's in deciding what the product does when it happens.
You're listening for failure-mode thinking:
- Detection: confidence thresholds, output validation, schema checks, citation verification
- Degradation: fallback models, retries with different parameters, graceful "I don't know" states
- Containment: which actions the AI can take autonomously vs. which need a human or a hard rule in front of them
- Honesty in the UI: does the product signal uncertainty to the user, or does it present hallucinations with full confidence?
Demo-makers design for the model being right. The case where it's wrong is the one that emails support, leaves the one-star review, or sues you — designing for that case is the actual job.
If the candidate has never thought about what their system does on a bad output, every failure becomes your incident, discovered by your customers.
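The detection and degradation points above can be sketched in a few lines. This is one possible shape under stated assumptions, not a prescribed design: the JSON schema with `answer` and `confidence` fields, the `parse_and_validate` and `answer_with_fallback` names, and the threshold are hypothetical, and a self-reported confidence value is itself an assumption (in practice it might come from a separate classifier or a verification step).

```python
import json
from typing import Callable, Optional, Sequence


def parse_and_validate(raw: str, min_confidence: float) -> Optional[dict]:
    """Return the parsed output only if it passes schema and confidence checks."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None  # not valid JSON at all
    if not isinstance(data, dict):
        return None
    answer = data.get("answer")
    confidence = data.get("confidence")
    if not isinstance(answer, str) or not answer.strip():
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not min_confidence <= confidence <= 1.0:
        return None
    return {"answer": answer, "confidence": float(confidence)}


def answer_with_fallback(prompt: str,
                         models: Sequence[Callable[[str], str]],
                         min_confidence: float = 0.7) -> dict:
    """Try each model in order; degrade to an explicit 'unknown' state."""
    for call_model in models:
        try:
            raw = call_model(prompt)
        except Exception:
            continue  # provider error: fall through to the next model
        result = parse_and_validate(raw, min_confidence)
        if result is not None:
            return {"status": "ok", **result}
    # Nothing passed validation, so the UI can show "I don't know".
    return {"status": "unknown", "answer": None, "confidence": 0.0}
```

The design choice worth probing in an interview is the last line: the system returns an explicit `unknown` status instead of passing an unvalidated answer through, which is what lets the UI signal uncertainty honestly.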
4. "Who owns the code, and can I run it without you?"
This one is about your downside. Freelance engagements end — sometimes well, sometimes not — and the asymmetry of a bad ending is brutal if you didn't ask this up front.
Concretely, get clear answers on:
- IP assignment — you own the code outright on payment, in writing, no "license to use"
- Repo location — it lives in your GitHub org from day one, not theirs
- Accounts and keys — model provider accounts, hosting, and DNS are yours; the engineer gets invited in, never the reverse
- Bus factor — is there a README and deployment doc good enough that another competent engineer could take over in a week?
A professional will agree to all of this immediately and probably bring it up before you do. Resistance on any point — especially repo ownership or API keys living in their personal account — is disqualifying, not negotiable.
Red flags, ranked
| Red flag | What it usually means | Severity |
|---|---|---|
| Wants code/keys/repos in their accounts | Lock-in by design; ugly off-boarding | Disqualifying |
| Portfolio is only demos and tutorials, nothing live | Has never operated anything; you fund their learning curve | High |
| "Evals? I just test the prompts manually" | No regression safety; every change is a gamble | High |
| Can't name a single production failure they've handled | Either inexperienced or not honest; both bad | High |
| Quotes a fixed price instantly, no discovery questions | Doesn't understand the problem, or plans to make scope your problem | Medium |
| Promises a specific accuracy number before seeing your data | Selling, not engineering | Medium |
| Every answer name-drops frameworks, none mentions trade-offs | Tutorial-level depth | Medium |
| No opinion on cost per request / unit economics | Has never paid a model bill at volume | Medium |
One medium flag is a conversation topic. Two highs, or any disqualifier, mean keep looking: the supply of freelance AI engineers is large, and the supply of good ones is small but findable.
A fair fight: how to vet me
It would be convenient to stop here. But the standard has to apply to me too. If you're considering working with me, don't take the résumé on faith — Spotify and Klarna are real, but past employers vouch for nobody's freelance work, including mine.
Instead, vet me on proof-of-work:
- Question 1: transcribe.so and goodlisten.co are live. Sign up, upload a file, try to break them. What you experience is my production engineering, unfiltered by a portfolio page.
- Question 2: ask me about my eval setups on a call. I'll show you the golden corpus, not describe it.
- Question 3: ask me what broke last quarter and what the product did when the model was wrong. I'll tell you, specifically, including the part that was my fault.
- Question 4: your repo, your org, your keys, IP assigned on payment. Standard terms, in the contract.
This generalizes: good freelancers welcome hard vetting, because it's where they beat the demo-makers undercutting them on rate. If a candidate gets defensive under these questions, that's your answer too.
Frequently Asked Questions
How long should vetting a freelance AI engineer take?
One focused hour of conversation plus an hour of checking their live work is usually enough — the four questions above produce strong signal fast. A small paid trial task (a few days, real scope, real codebase) is worth far more than a third interview. Skip unpaid "test projects"; good engineers decline them and you select for desperation.
Should I test them with a take-home coding challenge?
For AI work specifically, a generic LeetCode-style test measures almost nothing relevant. If you want a work sample, pay for a tightly scoped real task: "add an eval harness for this one prompt" or "instrument the cost per request on this endpoint." You'll see how they think about exactly the problems that matter.
What's a fair rate, and is cheaper riskier?
Experienced AI engineers who operate production systems typically charge $100–$250/hour or equivalent project rates, varying by region. Below that range you're mostly buying demo-makers, and the true cost shows up later as a rewrite. The expensive engineer who ships a maintainable system once is almost always cheaper than two cheap attempts.
What if I can't evaluate the technical answers myself?
Borrow judgment: have a technical friend or advisor sit in on one call, or hire a fractional CTO for a few hours to run the vetting. Failing that, weight question 1 and question 4 heavily — "show me what you operate" and "who owns the code" are verifiable by any founder, no ML background required.
If you'd rather start from proof-of-work than a stack of résumés, book a call and bring your hardest questions — I'll answer the same four I just gave you.
Have something that needs shipping?
I'm Seunghun Lee — I design, build, and ship production AI agents and full-stack SaaS. Tell me what you're building.