What makes sunmoon.dev different from an agency?

You work directly with me — a SaaS founder who builds and operates his own AI products (transcribe.so, goodlisten.co). There's no account manager, no junior bench, and no hand-off. The person who scopes your build is the person who ships it.

How long does it take to build an AI application?

It depends on complexity. A basic AI agent can ship in 4–6 weeks; more involved solutions with custom features take 3–6 months. You'll get a concrete timeline on the first call before any commitment.

Do you provide ongoing support after launch?

Yes. Every build includes 6 months of email support, and I offer a discounted on-call package for extended maintenance and feature work as you scale.

What technologies do you use?

Industry-standard, mostly open tooling to avoid vendor lock-in: Next.js and Node for the app layer, Postgres/Supabase for data, and the right AI models picked per task (OpenAI, Qwen, Mistral, and others). Exact stack is chosen around your requirements.

How do you handle data privacy and security?

Encryption in transit and at rest, scoped access, and compliance-aware design. Where it matters, solutions can be deployed on your own infrastructure so you keep full control of your data.

What does it cost?

Pricing is transparent and upfront. Consulting starts at $200/hour; AI agent and full-stack SaaS builds start at $10,000, with final cost depending on scope. You get a detailed quote after the first call.

Building goodlisten.co: AI Podcast Discovery and a Creator Studio

You have a podcast you like. You want three more like it. So you type its name into a search box, get a list of shows that share a tag or a guest, and none of them feel right. The problem isn't your taste. The problem is that almost every podcast discovery tool searches metadata — titles, categories, descriptions someone wrote in 2019 — when what you care about lives inside the audio.

That gap is why I built goodlisten.co. This is a case study of how it works under the hood: how discovery runs on embeddings instead of keywords, how long episodes get turned into chapters, highlights, and clips, and how the audio-plus-NLP stack fits together. If you're building anything that has to understand spoken content, the trade-offs here are the same ones you'll hit.

Why metadata search fails for audio

Podcasts are a brutal discovery problem. A single episode can be two hours long, cover six topics, and carry none of that in its title. The category taxonomy is coarse ("Technology," "Society & Culture") and self-assigned. So keyword search collapses into a popularity contest: you find what's already famous, not what's relevant to the thing you liked.

The fix is to stop matching strings and start matching meaning. Concretely:

Transcribe everything. You can't reason about audio you haven't turned into text.
Embed the content, not the title. Represent episodes (and segments within them) as vectors in a semantic space.
Retrieve by similarity. "More like this" becomes a nearest-neighbor query, not a tag filter.
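As a minimal sketch of the "retrieve by similarity" step, here is a brute-force nearest-neighbor lookup over toy vectors. The episode IDs, the 3-dimensional vectors, and the function names are all hypothetical, and a real system at podcast scale would use an approximate index rather than scanning every vector.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def more_like_this(
    episode_id: str, vectors: Dict[str, List[float]], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k episodes most similar to episode_id, best first."""
    query = vectors[episode_id]
    scored = [
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in vectors.items()
        if other_id != episode_id  # never recommend the episode itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings (hypothetical data).
EPISODES = {
    "ep_startups_a": [0.9, 0.1, 0.0],
    "ep_startups_b": [0.8, 0.2, 0.1],
    "ep_cooking": [0.0, 0.1, 0.9],
}
```

Calling `more_like_this("ep_startups_a", EPISODES, k=1)` ranks the other startup episode above the cooking one, with no shared tag or keyword involved.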

None of this is exotic anymore. The hard part is doing it at podcast scale — millions of minutes of audio — cheaply enough that the product is still viable.

The lineage: goodlisten started where transcribe.so left off

I didn't build the audio pipeline from scratch. I'd already built one.

When I built transcribe.so, I spent a long time on the unglamorous parts of turning speech into structured, trustworthy text: chunking long files so they fit model context windows, aggregating chunk-level output back into a coherent transcript, handling speaker boundaries, and recovering gracefully when a single chunk fails instead of poisoning the whole job. That work became the foundation goodlisten sits on. Discovery is only as good as the transcript underneath it, and I already had a transcription stack I trusted.
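To make the chunk, transcribe, and aggregate idea concrete, here is a rough sketch under stated assumptions. It is not the transcribe.so code: the chunk length, overlap, retry count, gap marker, and the `transcribe_chunk` callback are all hypothetical, and deduplicating text in the overlap region is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ChunkResult:
    start_s: float
    end_s: float
    text: Optional[str]  # None means every attempt for this chunk failed


def plan_chunks(
    duration_s: float, chunk_s: float = 600.0, overlap_s: float = 5.0
) -> List[Tuple[float, float]]:
    """Split [0, duration_s) into overlapping (start, end) spans."""
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    spans: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        spans.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s  # step back so words at the seam are not cut
    return spans


def transcribe_all(
    duration_s: float,
    transcribe_chunk: Callable[[float, float], str],
    chunk_s: float = 600.0,
    overlap_s: float = 5.0,
    max_attempts: int = 3,
) -> List[ChunkResult]:
    """Transcribe chunk by chunk; a failing chunk is recorded, not fatal."""
    results: List[ChunkResult] = []
    for start, end in plan_chunks(duration_s, chunk_s, overlap_s):
        text: Optional[str] = None
        for _ in range(max_attempts):
            try:
                text = transcribe_chunk(start, end)
                break
            except Exception:
                continue  # retry this chunk only
        results.append(ChunkResult(start, end, text))
    return results


def aggregate(results: List[ChunkResult]) -> str:
    """Join chunk texts in order, marking failed chunks as explicit gaps."""
    parts = []
    for r in results:
        if r.text is None:
            parts.append(f"[gap {r.start_s:.0f}s-{r.end_s:.0f}s]")
        else:
            parts.append(r.text.strip())
    return " ".join(parts)
```

The design choice worth noting is that failure is isolated per chunk: one bad span becomes a visible gap in the transcript that can be retried later, instead of failing the whole job.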

The unsexy lesson from both products: most of what reads as "AI audio" quality is data plumbing, not the model. Clean chunking and reliable aggregation beat a fancier model on a messy pipeline every time.

If you're early on a similar build, that's the order of operations I'd recommend — get transcription boringly reliable first, then layer intelligence on top. The reverse never works.

How discovery works

Here's the discovery path end to end:

Ingest + transcribe. An episode comes in, gets chunked, transcribed, and reassembled into a clean transcript with timestamps.
Segment. I split the transcript into topical segments rather than fixed-length windows — a two-hour interview is really several distinct conversations stitched together.
Embed. Each segment and each episode gets an embedding. Storing segment-level vectors (not just one per episode) is what makes "find the part where they talk about X" possible.
Index. Vectors go into a vector store with metadata filters (language, length, recency) layered on top.
Retrieve + rerank. A query — whether it's a typed search or an implicit "more like this episode" — runs as approximate nearest-neighbor search, then a reranking pass tightens the top results.

The thing people underestimate is chunking strategy for embeddings. Embed too coarsely (one vector per episode) and you blur six topics into mush. Embed too finely (one vector per sentence) and you drown in noise and cost. Topical segmentation is the sweet spot, and getting it right accounts for most of the retrieval quality.
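One common way to approximate topical segmentation is to start a new segment wherever the similarity between adjacent sentence embeddings drops below a threshold. The sketch below illustrates that general technique only; it is an assumption, not a description of goodlisten's segmenter, and the threshold value and 2-dimensional toy vectors are hypothetical.

```python
import math
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def segment_by_topic(
    sentence_vectors: List[List[float]], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Return (start, end) sentence index ranges, end exclusive.

    A boundary is placed between sentence i-1 and i whenever their
    embeddings are less similar than `threshold`.
    """
    if not sentence_vectors:
        return []
    segments: List[Tuple[int, int]] = []
    start = 0
    for i in range(1, len(sentence_vectors)):
        if cosine(sentence_vectors[i - 1], sentence_vectors[i]) < threshold:
            segments.append((start, i))
            start = i
    segments.append((start, len(sentence_vectors)))
    return segments


# Two sentences about one topic, then two about another (toy data).
SENTENCES = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
```

With the toy data, `segment_by_topic(SENTENCES)` yields two segments of two sentences each. Each resulting range would then get its own embedding, which sits between the too-coarse and too-fine extremes described above.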

Embeddings vs. keyword search, concretely

Dimension	Keyword / metadata search	Embedding-based discovery
Matches on	Exact strings, tags	Semantic meaning
"More like this"	Shared tag or guest	Nearest neighbors in vector space
Finds the moment?	No — episode level only	Yes — segment level
Handles synonyms / paraphrase	Poorly	Natively
Main cost driver	Storage, trivial compute	Embedding compute + vector storage
Failure mode	Misses relevant, surfaces popular	Occasional semantic drift; needs reranking

Neither is strictly better for every case — keyword filters are still great for hard constraints like language or date. The win is using embeddings for relevance and metadata for constraints, not picking one.
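A small sketch of that division of labor: metadata acts as a hard filter, and embeddings only rank what survives it. The `Segment` fields, IDs, and vectors are hypothetical, and the reranking pass is omitted to keep the example short.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    segment_id: str
    language: str
    vector: List[float]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(
    query_vec: List[float], segments: List[Segment], language: str, k: int = 2
) -> List[str]:
    """Filter by a hard metadata constraint, then rank by similarity."""
    candidates = [s for s in segments if s.language == language]
    candidates.sort(key=lambda s: cosine(query_vec, s.vector), reverse=True)
    return [s.segment_id for s in candidates[:k]]


CATALOG = [
    Segment("en_close", "en", [0.9, 0.1]),
    Segment("de_exact", "de", [1.0, 0.0]),  # best match, wrong language
    Segment("en_far", "en", [0.0, 1.0]),
]
```

Here `search([1.0, 0.0], CATALOG, language="en")` never returns `de_exact`, even though it is the closest vector, because the language constraint is applied before similarity is considered.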

The creator studio: long episodes into shippable pieces

Discovery is the listener side. The other half of goodlisten is for creators: the same transcript-plus-NLP stack, pointed at a different job.

A two-hour episode is a goldmine that almost nobody mines, because the labor of finding the good 90 seconds is enormous. So the studio does it automatically:

Chapters. Topical segmentation (the same machinery discovery uses) becomes navigable chapter markers with titles.
Highlights. I score segments for "clip-worthiness" — self-contained, quotable, emotionally or informationally dense — and surface the best ones.
Clips. Highlights get turned into short, shareable cuts with accurate captions, ready for the places short audio and video actually travel.

The reuse here is the whole point. One segmentation pass powers both "help me discover episodes" and "help me chop up my episode."
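To illustrate how scored segments might become a shortlist of clips, here is a hedged sketch. The three sub-scores mirror the criteria named above, but the weights, the duration limits, and the assumption that an upstream LLM pass produces scores between 0 and 1 are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ScoredSegment:
    start_s: float
    end_s: float
    self_contained: float  # each sub-score assumed to be in [0, 1]
    quotable: float
    density: float


def clip_worthiness(
    seg: ScoredSegment, weights: Tuple[float, float, float] = (0.4, 0.3, 0.3)
) -> float:
    """Weighted blend of the three sub-scores (weights are illustrative)."""
    w_self, w_quote, w_dense = weights
    return (
        w_self * seg.self_contained
        + w_quote * seg.quotable
        + w_dense * seg.density
    )


def pick_highlights(
    segments: List[ScoredSegment],
    max_clips: int = 3,
    min_s: float = 20.0,
    max_s: float = 90.0,
) -> List[ScoredSegment]:
    """Keep clip-length segments, take the best few, return in episode order."""
    eligible = [s for s in segments if min_s <= s.end_s - s.start_s <= max_s]
    best = sorted(eligible, key=clip_worthiness, reverse=True)[:max_clips]
    return sorted(best, key=lambda s: s.start_s)
```

Because the input is the same list of topical segments that discovery uses, this step adds only scoring and selection on top of work that has already been done.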

The stack, and what I'd tell you to copy

I'll keep this concrete rather than name-dropping every library:

Transcription layer — chunked ASR with robust aggregation, inherited from the transcribe.so work.
NLP layer — topical segmentation, embedding generation, and an LLM pass for titling chapters and scoring highlights.
Vector retrieval — an approximate-nearest-neighbor index with metadata filters and a reranking stage.
Async everything — none of this happens in a request/response cycle. Ingestion is a queue of jobs, because a single episode can take minutes to process and you cannot block a user on that.

That last point is the one I see people get wrong most. Audio processing is slow and bursty. If you try to do it synchronously, your product falls over the first time someone uploads a three-hour episode. Treat it as a pipeline of durable background jobs from day one.
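The shape of that pipeline can be sketched with a queue and a pool of workers. This uses an in-memory `asyncio.Queue` purely for illustration, so it is not durable; a production system would back the queue with persistent storage. `process_episode` is a hypothetical placeholder for the slow transcription and NLP work.

```python
import asyncio
from typing import Dict, List


async def process_episode(episode_id: str) -> str:
    """Placeholder for slow work (transcribe, segment, embed)."""
    await asyncio.sleep(0)  # yield control, as real I/O would
    return f"{episode_id}:done"


async def worker(queue: asyncio.Queue, results: Dict[str, str]) -> None:
    """Pull jobs forever; one failing job must not kill the worker."""
    while True:
        episode_id = await queue.get()
        try:
            results[episode_id] = await process_episode(episode_id)
        except Exception as exc:
            results[episode_id] = f"failed: {exc}"
        finally:
            queue.task_done()


async def run_pipeline(
    episode_ids: List[str], concurrency: int = 2
) -> Dict[str, str]:
    """Enqueue all episodes and process them with a fixed worker pool."""
    queue: asyncio.Queue = asyncio.Queue()
    results: Dict[str, str] = {}
    workers = [
        asyncio.create_task(worker(queue, results)) for _ in range(concurrency)
    ]
    for episode_id in episode_ids:
        queue.put_nowait(episode_id)
    await queue.join()  # wait until every job has been marked done
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

The upload handler's only job in this model is to enqueue an episode ID and return immediately; `asyncio.run(run_pipeline(["a", "b", "c"]))` stands in for the background side that drains the queue.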

My background shaped these instincts more than any framework did. At Spotify I worked on systems where audio and recommendations were the entire business, so I have strong priors about what scales. At Klarna and at a Y Combinator–backed startup, I learned to ship the version that works before the version that's elegant. And as a former Supabase Expert Partner, I lean on Postgres — including vector search in Postgres — far more than people expect for products at this stage.

What this means if you're building something similar

Take three things from this:

Reliability beats cleverness. Your transcript quality caps everything downstream. Fix that first.
Segment-level representation is the unlock. Whether the job is search or summarization, the value is in finding the moment, and that requires sub-episode granularity.
Build the representation once, sell it twice. The same NLP layer serves listeners and creators — don't build two stacks.

Frequently Asked Questions

How is embedding-based podcast discovery different from a normal search engine?

A search engine matches strings — you find episodes whose title or tags contain your words. Embedding-based discovery matches meaning, so it can surface an episode that never uses your exact phrasing but covers the same idea. On goodlisten.co I store embeddings at the segment level, which also lets it find the specific part of an episode that's relevant, not just the episode as a whole.

Every episode is transcribed before anything else, because discovery quality is capped by transcript quality. That stack is inherited directly from transcribe.so, where I'd already solved chunking long audio and reliably reassembling it.

How do you turn a two-hour episode into clips automatically?

The same topical segmentation that powers discovery splits the episode into self-contained segments. I then score each segment for "clip-worthiness" — how quotable, self-contained, and dense it is — and turn the top ones into captioned, shareable cuts. It's one representation reused for two jobs.

Can I build this on Postgres, or do I need a dedicated vector database?

For most products at the stage I work with, Postgres with vector search handles it well, and keeping retrieval next to your relational data simplifies a lot. As a former Supabase Expert Partner I default to that until there's a measured reason to move. A dedicated vector store earns its place at large scale or with very high query volume — not on day one.
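For a sense of what vector search in Postgres looks like, here is a sketch that builds a parameterized query for the pgvector extension, where `<=>` is the cosine distance operator. The `segments` table and its `id`, `episode_id`, `language`, and `embedding` columns are hypothetical, and the `%s` placeholders assume a psycopg-style driver. The function only builds the query; it does not connect to a database.

```python
from typing import List, Tuple


def build_similar_segments_query(
    query_vector: List[float], language: str, limit: int = 10
) -> Tuple[str, tuple]:
    """Return (sql, params) for a filtered nearest-neighbor query.

    pgvector accepts a text literal such as '[0.1,0.2]' cast to vector.
    """
    vector_literal = "[" + ",".join(str(float(x)) for x in query_vector) + "]"
    sql = (
        "SELECT id, episode_id, embedding <=> %s::vector AS distance "
        "FROM segments "                      # hypothetical table name
        "WHERE language = %s "                # relational filter, same query
        "ORDER BY embedding <=> %s::vector "  # smallest distance first
        "LIMIT %s"
    )
    params = (vector_literal, language, vector_literal, limit)
    return sql, params
```

The point of the example is the `WHERE` clause: the metadata filter and the similarity ranking live in one SQL statement against one database, which is the simplification that keeping retrieval next to relational data buys you.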

If you're building an AI audio or discovery product and want a second set of eyes on the architecture before you commit, book a call.
