sunmoon.dev

Building goodlisten.co: AI Podcast Discovery and a Creator Studio

Seunghun Lee
case study · goodlisten.co · AI · product

You have a podcast you genuinely like. You want three more like it. So you type its name into a search box, get a list of shows that share a tag or a guest, and none of them feel right. The problem isn't your taste. The problem is that almost every podcast discovery tool searches metadata — titles, categories, descriptions someone wrote in 2019 — when what you actually care about lives inside the audio.

That gap is why I built goodlisten.co. This is a case study of how it works under the hood: how discovery runs on embeddings instead of keywords, how long episodes get turned into chapters, highlights, and clips, and how the audio-plus-NLP stack fits together. If you're building anything that has to understand spoken content, the trade-offs here are the same ones you'll hit.

Why metadata search fails for audio

Podcasts are a brutal discovery problem. A single episode can be two hours long, cover six topics, and carry none of that in its title. The category taxonomy is coarse ("Technology," "Society & Culture") and self-assigned. So keyword search collapses into a popularity contest: you find what's already famous, not what's actually relevant to the thing you liked.

The fix is to stop matching strings and start matching meaning. Concretely:

  • Transcribe everything. You can't reason about audio you haven't turned into text.
  • Embed the content, not the title. Represent episodes (and segments within them) as vectors in a semantic space.
  • Retrieve by similarity. "More like this" becomes a nearest-neighbor query, not a tag filter.

None of this is exotic anymore. The hard part is doing it at podcast scale — millions of minutes of audio — cheaply enough that the product is still viable.
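To make "retrieve by similarity" concrete, here is a minimal sketch of a "more like this" lookup as a nearest-neighbor query. The episode IDs and three-dimensional vectors are invented for illustration; real embeddings come from a model and have hundreds of dimensions, and a production system would use an approximate index rather than this exact scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def more_like_this(query_id, vectors, k=3):
    """Return the k episodes closest to query_id, most similar first."""
    query = vectors[query_id]
    scored = [
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in vectors.items()
        if other_id != query_id  # never recommend the episode itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings, not real model output.
EPISODE_VECTORS = {
    "ep-startup-hiring": [0.9, 0.1, 0.0],
    "ep-founder-burnout": [0.8, 0.3, 0.1],
    "ep-sourdough-basics": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    for episode_id, score in more_like_this("ep-startup-hiring", EPISODE_VECTORS, k=2):
        print(episode_id, round(score, 3))
```

Nothing here looks at titles or tags: two episodes are neighbors only because their content vectors point in a similar direction.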

The lineage: goodlisten started where transcribe.so left off

I didn't build the audio pipeline from scratch. I'd already built one.

When I built transcribe.so, I spent a long time on the unglamorous parts of turning speech into structured, trustworthy text: chunking long files so they fit model context windows, aggregating chunk-level output back into a coherent transcript, handling speaker boundaries, and recovering gracefully when a single chunk fails instead of poisoning the whole job. That work became the foundation goodlisten sits on. Discovery is only as good as the transcript underneath it, and I already had a transcription stack I trusted.
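The chunk, transcribe, and aggregate loop described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the transcribe.so implementation: the window sizes, the retry count, the `transcribe_chunk` callable, and the `[gap ...]` marker are all hypothetical, and deduplicating the words inside the overlap region is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ChunkResult:
    start_s: float
    end_s: float
    text: Optional[str]  # None means the chunk failed after all retries


def plan_chunks(duration_s: float, chunk_s: float = 600.0,
                overlap_s: float = 5.0) -> List[Tuple[float, float]]:
    """Cover the whole file with fixed windows that overlap slightly."""
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s  # step back so no speech is cut at a boundary
    return windows


def transcribe_all(windows, transcribe_chunk: Callable[[float, float], str],
                   max_attempts: int = 2) -> List[ChunkResult]:
    """Transcribe each window independently; one failure never aborts the job."""
    results = []
    for start, end in windows:
        text = None
        for _ in range(max_attempts):
            try:
                text = transcribe_chunk(start, end)
                break
            except Exception:
                continue  # retry; after the last attempt text stays None
        results.append(ChunkResult(start, end, text))
    return results


def aggregate(results: List[ChunkResult]) -> str:
    """Join chunk text in time order, marking failed chunks instead of hiding them."""
    parts = []
    for r in sorted(results, key=lambda r: r.start_s):
        if r.text is None:
            parts.append(f"[gap {r.start_s:.0f}s-{r.end_s:.0f}s]")
        else:
            parts.append(r.text.strip())
    return " ".join(parts)
```

The design choice worth copying is that a failed chunk becomes an explicit, timestamped gap in the output, so it can be retried later without redoing the rest of the file.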

The unsexy lesson from both products: in my experience, roughly 80% of "AI audio" quality is data plumbing, not the model. Clean chunking and reliable aggregation beat a fancier model on a messy pipeline every time.

If you're early on a similar build, that's the order of operations I'd recommend — get transcription boringly reliable first, then layer intelligence on top. The reverse never works.

How discovery actually works

Here's the discovery path end to end:

  1. Ingest + transcribe. An episode comes in, gets chunked, transcribed, and reassembled into a clean transcript with timestamps.
  2. Segment. I split the transcript into topical segments rather than fixed-length windows — a two-hour interview is really several distinct conversations stitched together.
  3. Embed. Each segment and each episode gets an embedding. Storing segment-level vectors (not just one per episode) is what makes "find the part where they talk about X" possible.
  4. Index. Vectors go into a vector store with metadata filters (language, length, recency) layered on top.
  5. Retrieve + rerank. A query — whether it's a typed search or an implicit "more like this episode" — runs as approximate nearest-neighbor search, then a reranking pass tightens the top results.

The thing people underestimate is chunking strategy for embeddings. Embed too coarsely (one vector per episode) and you blur six topics into mush. Embed too finely (one vector per sentence) and you drown in noise and cost. Topical segmentation is the sweet spot, and getting it right is most of the quality.
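The article does not say which segmentation method goodlisten uses, so the following is only one common approach, sketched under that assumption: embed each sentence, then start a new segment wherever similarity between neighboring sentences drops below a threshold. The vectors and the 0.5 threshold are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def topical_segments(sentence_vectors, threshold=0.5):
    """Return (start, end) sentence-index ranges, end exclusive.

    A boundary is placed between sentence i-1 and sentence i whenever
    their similarity falls below the threshold (a likely topic shift).
    """
    if not sentence_vectors:
        return []
    segments = []
    start = 0
    for i in range(1, len(sentence_vectors)):
        if cosine(sentence_vectors[i - 1], sentence_vectors[i]) < threshold:
            segments.append((start, i))
            start = i
    segments.append((start, len(sentence_vectors)))
    return segments


# Hypothetical sentence embeddings: two sentences on one topic, then two on another.
SENTENCES = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

if __name__ == "__main__":
    print(topical_segments(SENTENCES))
```

Each resulting range would then get its own embedding, which sits between the two failure modes above: coarser than one vector per sentence, finer than one per episode.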

Embeddings vs. keyword search, concretely

| Dimension | Keyword / metadata search | Embedding-based discovery |
| --- | --- | --- |
| Matches on | Exact strings, tags | Semantic meaning |
| "More like this" | Shared tag or guest | Nearest neighbors in vector space |
| Finds the moment? | No, episode level only | Yes, segment level |
| Handles synonyms / paraphrase | Poorly | Natively |
| Main cost driver | Storage, trivial compute | Embedding compute + vector storage |
| Failure mode | Misses relevant, surfaces popular | Occasional semantic drift; needs reranking |

Neither is strictly better for every case — keyword filters are still great for hard constraints like language or date. The win is using embeddings for relevance and metadata for constraints, not picking one.
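A small sketch of that split, with metadata as a hard filter and embeddings as the ranking signal. The `Segment` fields, IDs, and two-dimensional vectors are hypothetical, and the exact scan stands in for an approximate nearest-neighbor index.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    segment_id: str
    language: str  # metadata: used as a hard constraint
    vector: list   # embedding: used for relevance


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector, segments, language, k=2):
    """Filter by metadata first, then rank the survivors by similarity."""
    # 1. Constraints: anything failing the filter is out, however similar.
    allowed = [s for s in segments if s.language == language]
    # 2. Relevance: order what is left by embedding similarity.
    ranked = sorted(allowed, key=lambda s: cosine(query_vector, s.vector), reverse=True)
    return [s.segment_id for s in ranked[:k]]


# Toy data: the Spanish segment is the closest vector but fails the constraint.
SEGMENTS = [
    Segment("en-hiring", "en", [0.9, 0.1]),
    Segment("es-hiring", "es", [1.0, 0.0]),
    Segment("en-baking", "en", [0.1, 0.9]),
]

if __name__ == "__main__":
    print(search([1.0, 0.0], SEGMENTS, language="en"))
```

The Spanish segment never appears in English results even though it is the nearest vector, which is the behavior a hard constraint should have.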

The creator studio: long episodes into shippable pieces

Discovery is the listener side. The other half of goodlisten is for creators: the same transcript-plus-NLP stack, pointed at a different job.

A two-hour episode is a goldmine that almost nobody mines, because the labor of finding the good 90 seconds is enormous. So the studio does it automatically:

  • Chapters. Topical segmentation (the same machinery discovery uses) becomes navigable chapter markers with titles.
  • Highlights. I score segments for "clip-worthiness" — self-contained, quotable, emotionally or informationally dense — and surface the best ones.
  • Clips. Highlights get turned into short, shareable cuts with accurate captions, ready for the places short audio and video actually travel.

The reuse here is the whole point. One segmentation pass powers both "help me discover episodes" and "help me chop up my episode." Build the representation once; sell it to both sides of the marketplace.
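One way to picture highlight selection is as a weighted score over the three qualities named above. This is a sketch under assumptions, not the studio's scoring code: the sub-scores are taken as given (for example from an LLM pass, on a 0 to 1 scale), and the weights and the 90-second cap are illustrative values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentScores:
    segment_id: str
    duration_s: float
    self_contained: float  # 0..1, assumed to come from an upstream scoring pass
    quotable: float        # 0..1
    dense: float           # 0..1, emotional or informational density


# Hypothetical weights; they sum to 1 so the result stays in 0..1.
WEIGHTS = {"self_contained": 0.4, "quotable": 0.3, "dense": 0.3}


def clip_worthiness(s: SegmentScores) -> float:
    """Weighted sum of the three sub-scores."""
    return (
        WEIGHTS["self_contained"] * s.self_contained
        + WEIGHTS["quotable"] * s.quotable
        + WEIGHTS["dense"] * s.dense
    )


def pick_highlights(segments, top_n=2, max_duration_s=90.0):
    """Keep segments short enough to clip, then return the top scorers."""
    eligible = [s for s in segments if s.duration_s <= max_duration_s]
    return sorted(eligible, key=clip_worthiness, reverse=True)[:top_n]


# Toy data for illustration only.
CANDIDATES = [
    SegmentScores("intro-banter", 45.0, 0.2, 0.3, 0.1),
    SegmentScores("pricing-story", 80.0, 0.9, 0.8, 0.7),
    SegmentScores("long-tangent", 400.0, 0.9, 0.9, 0.9),
    SegmentScores("hiring-rule", 60.0, 0.8, 0.9, 0.6),
]

if __name__ == "__main__":
    for seg in pick_highlights(CANDIDATES):
        print(seg.segment_id, round(clip_worthiness(seg), 2))
```

The inputs are the same segments discovery already indexes; only the scoring function is specific to the creator side.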

The stack, and what I'd tell you to copy

I'll keep this concrete rather than name-dropping every library:

  • Transcription layer — chunked ASR with robust aggregation, inherited from the transcribe.so work.
  • NLP layer — topical segmentation, embedding generation, and an LLM pass for titling chapters and scoring highlights.
  • Vector retrieval — an approximate-nearest-neighbor index with metadata filters and a reranking stage.
  • Async everything — none of this happens in a request/response cycle. Ingestion is a queue of jobs, because a single episode can take minutes to process and you cannot block a user on that.

That last point is the one I see people get wrong most. Audio processing is slow and bursty. If you try to do it synchronously, your product falls over the first time someone uploads a three-hour episode. Treat it as a pipeline of durable background jobs from day one.
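The shape of that pipeline can be sketched with a queue and a few workers. This example uses an in-memory `asyncio.Queue` purely to show the pattern; it is not durable, so a real system would back the queue with persistent storage (a database table or a message broker) to survive restarts. `process_episode` is a hypothetical stand-in for the slow transcription and NLP work.

```python
import asyncio


async def process_episode(episode_id: str) -> str:
    """Placeholder for minutes of transcription, segmentation, and embedding."""
    await asyncio.sleep(0.01)
    return f"{episode_id}:done"


async def worker(queue: asyncio.Queue, results: dict) -> None:
    """Pull jobs forever; a failing job is recorded, not allowed to kill the worker."""
    while True:
        episode_id = await queue.get()
        try:
            results[episode_id] = await process_episode(episode_id)
        except Exception as exc:
            results[episode_id] = f"failed: {exc}"
        finally:
            queue.task_done()


async def run_pipeline(episode_ids, concurrency: int = 2) -> dict:
    queue: asyncio.Queue = asyncio.Queue()
    results: dict = {}
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(concurrency)]
    for episode_id in episode_ids:
        queue.put_nowait(episode_id)  # enqueueing returns immediately
    await queue.join()                # wait until every job is marked done
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


if __name__ == "__main__":
    print(asyncio.run(run_pipeline(["ep1", "ep2", "ep3"])))
```

The request that accepts an upload only has to enqueue the job and return; progress is reported separately once the workers catch up.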

My background shaped these instincts more than any framework did. At Spotify I worked on systems where audio and recommendations were the entire business, so I have strong priors about what scales. At Klarna and at a Y Combinator–backed startup, I learned to ship the version that works before the version that's elegant. And as a former Supabase Expert Partner, I lean on Postgres — including vector search in Postgres — far more than people expect for products at this stage. (To be clear, those were my roles; none of those companies endorse this studio or its products.)

What this means if you're building something similar

If you take three things from this:

  • Reliability beats cleverness. Your transcript quality caps everything downstream. Fix that first.
  • Segment-level representation is the unlock. Whether the job is search or summarization, the value is in finding the moment, and that requires sub-episode granularity.
  • Build the representation once, monetize it twice. The same NLP layer can serve listeners and creators. Don't build two stacks.

Frequently Asked Questions

How is embedding-based podcast discovery different from a normal search engine?

A keyword search engine matches strings: you find episodes whose title or tags contain your words. Embedding-based discovery matches meaning, so it can surface an episode that never uses your exact phrasing but covers the same idea. On goodlisten.co I store embeddings at the segment level, which also lets it find the specific part of an episode that's relevant, not just the episode as a whole.

Do you transcribe every episode before you can recommend it?

Yes. Discovery quality is capped by transcript quality, so transcription comes first. That stack is inherited directly from transcribe.so, where I'd already solved chunking long audio and reliably reassembling it. Recommendations are only as good as the text underneath them.

How do you turn a two-hour episode into clips automatically?

The same topical segmentation that powers discovery splits the episode into self-contained segments. I then score each segment for "clip-worthiness" — how quotable, self-contained, and dense it is — and turn the top ones into captioned, shareable cuts. It's one representation reused for two jobs.

Can I build this on Postgres, or do I need a dedicated vector database?

For most early-stage products like the ones I work with, Postgres with vector search handles it well, and keeping retrieval next to your relational data simplifies a lot. As a former Supabase Expert Partner I default to that until there's a measured reason to move. A dedicated vector store earns its place at large scale or with very high query volume, not on day one.
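As a rough idea of what "vector search in Postgres" can look like, here is a sketch that assumes the pgvector extension. The table and column names are hypothetical, the three-dimensional vector is a toy size, and the `%(name)s` placeholders assume a psycopg-style driver; the code only builds the SQL and parameters, it does not connect to a database.

```python
# Schema sketch: one row per transcript segment, with its embedding.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS segments (
    id         bigserial PRIMARY KEY,
    episode_id bigint    NOT NULL,
    language   text      NOT NULL,
    embedding  vector(3) NOT NULL  -- toy dimension; real models use far more
);
"""

# Relational filter and similarity ranking in a single query.
NEAREST_SEGMENTS_SQL = """
SELECT id, episode_id
FROM segments
WHERE language = %(language)s
ORDER BY embedding <=> %(query)s::vector  -- <=> is pgvector's cosine distance
LIMIT %(limit)s;
"""


def to_vector_literal(values):
    """Format numbers as pgvector's text form, e.g. [0.1,0.2,0.3]."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"


def build_query_params(query_vector, language, limit=10):
    """Parameters for NEAREST_SEGMENTS_SQL; values are bound, never interpolated."""
    return {
        "query": to_vector_literal(query_vector),
        "language": language,
        "limit": limit,
    }


if __name__ == "__main__":
    print(NEAREST_SEGMENTS_SQL)
    print(build_query_params([0.1, 0.2, 0.3], "en", limit=5))
```

The `WHERE` clause and the `ORDER BY` run against the same table, which is the "retrieval next to your relational data" benefit in practice.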


If you're building an AI audio or discovery product and want a second set of eyes on the architecture before you commit, book a call.
