How it's built
The Plaice to Know is a five-stage pipeline that turns the full transcript archive into a ranked, curated map.
1. Extract
The raw input is podscripts.co's transcript archive — 1,268 episodes, ~109,000 timestamped segments, ~6,300 words per episode on average. spaCy's named-entity recognizer pulls every place-shaped string out of the transcripts: 26,091 candidate places, 109,414 (place, episode, timestamp) tuples.
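A minimal sketch of how the extraction step might be wired up. It assumes that the spaCy labels GPE, LOC and FAC are what count as "place-shaped" and that each transcript segment is a dict with a start time and text. The label set, the segment shape, and the helper names are illustrative assumptions, not the pipeline's confirmed configuration.

```python
from typing import Iterable, Iterator, List, NamedTuple

# Assumption: these spaCy entity labels are treated as "place-shaped".
PLACE_LABELS = {"GPE", "LOC", "FAC"}


class Mention(NamedTuple):
    place: str
    episode_id: str
    timestamp: float


def place_mentions(doc, episode_id: str, timestamp: float) -> Iterator[Mention]:
    """Yield one Mention per place-labelled entity in a parsed segment.

    `doc` is a spaCy Doc, or anything exposing `.ents` whose items
    have `.text` and `.label_`.
    """
    for ent in doc.ents:
        if ent.label_ in PLACE_LABELS:
            yield Mention(ent.text.strip(), episode_id, timestamp)


def extract_episode(nlp, episode_id: str, segments: Iterable[dict]) -> List[Mention]:
    """Run NER over every segment of one episode.

    Hypothetical segment shape: {"start": 12.5, "text": "..."}.
    `nlp` is a loaded spaCy pipeline, e.g. spacy.load("en_core_web_sm")
    (the source does not say which model is used).
    """
    mentions: List[Mention] = []
    for seg in segments:
        doc = nlp(seg["text"])
        mentions.extend(place_mentions(doc, episode_id, seg["start"]))
    return mentions
```

Passing the pipeline in as `nlp` keeps the model choice out of the extraction logic, so a different model or label set can be swapped in without touching the tuple-building code.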
2. Normalize & dedup
Raw NER output is messy. “Convert Garden”, “Cover Garden”, “Covenant Garden”, “the Covent Garden” and “Covent Garden” all show up as separate entries because the source is auto-transcribed audio. A fuzzy-merge pass collapses spelling variants, possessive forms, and transcription errors into 24,865 canonical clusters. Specific blocklists drop bare generics (“Castle”, “Park”), brand names, fictional places, and known noise.
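The fuzzy-merge idea can be sketched with the standard library alone. This is an illustration, not the project's actual merge pass: the 0.8 similarity cutoff, the normalization rules, and the greedy first-match strategy are all assumptions.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, List

# Assumption: a similarity cutoff of 0.8; the real threshold is not stated.
SIMILARITY_THRESHOLD = 0.8


def normalize(name: str) -> str:
    """Lowercase, drop a leading article and a trailing possessive."""
    key = name.strip().lower()
    key = re.sub(r"^the\s+", "", key)
    key = re.sub(r"['’]s$", "", key)
    return re.sub(r"\s+", " ", key)


def cluster_names(names: List[str]) -> Dict[str, List[str]]:
    """Greedily group surface forms whose normalized keys are similar.

    Each cluster is keyed by the normalized form of its first member.
    """
    clusters: Dict[str, List[str]] = {}
    for name in names:
        key = normalize(name)
        for rep in clusters:
            if SequenceMatcher(None, key, rep).ratio() >= SIMILARITY_THRESHOLD:
                clusters[rep].append(name)
                break
        else:
            # No existing cluster was close enough: start a new one.
            clusters[key] = [name]
    return clusters


variants = [
    "Convert Garden", "Cover Garden", "Covenant Garden",
    "the Covent Garden", "Covent Garden", "Camden Town",
]
clusters = cluster_names(variants)
# The five Covent Garden spellings share one cluster; Camden Town gets its own.
```

A greedy pass like this depends on input order, and the cluster key is simply whichever spelling arrived first. A real pipeline would still need a rule for choosing the canonical spelling (for example the most frequent form), which this sketch leaves out.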
3. Re-validate quotes
The original NER pass stored a 220-character context window per mention, but only ~62% of those windows actually contain the tagged place. So for each (canonical place, episode, timestamp) tuple, we open the original episode JSON and pull a wider window of ±2 segments around the mention. Any mention whose place name is missing from that wider window is rejected. The filter validated 101,076 of 109,414 mentions (about 92%).
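A rough sketch of the window check, assuming each episode JSON holds a list of segments shaped like {"start": ..., "text": ...} and that a mention's timestamp equals the start time of its segment. Both the shape and the function names are hypothetical.

```python
from typing import List


def wider_window(segments: List[dict], index: int, radius: int = 2) -> str:
    """Join the text of the segment at `index` and `radius` neighbours each side."""
    lo = max(0, index - radius)
    hi = min(len(segments), index + radius + 1)
    return " ".join(seg["text"] for seg in segments[lo:hi])


def is_validated(place: str, segments: List[dict], timestamp: float) -> bool:
    """True if `place` occurs (case-insensitively) within ±2 segments of the mention."""
    for i, seg in enumerate(segments):
        # Assumption: the stored timestamp is copied verbatim from the
        # segment's start time, so exact comparison is safe here.
        if seg["start"] == timestamp:
            return place.lower() in wider_window(segments, i).lower()
    return False  # timestamp not found in this episode
```

Joining the neighbouring segments before searching also catches names that the transcript splits across a segment boundary, such as "Covent" at the end of one segment and "Garden" at the start of the next.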
4. Rank & pick the lead quote
Each canonical place is scored by log(validated count) × poi-multiplier × specificity-multiplier − rejection-penalty + landmark-bonus. Log-damped count keeps countries from dominating; specificity rewards multi-word names like “Covent Garden” over the one-word “London”. For each entry we then pick the single best validated appearance (longest fact-shaped sentence containing the place name, avoiding ad-read and intro patterns) and extract a single-sentence hook from that window.
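The scoring formula above translates almost directly into code. In this sketch the natural log is assumed (the base is not stated), and `specificity_multiplier` is a hypothetical word-count rule invented for illustration; the real multiplier, penalty and bonus values are not given in the text.

```python
import math


def specificity_multiplier(name: str) -> float:
    """Hypothetical rule: each extra word adds 0.25, capped at 2.0."""
    words = len(name.split())
    return min(2.0, 1.0 + 0.25 * (words - 1))


def place_score(
    validated_count: int,
    poi_multiplier: float,
    specificity: float,
    rejection_penalty: float = 0.0,
    landmark_bonus: float = 0.0,
) -> float:
    """log(count) * poi * specificity - rejection penalty + landmark bonus."""
    if validated_count < 1:
        raise ValueError("a scored place needs at least one validated mention")
    damped = math.log(validated_count)  # assumption: natural log
    return damped * poi_multiplier * specificity - rejection_penalty + landmark_bonus


# With equal counts, the two-word name outranks the one-word name.
covent = place_score(50, 1.0, specificity_multiplier("Covent Garden"))
london = place_score(50, 1.0, specificity_multiplier("London"))
```

One consequence of the formula as written: a place with a single validated mention has a log term of zero, so its score comes entirely from the landmark bonus minus the rejection penalty.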
5. Manual overrides
Pattern-based auto-drops catch a lot of the NER noise (acronyms, lowercase common nouns, newspaper suffixes, websites), but a small hand-curated place_overrides.json handles the named exceptions: drops for people that NER tagged as places (Sean Connery, Napoleon, Indiana Jones), coordinate corrections for famously misgeocoded places (Eiffel Tower in Tennessee, Statue of Liberty in the Philippines), and a short list of fictional and space places that are moved under the “Fictional & space” toggle.
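To illustrate how an override file of this kind could be applied, here is a sketch with an invented schema. The keys "drop", "coords" and "fictional", the entry fields, and the sample "Mars" entry are all assumptions; the actual layout of place_overrides.json is not documented here.

```python
import json
from typing import Dict, List

# Hypothetical schema, for illustration only.
OVERRIDES_JSON = """
{
  "drop": ["Sean Connery", "Napoleon", "Indiana Jones"],
  "coords": {"Eiffel Tower": [48.8584, 2.2945]},
  "fictional": ["Mars"]
}
"""


def apply_overrides(entries: List[Dict], overrides: Dict) -> List[Dict]:
    """Drop named entries, fix coordinates, and flag fictional/space places."""
    dropped = set(overrides.get("drop", []))
    coords = overrides.get("coords", {})
    fictional = set(overrides.get("fictional", []))

    result = []
    for entry in entries:
        name = entry["name"]
        if name in dropped:
            continue  # a person or other non-place that NER let through
        updated = dict(entry)  # copy so the input list is left untouched
        if name in coords:
            updated["lat"], updated["lon"] = coords[name]
        updated["fictional"] = name in fictional
        result.append(updated)
    return result


overrides = json.loads(OVERRIDES_JSON)
```

Keeping the exceptions in data rather than code means a new correction is a one-line JSON change, and the pattern-based drops can stay generic.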
Numbers
- 1,268 episodes parsed
- 109,414 raw place mentions extracted
- 24,865 canonical place clusters after dedup
- 101,076 mentions validated against a wider transcript window
- 3,391 entries on the live map after auto + manual cleanup
- 96 audited entries shown by default (Show favourites + Recurring)
What's still wrong
Plenty. The geocoder still misplaces some entries (Hamburger → Bremen, Mars → Mars-the-French-village, Natural History Museum → Oxford's rather than London's). The long tail of one-off mentions still contains unaudited NER survivors (concepts, eras, brands). Some extracted hooks start mid-sentence in episodes where the auto-transcript dropped sentence-final punctuation. Fixes ship as problems are spotted.