The Plaice to Know

How it's built

The Plaice to Know is a five-stage pipeline that turns the full transcript archive into a ranked, curated map.

1. Extract

The raw input is podscripts.co's transcript archive: 1,268 episodes, ~109,000 timestamped segments, and ~6,300 words per episode on average. spaCy's named-entity recognizer pulls every place-shaped string out of the transcripts, yielding 26,091 candidate places across 109,414 (place, episode, timestamp) tuples.

2. Normalize & dedup

Raw NER output is messy. “Convert Garden”, “Cover Garden”, “Covenant Garden”, “the Covent Garden” and “Covent Garden” all show up as separate entries because the source is auto-transcribed audio. A fuzzy-merge pass collapses spelling variants, possessive forms, and transcription errors into 24,865 canonical clusters. Specific blocklists drop bare generics (“Castle”, “Park”), brand names, fictional places, and known noise.
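
As a rough illustration of the fuzzy-merge idea, here is a minimal sketch that uses Python's standard-library difflib. The 0.8 threshold, the greedy most-frequent-first strategy, the tiny blocklist, and the mention counts are all assumptions made for the example; the actual pass may use a different similarity measure and different rules.

```python
import re
from difflib import SequenceMatcher

# Illustrative blocklist only; the real lists are much longer.
BLOCKLIST = {"castle", "park"}


def normalize(name: str) -> str:
    """Lowercase, then strip a leading 'the' and a trailing possessive."""
    key = name.strip().lower()
    key = re.sub(r"^the\s+", "", key)
    key = re.sub(r"['’]s$", "", key)
    return key


def cluster(counts: dict[str, int], threshold: float = 0.8) -> dict[str, list[str]]:
    """Greedy merge: the most frequent spelling seen first becomes canonical."""
    clusters: dict[str, list[str]] = {}
    for name in sorted(counts, key=counts.get, reverse=True):
        key = normalize(name)
        if key in BLOCKLIST:
            continue  # bare generics are dropped outright
        for canonical in clusters:
            ratio = SequenceMatcher(None, key, normalize(canonical)).ratio()
            if ratio >= threshold:
                clusters[canonical].append(name)
                break
        else:
            clusters[name] = [name]  # no close match: start a new cluster
    return clusters


# Made-up mention counts, for demonstration only.
sample_counts = {
    "Covent Garden": 40,
    "Park": 9,
    "Camden Town": 5,
    "Convert Garden": 3,
    "Cover Garden": 2,
    "Covenant Garden": 1,
    "the Covent Garden": 1,
}
merged = cluster(sample_counts)
```

With this sample, all five Covent Garden spellings end up in one cluster, “Camden Town” stays on its own, and “Park” is dropped. A greedy pass like this depends on processing order, which is one reason to let the most frequent spelling win.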

3. Re-validate quotes

The original NER pass stored a 220-character context window per mention, but only ~62% of those windows actually contain the tagged place. So for each (canonical place, episode, timestamp) tuple, we open the original episode JSON and pull a wider window: the mention's segment plus two segments on either side. Any appearance where the place name does not occur in that wider window is rejected. That filter validated 101,076 of 109,414 mentions (about 92%).
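
The re-validation check can be sketched roughly as follows. The episode layout (a list of segments with "start" and "text" keys) is a hypothetical schema, and accepting any spelling variant from the cluster, matched case-insensitively, is an assumption; the real episode JSON and matching rule may differ.

```python
from bisect import bisect_right


def segment_index(segments: list[dict], timestamp: float) -> int:
    """Index of the last segment that starts at or before the timestamp."""
    starts = [seg["start"] for seg in segments]
    return max(0, bisect_right(starts, timestamp) - 1)


def wider_window(segments: list[dict], index: int, radius: int = 2) -> str:
    """Join the text of the segment at `index` and `radius` segments each side."""
    lo = max(0, index - radius)
    hi = min(len(segments), index + radius + 1)
    return " ".join(seg["text"] for seg in segments[lo:hi])


def is_validated(variants: list[str], segments: list[dict], timestamp: float) -> bool:
    """Keep a mention only if some spelling of the place occurs in the window."""
    window = wider_window(segments, segment_index(segments, timestamp)).lower()
    return any(variant.lower() in window for variant in variants)


# Invented episode data, for demonstration only.
episode = [
    {"start": 0.0, "text": "We begin in Camden Town."},
    {"start": 10.0, "text": "Then on to Covent Garden."},
    {"start": 20.0, "text": "It was busy."},
    {"start": 30.0, "text": "Very busy."},
    {"start": 40.0, "text": "Anyway."},
    {"start": 50.0, "text": "Moving on."},
]
kept = is_validated(["Covent Garden", "Convert Garden"], episode, 32.0)
rejected = is_validated(["Camden Town"], episode, 32.0)
```

Here the mention at 32 seconds falls in the fourth segment, so the window covers the second through sixth segments: “Covent Garden” is found and kept, while “Camden Town” sits outside the window and is rejected.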

4. Rank & pick the lead quote

Each canonical place is scored as log(validated count) × poi-multiplier × specificity-multiplier − rejection-penalty + landmark-bonus. Log-damping the count keeps frequently mentioned countries from dominating; the specificity multiplier rewards multi-word names like “Covent Garden” over the one-word “London”. For each entry we then pick the single best validated appearance (the longest fact-shaped sentence containing the place name, avoiding ad-read and intro patterns) and extract a single-sentence hook from that window.

5. Manual overrides

Pattern-based auto-drops catch a lot of NER noise (acronyms, lowercase common nouns, newspaper suffixes, websites), but a small hand-curated place_overrides.json handles the named exceptions: drops for people that NER caught (Sean Connery, Napoleon, Indiana Jones), coordinate corrections for famously misgeocoded places (Eiffel Tower in Tennessee, Statue of Liberty in the Philippines), and a short list of fictional/space places that get pulled into the “Fictional & space” toggle.
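
One way such an overrides file could be applied is sketched below. The JSON layout (the "drop", "coords", and "fictional" keys), the entry fields, and the "Narnia" entry are hypothetical and chosen only for the example; the real place_overrides.json may be structured differently.

```python
import json

# Hypothetical layout for place_overrides.json, inlined for the example.
OVERRIDES_JSON = """
{
  "drop": ["Sean Connery", "Napoleon", "Indiana Jones"],
  "coords": {"Eiffel Tower": [48.8584, 2.2945]},
  "fictional": ["Narnia"]
}
"""


def apply_overrides(entries: list[dict], overrides: dict) -> list[dict]:
    """Drop named non-places, fix coordinates, and flag fictional entries."""
    drop = set(overrides.get("drop", []))
    coords = overrides.get("coords", {})
    fictional = set(overrides.get("fictional", []))
    result = []
    for entry in entries:
        name = entry["name"]
        if name in drop:
            continue  # a person or character that NER tagged as a place
        fixed = dict(entry)  # copy so the input list is left untouched
        if name in coords:
            fixed["lat"], fixed["lon"] = coords[name]
        fixed["fictional"] = name in fictional
        result.append(fixed)
    return result


# Invented entries; the first mimics a geocoder hit in Tennessee.
entries = [
    {"name": "Eiffel Tower", "lat": 36.30, "lon": -88.33},
    {"name": "Napoleon", "lat": 41.39, "lon": -84.13},
    {"name": "Narnia", "lat": 0.0, "lon": 0.0},
]
curated = apply_overrides(entries, json.loads(OVERRIDES_JSON))
```

Keeping the exceptions in a data file rather than in code means a new drop or coordinate fix is a one-line change that does not touch the pipeline itself.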

What's still wrong

Plenty. The geocoder still misplaces some entries (Hamburger → Bremen, Mars → Mars-the-French-village, Natural History Museum → Oxford's rather than London's). The long tail of one-off mentions has unaudited NER survivors (concepts, eras, brands). Some extracted hooks start mid-sentence on episodes where the auto-transcript dropped sentence-final punctuation. Improvements ship as they're spotted.
