Plex Archive Project
This project extracts individual posts from the Biweekly Plex Dispatch HTML archives and creates a Massive Wiki for easy navigation and cross-referencing.
Note: There's a tension between tweaking the extract script and re-running it vs. making edits in the extraction output files. At some point this will be resolved by better automation or deciding never to run the extract script again. Until then, check diffs carefully when making changes to either automation or output files.
The extraction script only generates these files/directories:
- posts/ directory and all markdown files within it
- authors/ directory and all files within it
- topics/ directory and all files within it
- years/ directory and all files within it
- issues/index.html
Project Structure
/
├── posts/ # Individual post markdown files (614 posts)
├── authors/ # Author index pages (53 authors)
├── topics/ # Topic index pages (30 topics)
├── years/ # Year index pages (2022-2025)
├── issues/ # Original HTML source files (87 issues)
├── scripts/ # Python extraction and utility scripts
├── venv/ # Python virtual environment
├── README.md # Main archive navigation (Massive Wiki entry point)
└── Project.md # This technical documentation
Scripts
Main Scripts
scripts/extract_posts.py
- Main extraction script with topic detection and author consolidationscripts/cleanup.py
- Removes generated content for clean regeneration
Requirements
- Python 3.x
- BeautifulSoup4 for HTML parsing
- See
scripts/requirements.txt
for full dependencies
Usage
Setup
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r scripts/requirements.txt
Generate Archive
source venv/bin/activate
python3 scripts/cleanup.py # Clean previous run
python3 scripts/extract_posts.py
Features
Content Organization
- 614 posts from 87 HTML issues (2022-2025)
- 53 authors with name consolidation and individual index pages
- 30 topic categories with improved keyword detection accuracy
- Cross-referencing between posts, authors, topics, and years
Topic Detection
Uses confidence-based scoring with:
- Multi-word phrase matching for specificity
- Occurrence counting to reduce false positives
- 30 categories from general (AI, Climate) to specific (Web3, Bioregionalism)
Author Management
- Hardcoded author list from "Kith and Kin" section
- Name consolidation (Pete → Peter Kaminski, George P → George Pór)
- Proper Unicode handling for international characters
Linking
- People links:
[[Author Name]]
using spaces in filenames - Issue links:
[2022-04-07](https://plex.collectivesensecommons.org/2022-04-07/)
to original Plex website - Topic links:
[[Topic Name]]
to topic index pages - Year links:
[[2022]]
to year index pages
Data Sources
- Original HTML: 87 issue files in
/issues
directory - Ghost Export:
biweekly-plex-dispatch.ghost.2025-09-17-17-08-49.json
(backup) - Known Authors: Extracted from 2025-09-04 "Kith and Kin" post
Development
Adding New Topics
Edit COMMON_TOPICS
dictionary in scripts/extract_posts.py
and regenerate.
Author Name Changes
Update AUTHOR_CONSOLIDATION
mapping and KNOWN_AUTHORS
set.
Regeneration
Always run scripts/cleanup.py
first to remove generated content while preserving source files and scripts.