Scripts Directory
This directory contains Python scripts for processing the Biweekly Plex Dispatch archive.
Scripts
extract_posts.py
Main extraction script - Converts HTML issue files into an Obsidian-compatible markdown wiki.
Features:
- Extracts individual posts from HTML files (each containing multiple posts)
- Generates author index pages with post lists
- Categorizes posts into 30 topic areas using confidence-based detection
- Creates year-based navigation
- Handles author name consolidation and Unicode characters
- Produces cross-linked markdown files with proper wikilink syntax
Input: HTML files in the issues/ directory at the project root
Output: Markdown files in the posts/, people/, topics/, and years/ directories
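The cross-linking described above can be sketched as follows. This is an illustration only, not the code in extract_posts.py: the helper names wikilink and render_post, the field layout, and the sample values are all assumptions. It relies only on the Obsidian wikilink form [[target|alias]] and on the output folders listed above.

```python
def wikilink(folder, name):
    """Return an Obsidian wikilink such as [[people/Ada Lovelace|Ada Lovelace]]."""
    return f"[[{folder}/{name}|{name}]]"


def render_post(title, author, year, topics, body):
    """Build the markdown text for one post (hypothetical layout)."""
    lines = [
        f"# {title}",
        "",
        # Each link targets one of the generated index folders.
        f"Author: {wikilink('people', author)}",
        f"Year: {wikilink('years', str(year))}",
        "Topics: " + ", ".join(wikilink("topics", t) for t in topics),
        "",
        body.strip(),
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Sample values are invented for the example.
    print(render_post("Hello", "Ada Lovelace", 2023, ["Tools"], "Body text."))
```

Keeping link construction in one helper means a change to the folder layout only needs to be made in one place.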
cleanup.py
Regeneration helper script - Safely removes generated content while preserving source files.
Features:
- Removes generated directories: posts/, people/, topics/, years/
- Removes generated README.md at project root
- Preserves source files and directories: issues/, scripts/, venv/, .git/
- Shows summary of remaining files after cleanup
Usage: Always run before regenerating the archive to ensure clean output.
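A minimal sketch of the cleanup behavior listed above, assuming the generated names are exactly the ones in this README. The function name clean and the constants GENERATED_DIRS and GENERATED_FILES are hypothetical; the real cleanup.py may be organized differently.

```python
import shutil
from pathlib import Path

# Names taken from the feature list above; anything not listed is left alone.
GENERATED_DIRS = ("posts", "people", "topics", "years")
GENERATED_FILES = ("README.md",)


def clean(root):
    """Delete generated output under root and return the names removed."""
    root = Path(root)
    removed = []
    for name in GENERATED_DIRS:
        target = root / name
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(name + "/")
    for name in GENERATED_FILES:
        target = root / name
        if target.is_file():
            target.unlink()
            removed.append(name)
    return removed


if __name__ == "__main__":
    # Assumes the script is run from the project root.
    print("Removed:", clean(Path.cwd()))
```

Working from an explicit allow-list of generated names, rather than deleting everything except the sources, keeps issues/, scripts/, venv/, and .git/ safe even if new directories are added later.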
requirements.txt
Python dependencies - Required packages for running the extraction scripts.
Key dependencies:
- beautifulsoup4 - HTML parsing
- lxml - XML/HTML parser backend
- Standard library modules (no additional installation needed)
Usage Workflow
- Setup environment:
    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    pip install -r requirements.txt
- Generate archive:
    # From project root
    python3 scripts/cleanup.py        # Remove previous output
    python3 scripts/extract_posts.py  # Generate new archive
- Expected output:
    ✅ Archive created successfully!
    📊 Extracted 614 posts from 52 authors across 4 years
    🏷️ Identified 30 topics across all posts
Development
Modifying Topic Detection
Edit the COMMON_TOPICS dictionary in extract_posts.py to add new categories or adjust keywords.
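As a rough illustration of keyword-based topic detection, the sketch below assumes COMMON_TOPICS maps a topic name to a list of keywords and treats the number of keyword hits as the confidence score. The topic names, keywords, the min_hits threshold, and the function detect_topics are all hypothetical; check extract_posts.py for the actual structure and scoring.

```python
import re

# Hypothetical entries; the real dictionary defines 30 topics.
COMMON_TOPICS = {
    "AI": ["artificial intelligence", "machine learning", "llm"],
    "Climate": ["climate", "carbon", "emissions"],
}


def detect_topics(text, topics=COMMON_TOPICS, min_hits=2):
    """Return topics whose keywords occur at least min_hits times, best first."""
    lowered = text.lower()
    scores = {}
    for topic, keywords in topics.items():
        # Count whole-word (or whole-phrase) occurrences of every keyword.
        hits = sum(
            len(re.findall(r"\b" + re.escape(keyword) + r"\b", lowered))
            for keyword in keywords
        )
        if hits >= min_hits:
            scores[topic] = hits
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    print(detect_topics("Carbon emissions and climate policy."))
```

Under this assumed shape, adding a category is a new key with its keyword list, and raising min_hits makes detection stricter.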
Author Management
Update AUTHOR_CONSOLIDATION mapping and KNOWN_AUTHORS set in extract_posts.py for name changes or additions.
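One way such a mapping might be applied is shown below. It assumes AUTHOR_CONSOLIDATION maps a variant spelling to a canonical name and KNOWN_AUTHORS holds canonical names; the sample entries and the helpers normalize_author and is_known are invented for the example.

```python
import unicodedata

# Hypothetical entries, for illustration only.
AUTHOR_CONSOLIDATION = {"Jon Smith": "Jonathan Smith"}
KNOWN_AUTHORS = {"Jonathan Smith", "Zo\u00eb Brown"}


def normalize_author(raw):
    """Return the canonical form of an author name."""
    # NFC composes characters such as "e" + combining diaeresis into "ë".
    name = unicodedata.normalize("NFC", raw)
    # Collapse runs of whitespace and trim the ends.
    name = " ".join(name.split())
    return AUTHOR_CONSOLIDATION.get(name, name)


def is_known(raw):
    """Check a raw name against KNOWN_AUTHORS after normalization."""
    return normalize_author(raw) in KNOWN_AUTHORS


if __name__ == "__main__":
    print(normalize_author("  Jon   Smith "))
```

Normalizing before the lookup means each variant needs only one entry in the mapping, regardless of spacing or how accented characters were encoded in the HTML.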
File Structure
The scripts expect this project structure:
/
├── issues/ # HTML source files (input)
├── posts/ # Generated post files (output)
├── people/ # Generated author pages (output)
├── topics/ # Generated topic pages (output)
├── years/ # Generated year pages (output)
└── scripts/ # This directory
Troubleshooting
"No HTML files found" - Ensure HTML files are in /issues directory with .html extension
"Module not found" - Run pip install -r requirements.txt in activated virtual environment
"Permission denied" - Check file permissions and that no files are open in other applications
Incomplete extraction - Run cleanup.py first to remove partial output, then re-run extract_posts.py
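The first two checks above can be automated before a run. The sketch below is an assumption-laden helper, not part of the existing scripts: the function name preflight and its messages are hypothetical. It uses bs4 and lxml as the import names for the beautifulsoup4 and lxml packages.

```python
import importlib.util
from pathlib import Path


def preflight(root, modules=("bs4", "lxml")):
    """Return a list of problems that would stop the extraction from running."""
    problems = []
    issues = Path(root) / "issues"
    # Only files with the .html extension directly under issues/ are counted.
    html_files = list(issues.glob("*.html")) if issues.is_dir() else []
    if not html_files:
        problems.append(f"No HTML files found in {issues}")
    for module in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is None:
            problems.append(
                f"Module not found: {module} (run pip install -r requirements.txt)"
            )
    return problems


if __name__ == "__main__":
    for problem in preflight(Path.cwd()):
        print(problem)
```

An empty list means both checks passed; permission problems and partial output still need the manual steps described above.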