Scripts Directory
This directory contains Python scripts for processing the Biweekly Plex Dispatch archive.
Scripts
extract_posts.py
Main extraction script - Converts HTML issue files into an Obsidian-compatible markdown wiki.
Features:
- Extracts individual posts from HTML files (each containing multiple posts)
- Generates author index pages with post lists
- Categorizes posts into 30 topic areas using confidence-based detection
- Creates year-based navigation
- Handles author name consolidation and Unicode characters
- Produces cross-linked markdown files with proper wikilink syntax
Input: HTML files in the /issues directory
Output: Markdown files in the /posts, /people, /topics, and /years directories
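To illustrate the cross-linking described above, an author index page could be assembled roughly as follows. This is a minimal sketch, not the code in extract_posts.py: the function names `wikilink` and `author_index_page`, the placeholder author and post titles, the page layout, and the assumption that each post lives at posts/<title>.md are all hypothetical. Only the `[[target|label]]` form is standard Obsidian wikilink syntax.

```python
from typing import Optional


def wikilink(target: str, label: Optional[str] = None) -> str:
    """Format an Obsidian wikilink, with an optional display label."""
    return f"[[{target}|{label}]]" if label else f"[[{target}]]"


def author_index_page(author: str, post_titles: list[str]) -> str:
    """Build a markdown page listing one author's posts (hypothetical layout)."""
    lines = [f"# {author}", ""]
    # Assumption: each post is stored as posts/<title>.md
    lines += [f"- {wikilink('posts/' + title, title)}" for title in post_titles]
    return "\n".join(lines) + "\n"


# Placeholder data, not taken from the archive
print(author_index_page("Jane Doe", ["Example Post A", "Example Post B"]))
```

Keeping link formatting in one small helper means a change to the folder layout only has to be made in one place.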
cleanup.py
Regeneration helper script - Safely removes generated content while preserving source files.
Features:
- Removes generated directories: posts/, people/, topics/, years/
- Removes generated README.md at project root
- Preserves source files and directories: issues/, scripts/, venv/, .git/
- Shows summary of remaining files after cleanup
Usage: Always run before regenerating the archive to ensure clean output.
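The cleanup behavior listed above can be sketched as follows. The directory and file names come from the feature list; the function `clean_generated` and its return value are assumptions made for illustration, and the real cleanup.py may work differently (for example, it may ask for confirmation first).

```python
import shutil
from pathlib import Path

# Names taken from the feature list; everything else here is hypothetical.
GENERATED_DIRS = ("posts", "people", "topics", "years")
GENERATED_FILES = ("README.md",)


def clean_generated(root: Path) -> list[str]:
    """Delete generated output under root and return the names that remain."""
    for name in GENERATED_DIRS:
        # ignore_errors=True makes a missing directory a no-op
        shutil.rmtree(root / name, ignore_errors=True)
    for name in GENERATED_FILES:
        (root / name).unlink(missing_ok=True)  # missing_ok needs Python 3.8+
    # Anything not listed above (issues/, scripts/, venv/, .git/) is untouched
    return sorted(p.name for p in root.iterdir())
```

Because the function deletes only an explicit list of names, source directories are preserved by default rather than by exclusion. Try it against a scratch copy of the project before pointing it at the real root.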
requirements.txt
Python dependencies - Required packages for running the extraction scripts.
Key dependencies:
- beautifulsoup4 - HTML parsing
- lxml - XML/HTML parser backend
- Standard library modules (no additional installation needed)
Usage Workflow
- Set up environment:
    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    pip install -r requirements.txt
- Generate archive:
    # From project root
    python3 scripts/cleanup.py        # Remove previous output
    python3 scripts/extract_posts.py  # Generate new archive
- Expected output:
    ✅ Archive created successfully!
    📊 Extracted 614 posts from 52 authors across 4 years
    🏷️ Identified 30 topics across all posts
Development
Modifying Topic Detection
Edit the COMMON_TOPICS dictionary in extract_posts.py to add new categories or adjust keywords.
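The structure of COMMON_TOPICS is not shown in this README. The sketch below assumes it maps a topic name to a list of keywords, and it uses a simple count of matched keywords as a stand-in for the confidence score. The topic names, keywords, `detect_topics` function, and `min_hits` threshold are all hypothetical.

```python
import re

# Hypothetical shape: topic name -> keywords. The real COMMON_TOPICS may differ.
COMMON_TOPICS = {
    "Gardening": ["garden", "seeds", "compost"],
    "Music": ["album", "concert", "guitar"],
}


def detect_topics(text: str, min_hits: int = 2) -> list[str]:
    """Return topics with at least min_hits distinct keywords found in text."""
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    scores = {
        topic: sum(1 for kw in keywords if kw in words)
        for topic, keywords in COMMON_TOPICS.items()
    }
    # Highest-scoring topics first; ties broken alphabetically
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [topic for topic, hits in ranked if hits >= min_hits]


print(detect_topics("We planted seeds in the garden after the concert."))
# prints ['Gardening']
```

Under this assumed shape, adding a category is a new dictionary entry, and making detection stricter or looser is a matter of adjusting the threshold.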
Author Management
Update the AUTHOR_CONSOLIDATION mapping and the KNOWN_AUTHORS set in extract_posts.py for name changes or additions.
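As a rough illustration of how name consolidation and Unicode handling might fit together, the sketch below assumes AUTHOR_CONSOLIDATION maps variant spellings to a canonical name and KNOWN_AUTHORS holds the canonical names. The entries and the `canonical_author` helper are placeholders, not the contents of extract_posts.py.

```python
import unicodedata

# Hypothetical entries: variant spelling -> canonical name.
AUTHOR_CONSOLIDATION = {
    "J. Doe": "Jane Doe",
    "Jane D.": "Jane Doe",
}
# "Zo\u00eb" is "Zoë" written with a single precomposed code point.
KNOWN_AUTHORS = {"Jane Doe", "Zo\u00eb Example"}


def canonical_author(raw: str) -> str:
    """Normalize Unicode and whitespace, then apply the consolidation mapping."""
    # NFC composes a letter plus combining accent into one code point,
    # so visually identical names compare equal.
    name = unicodedata.normalize("NFC", raw)
    name = " ".join(name.split())  # trim and collapse runs of whitespace
    return AUTHOR_CONSOLIDATION.get(name, name)


print(canonical_author("J. Doe"))  # prints Jane Doe
print(canonical_author("Zoe\u0308  Example") in KNOWN_AUTHORS)  # prints True
```

Normalizing before the lookup means the mapping only needs entries for genuinely different spellings, not for encoding or spacing variants.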
File Structure
The scripts expect this project structure:
/
├── issues/ # HTML source files (input)
├── posts/ # Generated post files (output)
├── people/ # Generated author pages (output)
├── topics/ # Generated topic pages (output)
├── years/ # Generated year pages (output)
└── scripts/ # This directory
Troubleshooting
- "No HTML files found" - Ensure the HTML files are in the /issues directory and have the .html extension
- "Module not found" - Run pip install -r requirements.txt in the activated virtual environment
- "Permission denied" - Check file permissions and that no files are open in other applications
- Incomplete extraction - Run cleanup.py first to remove partial output, then re-run extract_posts.py
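A quick check like the one below can help narrow down a "No HTML files found" error before running the extraction. It is a hypothetical helper, not part of the scripts. It assumes the issue files sit directly inside issues/ and use a lowercase .html extension; whether extract_posts.py searches recursively or ignores case is not stated here.

```python
from pathlib import Path


def find_issue_files(root: Path) -> list[Path]:
    """Return the .html files directly inside root/issues, sorted by name."""
    issues_dir = root / "issues"
    if not issues_dir.is_dir():
        raise FileNotFoundError(f"Missing directory: {issues_dir}")
    # Note: this pattern is case-sensitive on Linux, so "issue.HTML" is skipped
    files = sorted(issues_dir.glob("*.html"))
    if not files:
        raise FileNotFoundError(f"No .html files in {issues_dir}")
    return files
```

Calling `find_issue_files(Path("."))` from the project root either returns the list of input files or raises an error naming the directory it looked in.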