Site Navigation

For LLMs: zip of all posts.

Edit on GitHub


HTML Media Extractor and Archiver

A Python script for extracting media URLs from HTML files and optionally downloading them while preserving directory structure.

Features

Installation

Prerequisites

pip3 install beautifulsoup4 requests

Download

Save the script as html_media_extractor.py and make it executable:

chmod +x html_media_extractor.py

Usage

Basic Syntax

python3 html_media_extractor.py [directory] [options]

Arguments

Examples

1. Extract URLs Only

Extract all media URLs from HTML files and save to text file:

python3 html_media_extractor.py /path/to/html/files

This creates media_urls.txt with all found URLs.

2. Extract and Download

Extract URLs and download all media files:

python3 html_media_extractor.py /path/to/html/files -d ./media_archive

3. With Base URL (Recommended)

Handle relative URLs and __GHOST_URL__ placeholders:

python3 html_media_extractor.py /path/to/html/files \
  -b https://example.com \
  -d ./archive \
  -o extracted_urls.txt

4. For Ghost CMS / Plex Files

Specifically for Ghost CMS exports or similar:

python3 html_media_extractor.py issues \
  -d ./plex_media_archive \
  -b https://plex.collectivesensecommons.org \
  -o plex_urls.txt

Output

URL List File

The script creates a text file containing all unique URLs found, sorted alphabetically:

https://example.com/content/images/2025/09/image1.jpg
https://example.com/content/media/2025/09/video1.mp4
/content/images/local-image.png

Download Directory Structure

When downloading, files are organized by domain and path:

media_archive/
├── example.com/
│   ├── content/
│   │   ├── images/
│   │   │   └── 2025/
│   │   │       └── 09/
│   │   │           └── image1.jpg
│   │   └── media/
│   │       └── 2025/
│   │           └── 09/
│   │               └── video1.mp4
│   └── static/
│       └── style.css

Console Output

The script provides detailed progress information:

Found 3 HTML files to process

Processing: article1.html
  Replaced __GHOST_URL__ with https://example.com
  Found 12 media URLs

Processing: article2.html
  Found 8 media URLs

Writing 18 unique URLs to media_urls.txt

Downloading files to ./archive

[1/18] Processing: https://example.com/image1.jpg
  Downloading: https://example.com/image1.jpg -> archive/example.com/image1.jpg

[2/18] Processing: https://example.com/video1.mp4
  Already exists: archive/example.com/video1.mp4

Supported URL Types

Image Formats

Video/Audio Formats

CSS References

Template Placeholders

Error Handling

The script handles various error conditions gracefully:

Performance Notes

Troubleshooting

No URLs Found

Download Failures

Permission Errors

License

This script is provided as-is for archival and backup purposes. Respect copyright and terms of service when downloading content.