HTML Media Extractor and Archiver
A Python script for extracting media URLs from HTML files and optionally downloading them while preserving directory structure.
Features
-
Comprehensive URL Detection: Finds media URLs from multiple sources including:
<img src="">
attributes<video src="">
andposter=""
attributes<source src="">
elements (for audio/video)srcset
attributes (responsive images)data-src
attributes (lazy loading)- CSS
url()
references in style attributes and tags - CSS background-image properties
-
Template Placeholder Support: Automatically replaces
__GHOST_URL__
placeholders with your base URL -
Smart File Organization: When downloading files:
- Preserves directory structure based on URL paths
- Creates folder hierarchy mimicking source domains
- Sanitizes filenames for safe filesystem storage
- Handles relative URLs when base URL is provided
-
Robust Downloading:
- Connection pooling for improved performance
- Comprehensive error handling
- Progress reporting with detailed output
- Skips already downloaded files
- Respectful delays between requests
Installation
Prerequisites
- Python 3.6 or higher
- Required packages:
pip3 install beautifulsoup4 requests
Download
Save the script as html_media_extractor.py
and make it executable:
chmod +x html_media_extractor.py
Usage
Basic Syntax
python3 html_media_extractor.py [directory] [options]
Arguments
directory
: Directory containing HTML files (required)-o, --output
: Output file for URLs (default:media_urls.txt
)-d, --download
: Download directory (optional)-b, --base-url
: Base URL for resolving relative URLs and replacing__GHOST_URL__
Examples
1. Extract URLs Only
Extract all media URLs from HTML files and save to text file:
python3 html_media_extractor.py /path/to/html/files
This creates media_urls.txt
with all found URLs.
2. Extract and Download
Extract URLs and download all media files:
python3 html_media_extractor.py /path/to/html/files -d ./media_archive
3. With Base URL (Recommended)
Handle relative URLs and __GHOST_URL__
placeholders:
python3 html_media_extractor.py /path/to/html/files \
-b https://example.com \
-d ./archive \
-o extracted_urls.txt
4. For Ghost CMS / Plex Files
Specifically for Ghost CMS exports or similar:
python3 html_media_extractor.py issues \
-d ./plex_media_archive \
-b https://plex.collectivesensecommons.org \
-o plex_urls.txt
Output
URL List File
The script creates a text file containing all unique URLs found, sorted alphabetically:
https://example.com/content/images/2025/09/image1.jpg
https://example.com/content/media/2025/09/video1.mp4
/content/images/local-image.png
Download Directory Structure
When downloading, files are organized by domain and path:
media_archive/
├── example.com/
│ ├── content/
│ │ ├── images/
│ │ │ └── 2025/
│ │ │ └── 09/
│ │ │ └── image1.jpg
│ │ └── media/
│ │ └── 2025/
│ │ └── 09/
│ │ └── video1.mp4
│ └── static/
│ └── style.css
Console Output
The script provides detailed progress information:
Found 3 HTML files to process
Processing: article1.html
Replaced __GHOST_URL__ with https://example.com
Found 12 media URLs
Processing: article2.html
Found 8 media URLs
Writing 18 unique URLs to media_urls.txt
Downloading files to ./archive
[1/18] Processing: https://example.com/image1.jpg
Downloading: https://example.com/image1.jpg -> archive/example.com/image1.jpg
[2/18] Processing: https://example.com/video1.mp4
Already exists: archive/example.com/video1.mp4
Supported URL Types
Image Formats
- Standard img tags:
<img src="image.jpg">
- Responsive images:
<img srcset="image-small.jpg 480w, image-large.jpg 800w">
- Lazy loading:
<img data-src="image.jpg">
Video/Audio Formats
- Video sources:
<video src="video.mp4">
- Video posters:
<video poster="thumbnail.jpg">
- Source elements:
<source src="audio.mp3">
CSS References
- Style attributes:
<div style="background-image: url('bg.jpg')">
- Style tags:
<style>.class { background: url('bg.jpg'); }</style>
Template Placeholders
- Ghost CMS:
__GHOST_URL__/content/images/image.jpg
Error Handling
The script handles various error conditions gracefully:
- Missing directories: Clear error message if HTML directory doesn't exist
- Unreadable files: Continues processing other files if one fails
- Download failures: Reports errors but continues with remaining files
- Network issues: Timeout handling and retry logic
- Invalid URLs: Filters out malformed URLs and data URIs
Performance Notes
- Uses connection pooling to improve download performance
- Includes small delays between requests to be respectful to servers
- Skips files that already exist locally
- Processes files in batches for memory efficiency
Troubleshooting
No URLs Found
- Check that HTML files contain media references
- Verify the base URL is correct for relative links
- Ensure HTML files are valid and parseable
Download Failures
- Check internet connection
- Verify URLs are accessible (not behind authentication)
- Some servers may block automated requests
Permission Errors
- Ensure write permissions in output directory
- Check that filenames don't contain invalid characters
License
This script is provided as-is for archival and backup purposes. Respect copyright and terms of service when downloading content.