# 🎤 Karaoke Video Downloader

A Python-based Windows CLI tool to download karaoke videos from YouTube channels/playlists using `yt-dlp.exe`, with advanced tracking, songlist prioritization, and flexible configuration.

## ✨ Features
- 🎵 **Channel & Playlist Downloads**: Download all videos from a YouTube channel or playlist
- 📂 **Organized Storage**: Each channel gets its own folder in `downloads/`
- 📝 **Robust Tracking**: Tracks all downloads, statuses, and formats in JSON
- 🏆 **Songlist Prioritization**: Prioritize or restrict downloads to a custom songlist
- 🔄 **Batch Saving & Caching**: Saves in batches and caches channel video lists to minimize API calls
- 🏷️ **ID3 Tagging**: Adds artist/title metadata to MP4 files
- 🧹 **Automatic Cleanup**: Removes extra yt-dlp files
- 📈 **Real-Time Progress**: Detailed console and log output
- 🧹 **Reset/Clear Channel**: Reset all tracking and files for a channel, or clear channel cache via CLI
- 🗂️ **Latest-per-channel download**: Download the latest N videos from each channel in a single batch, with server deduplication, fuzzy matching support, a per-channel download plan, robust resume, and a unique plan cache. Use `--latest-per-channel` together with `--limit N`.
- 🧩 **Fuzzy Matching**: Optionally use fuzzy string matching for songlist-to-video matching (enable with `--fuzzy-match`; install rapidfuzz for best results)
- ⚡ **Fast Mode with Early Exit**: When a limit is set, scans channels and songs in order, downloads immediately when a match is found, and stops as soon as the limit is reached with successful downloads
- 🔄 **Deduplication Across Channels**: Ensures the same song is not downloaded from multiple channels, even if it appears in more than one channel's video list
- 📋 **Default Channel File**: Automatically uses `data/channels.txt` as the default channel list for songlist modes (no need to specify `--file` every time)
- 🛡️ **Robust Interruption Handling**: Progress is saved after each download, preventing re-downloads if the process is interrupted
- ⚡ **Optimized Scanning**: High-performance channel scanning with O(n×m) complexity, pre-processed lookups, and early termination for faster matching
- 🏷️ **Server Duplicates Tracking**: Automatically checks against the local `songs.json` file and marks duplicates for future skipping, preventing re-downloads of songs already on the server
- ⚡ **Parallel Downloads**: Enable concurrent downloads with `--parallel --workers N` for significantly faster batch downloads (3-5x speedup)

## 🏗️ Architecture
The codebase has been comprehensively refactored into a modular architecture with centralized utilities for improved maintainability, error handling, and code reuse:

### Core Modules:
- **`downloader.py`**: Main orchestrator and CLI interface
- **`video_downloader.py`**: Core video download execution and orchestration
- **`tracking_manager.py`**: Download tracking and status management
- **`download_planner.py`**: Download plan building and channel scanning
- **`cache_manager.py`**: Cache operations and file I/O management
- **`channel_manager.py`**: Channel and file management operations
- **`songlist_manager.py`**: Songlist operations and tracking
- **`server_manager.py`**: Server song availability checking
- **`fuzzy_matcher.py`**: Fuzzy matching logic and similarity functions

### Utility Modules (v3.2):
- **`youtube_utils.py`**: Centralized YouTube operations and yt-dlp command generation
- **`error_utils.py`**: Standardized error handling and formatting
- **`download_pipeline.py`**: Abstracted download → verify → tag → track pipeline
- **`id3_utils.py`**: ID3 tagging utilities
- **`config_manager.py`**: Configuration management
- **`resolution_cli.py`**: Resolution checking utilities
- **`tracking_cli.py`**: Tracking management CLI

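As a rough illustration of what centralized yt-dlp command generation can look like, the sketch below builds an argument list for `subprocess.run`. The helper name `build_ytdlp_command`, its parameters, and the chosen format selector are assumptions made for this example and are not taken from `youtube_utils.py`; the yt-dlp options used (`-f`, `-o`, `--merge-output-format`, `--no-playlist`) are standard ones.

```python
from pathlib import Path
from typing import List


def build_ytdlp_command(
    video_url: str,
    output_path: Path,
    max_height: int = 720,
    ytdlp_path: Path = Path("downloader/yt-dlp.exe"),
) -> List[str]:
    """Return an argument list for subprocess.run (hypothetical helper)."""
    # Prefer separate video+audio streams capped at max_height,
    # otherwise fall back to the best combined stream under the cap.
    format_selector = (
        f"bestvideo[height<={max_height}]+bestaudio/best[height<={max_height}]"
    )
    return [
        str(ytdlp_path),
        "-f", format_selector,
        "--merge-output-format", "mp4",
        "--no-playlist",
        "-o", str(output_path),
        video_url,
    ]


cmd = build_ytdlp_command(
    "https://www.youtube.com/watch?v=VIDEO_ID",  # placeholder URL
    Path("downloads/SingKingKaraoke/Artist - Title.mp4"),
)
print(cmd)
```

Building the command as a list (rather than one shell string) avoids quoting problems with titles that contain spaces or special characters.
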
### New Utility Modules (v3.3):
- **`parallel_downloader.py`**: Parallel download management with thread-safe operations
  - `ParallelDownloader` class: Manages concurrent downloads with configurable workers
  - `DownloadTask` and `DownloadResult` dataclasses: Structured task and result management
  - Thread-safe progress tracking and error handling
  - Automatic retry mechanism for failed downloads
- **`file_utils.py`**: Centralized file operations, filename sanitization, and file validation
  - `sanitize_filename()`: Create safe filenames from artist/title
  - `generate_possible_filenames()`: Generate filename patterns for different modes
  - `check_file_exists_with_patterns()`: Check for existing files using multiple patterns
  - `is_valid_mp4_file()`: Validate MP4 files with header checking
  - `cleanup_temp_files()`: Remove temporary yt-dlp files
  - `ensure_directory_exists()`: Safe directory creation

- **`song_validator.py`**: Centralized song validation logic
  - `SongValidator` class: Unified logic for checking if songs should be downloaded
  - `should_skip_song()`: Comprehensive validation with multiple criteria
  - `mark_song_failed()`: Consistent failure tracking
  - `handle_download_failure()`: Standardized error handling

- **Enhanced `config_manager.py`**: Robust configuration management with dataclasses
  - `ConfigManager` class: Type-safe configuration loading and caching
  - `DownloadSettings`, `FolderStructure`, `LoggingConfig` dataclasses
  - Configuration validation and merging with defaults
  - Dynamic resolution updates

### Benefits:
- **Centralized Utilities**: Common operations (file operations, song validation, yt-dlp commands, error handling) are centralized
- **Reduced Duplication**: Eliminated ~150 lines of code duplication across modules
- **Consistency**: Standardized error messages and processing pipelines
- **Maintainability**: Changes isolated to specific modules
- **Testability**: Modular components can be tested independently
- **Type Safety**: Comprehensive type hints across all new modules

## 📋 Requirements
- **Windows 10/11**
- **Python 3.7+**
- **yt-dlp.exe** (in `downloader/`)
- **mutagen** (for ID3 tagging, optional)
- **ffmpeg/ffprobe** (for video validation, optional but recommended)
- **rapidfuzz** (for fuzzy matching, optional, falls back to difflib)

## 🚀 Quick Start

> **💡 Pro Tip**: For a complete list of all available commands, see `commands.txt` - you can copy/paste any command directly into your terminal!

### Download a Channel
```bash
python download_karaoke.py https://www.youtube.com/@SingKingKaraoke/videos
```

### Download Only Songlist Songs (Fast Mode)
```bash
python download_karaoke.py --songlist-only --limit 5
```

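The early-exit behavior of fast mode can be pictured with the sketch below. It is only an illustration under simplifying assumptions: `fast_mode_scan` is a hypothetical function, matching is exact on lowercased titles, and the place where the real tool would download and count only successful downloads is marked with a comment.

```python
from typing import Dict, List, Set, Tuple


def fast_mode_scan(
    channels: Dict[str, List[str]],
    wanted_titles: Set[str],
    limit: int,
) -> List[Tuple[str, str]]:
    """Pick up to `limit` (channel, title) matches, stopping once the limit is hit."""
    picked: List[Tuple[str, str]] = []
    seen: Set[str] = set()
    for channel, video_titles in channels.items():
        for title in video_titles:
            key = title.lower()
            if key in wanted_titles and key not in seen:
                # The real tool would download here and count only successes.
                picked.append((channel, title))
                seen.add(key)
                if len(picked) >= limit:
                    return picked  # early exit: later channels are never scanned
    return picked


channels = {
    "ChannelA": ["Cruel Summer", "Not Like Us"],
    "ChannelB": ["Cruel Summer", "Happier Than Ever"],
}
wanted = {"cruel summer", "happier than ever", "not like us"}
picked = fast_mode_scan(channels, wanted, limit=2)
print(picked)
```

With a small limit, the scan returns before the second channel is touched, which is where the speed advantage over a full pre-scan comes from.
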
### Download with Parallel Processing
```bash
python download_karaoke.py --parallel --workers 5 --songlist-only --limit 10
```

### Focus on Specific Playlists by Title
```bash
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100"
```

### Download with Fuzzy Matching
```bash
python download_karaoke.py --songlist-only --limit 10 --fuzzy-match --fuzzy-threshold 85
```

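To make the threshold concrete, here is a minimal sketch of threshold-based matching with the rapidfuzz-to-difflib fallback listed in the requirements. The function names `similarity` and `is_fuzzy_match` are assumptions for this example, and the scoring in `fuzzy_matcher.py` may well differ (for instance in how titles are normalized).

```python
from difflib import SequenceMatcher

try:
    from rapidfuzz import fuzz  # optional dependency

    def similarity(a: str, b: str) -> float:
        """Similarity on a 0-100 scale using rapidfuzz."""
        return float(fuzz.ratio(a, b))

except ImportError:

    def similarity(a: str, b: str) -> float:
        """Similarity on a 0-100 scale using the standard library."""
        # difflib returns 0.0-1.0, so scale it to match the CLI threshold.
        return SequenceMatcher(None, a, b).ratio() * 100.0


def is_fuzzy_match(song: str, video_title: str, threshold: float = 85.0) -> bool:
    """Compare lowercased, stripped strings against the threshold."""
    return similarity(song.lower().strip(), video_title.lower().strip()) >= threshold


typo_match = is_fuzzy_match("Taylor Swift - Cruel Summer", "Taylor Swift - Cruel Sumer")
unrelated = is_fuzzy_match("Taylor Swift - Cruel Summer", "Kendrick Lamar - Not Like Us")
print(typo_match, unrelated)
```

A higher threshold reduces false positives at the cost of missing titles with larger differences.
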
### Download Latest N Videos Per Channel
```bash
python download_karaoke.py --latest-per-channel --limit 5
```

### Download Latest N Videos Per Channel (with fuzzy matching)
```bash
python download_karaoke.py --latest-per-channel --limit 5 --fuzzy-match --fuzzy-threshold 85
```

### Prioritize Songlist in Download Queue
```bash
python download_karaoke.py --songlist-priority
```

### Show Songlist Download Progress
```bash
python download_karaoke.py --songlist-status
```

### Limit Number of Downloads
```bash
python download_karaoke.py --limit 5
```

### Override Resolution
```bash
python download_karaoke.py --resolution 1080p
```

### **Reset/Start Over for a Channel**
```bash
python download_karaoke.py --reset-channel SingKingKaraoke
```

### **Reset Channel and Songlist Songs**
```bash
python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist
```

### **Clear Channel Cache**
```bash
python download_karaoke.py --clear-cache SingKingKaraoke
python download_karaoke.py --clear-cache all
```

## 🧠 Songlist Integration
- Place your prioritized song list in `data/songList.json` (see example format below).
- The tool will match and prioritize these songs across all available channel videos.
- Use `--songlist-only` to download only these songs, or `--songlist-priority` to prioritize them in the queue.
- Use `--songlist-focus` to download only songs from specific playlists by title (e.g., `--songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100"`).
- Download progress for the songlist is tracked globally in `data/songlist_tracking.json`.

### Example `data/songList.json`
```json
[
  {
    "title": "2025 - Apple Top 50",
    "songs": [
      { "artist": "Kendrick Lamar & SZA", "title": "luther", "position": 1 },
      { "artist": "Kendrick Lamar", "title": "Not Like Us", "position": 2 }
    ]
  },
  {
    "title": "2024 - Billboard Hot 100",
    "songs": [
      { "artist": "Taylor Swift", "title": "Cruel Summer", "position": 1 },
      { "artist": "Billie Eilish", "title": "Happier Than Ever", "position": 2 }
    ]
  }
]
```

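For readers who want to inspect or pre-filter a songlist themselves, the sketch below parses the structure shown above and flattens it into one list of songs, optionally restricted to certain playlist titles in the spirit of `--songlist-focus`. The function `flatten_songlist` is a hypothetical helper, not part of `songlist_manager.py`, and the JSON is inlined only to keep the example self-contained.

```python
import json
from typing import Dict, List, Optional, Sequence

# Inline copy of the documented format; a real script would read data/songList.json.
SONGLIST_JSON = """
[
  {"title": "2025 - Apple Top 50",
   "songs": [{"artist": "Kendrick Lamar & SZA", "title": "luther", "position": 1}]},
  {"title": "2024 - Billboard Hot 100",
   "songs": [{"artist": "Taylor Swift", "title": "Cruel Summer", "position": 1}]}
]
"""


def flatten_songlist(
    playlists: List[dict], focus: Optional[Sequence[str]] = None
) -> List[Dict[str, str]]:
    """Return one entry per song, keeping only playlists named in `focus` if given."""
    songs: List[Dict[str, str]] = []
    for playlist in playlists:
        if focus and playlist["title"] not in focus:
            continue
        for song in playlist.get("songs", []):
            songs.append(
                {
                    "playlist": playlist["title"],
                    "artist": song["artist"],
                    "title": song["title"],
                }
            )
    return songs


playlists = json.loads(SONGLIST_JSON)
all_songs = flatten_songlist(playlists)
focused = flatten_songlist(playlists, focus=["2025 - Apple Top 50"])
print(len(all_songs), len(focused))
```
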
## 🛠️ Tracking & Caching
- **data/karaoke_tracking.json**: Tracks all downloads, statuses, and formats
- **data/songlist_tracking.json**: Tracks global songlist download progress
- **data/server_duplicates_tracking.json**: Tracks songs found to be duplicates on the server for future skipping
- **data/channel_cache.json**: Caches channel video lists for performance

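Since these JSON files are rewritten as downloads progress, a save that is interrupted halfway should not leave a corrupted file behind. One common way to achieve that, shown here purely as a sketch, is to write to a temporary file and then swap it into place with `os.replace`. Whether `tracking_manager.py` does this is not stated in this README; `load_tracking`, `save_tracking`, and the record fields are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict


def load_tracking(path: Path) -> Dict[str, Any]:
    """Return the tracking data, or an empty dict if the file does not exist yet."""
    if not path.exists():
        return {}
    with path.open("r", encoding="utf-8") as fh:
        return json.load(fh)


def save_tracking(path: Path, data: Dict[str, Any]) -> None:
    """Write to a temp file in the same folder, then atomically replace the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(data, fh, indent=2, ensure_ascii=False)
    os.replace(tmp_name, path)


with tempfile.TemporaryDirectory() as tmp_dir:
    tracking_path = Path(tmp_dir) / "karaoke_tracking.json"
    tracking = load_tracking(tracking_path)
    # Hypothetical record layout, saved right after a download finishes.
    tracking["VIDEO_ID"] = {"status": "downloaded", "format": "mp4"}
    save_tracking(tracking_path, tracking)
    reloaded = load_tracking(tracking_path)
print(reloaded)
```
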
## 📂 Folder Structure
```
KaroakeVideoDownloader/
├── commands.txt              # Complete CLI commands reference (copy/paste ready)
├── karaoke_downloader/       # All core Python code and utilities
│   ├── downloader.py         # Main orchestrator and CLI interface
│   ├── cli.py                # CLI entry point
│   ├── video_downloader.py   # Core video download execution and orchestration
│   ├── tracking_manager.py   # Download tracking and status management
│   ├── download_planner.py   # Download plan building and channel scanning
│   ├── cache_manager.py      # Cache operations and file I/O management
│   ├── channel_manager.py    # Channel and file management operations
│   ├── songlist_manager.py   # Songlist operations and tracking
│   ├── server_manager.py     # Server song availability checking
│   ├── fuzzy_matcher.py      # Fuzzy matching logic and similarity functions
│   ├── youtube_utils.py      # Centralized YouTube operations and yt-dlp commands
│   ├── error_utils.py        # Standardized error handling and formatting
│   ├── download_pipeline.py  # Abstracted download → verify → tag → track pipeline
│   ├── id3_utils.py          # ID3 tagging utilities
│   ├── config_manager.py     # Configuration management with dataclasses
│   ├── file_utils.py         # Centralized file operations and filename handling
│   ├── song_validator.py     # Centralized song validation logic
│   ├── check_resolution.py   # Resolution checker utility
│   ├── resolution_cli.py     # Resolution config CLI
│   └── tracking_cli.py       # Tracking management CLI
├── data/                     # All config, tracking, cache, and songlist files
│   ├── config.json
│   ├── karaoke_tracking.json
│   ├── songlist_tracking.json
│   ├── channel_cache.json
│   ├── channels.txt
│   └── songList.json
├── downloads/                # All video output
│   └── [ChannelName]/        # Per-channel folders
├── logs/                     # Download logs
├── downloader/yt-dlp.exe     # yt-dlp binary
├── tests/                    # Diagnostic and test scripts
│   └── test_installation.py
├── download_karaoke.py       # Main entry point (thin wrapper)
├── README.md
├── PRD.md
├── requirements.txt
└── download_karaoke.bat      # (optional Windows launcher)
```

## 🚦 CLI Options

> **📋 Complete Command Reference**: See `commands.txt` for all available commands with examples - perfect for copy/paste!

### Key Options:
- `--file <data/channels.txt>`: Download from a list of channels (optional, defaults to `data/channels.txt` for songlist modes)
- `--songlist-priority`: Prioritize songlist songs in download queue
- `--songlist-only`: Download only songs from the songlist
- `--songlist-focus <PLAYLIST_TITLE1> <PLAYLIST_TITLE2>...`: Focus on specific playlists by title (e.g., `--songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100"`)
- `--songlist-status`: Show songlist download progress
- `--limit <N>`: Limit number of downloads (enables fast mode with early exit)
- `--resolution <720p|1080p|...>`: Override resolution
- `--status`: Show download/tracking status
- `--reset-channel <CHANNEL_NAME>`: **Reset all tracking and files for a channel**
- `--reset-songlist`: **When used with `--reset-channel`, also reset songlist songs for this channel**
- `--clear-cache <CHANNEL_ID|all>`: **Clear channel video cache for a specific channel or all**
- `--clear-server-duplicates`: **Clear server duplicates tracking (allows re-checking songs against server)**
- `--latest-per-channel`: **Download the latest N videos from each channel (use with `--limit`)**
- `--fuzzy-match`: Enable fuzzy matching for songlist-to-video matching (uses rapidfuzz if available)
- `--fuzzy-threshold <N>`: Fuzzy match threshold (0-100, default 85)
- `--parallel`: Enable parallel downloads for improved speed
- `--workers <N>`: Number of parallel download workers (1-10, default: 3)

## 📝 Example Usage

> **💡 For complete examples**: See `commands.txt` for all command variations with explanations!

```bash
# Fast mode with fuzzy matching (no need to specify --file)
python download_karaoke.py --songlist-only --limit 10 --fuzzy-match --fuzzy-threshold 85

# Parallel downloads for faster processing
python download_karaoke.py --parallel --workers 5 --songlist-only --limit 10

# Latest videos per channel with parallel downloads
python download_karaoke.py --parallel --workers 3 --latest-per-channel --limit 5

# Traditional full scan (no limit)
python download_karaoke.py --songlist-only

# Channel-specific operations
python download_karaoke.py --reset-channel SingKingKaraoke
python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist
python download_karaoke.py --clear-cache all
python download_karaoke.py --clear-server-duplicates
```

## 🏷️ ID3 Tagging
- Adds artist/title/album/genre to MP4 files using mutagen (if installed)

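For orientation, this is roughly how tagging an MP4 with mutagen can look. MP4 files store tags as iTunes-style atoms rather than ID3 frames, and mutagen exposes them through `mutagen.mp4.MP4` with keys such as `©nam` (title) and `©ART` (artist). The function `tag_mp4` and its defaults are assumptions for this sketch and are not taken from `id3_utils.py`; it returns `False` instead of raising when mutagen is missing or the file cannot be tagged, mirroring the optional nature of the dependency.

```python
from pathlib import Path


def tag_mp4(path: Path, artist: str, title: str,
            album: str = "", genre: str = "Karaoke") -> bool:
    """Write basic tags to an MP4 file; return False if tagging is not possible."""
    try:
        from mutagen.mp4 import MP4  # optional dependency
    except ImportError:
        return False
    try:
        video = MP4(str(path))
        video["\xa9ART"] = [artist]
        video["\xa9nam"] = [title]
        if album:
            video["\xa9alb"] = [album]
        if genre:
            video["\xa9gen"] = [genre]
        video.save()
    except Exception:
        # Missing file, unreadable container, or write failure.
        return False
    return True


tagged = tag_mp4(Path("does_not_exist.mp4"), "Taylor Swift", "Cruel Summer")
print(tagged)
```
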
## 🧹 Cleanup
- Removes `.info.json` and `.meta` files after download

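A cleanup pass of this kind can be sketched in a few lines with `pathlib`. The helper below is hypothetical (its name and signature are not those of `cleanup_temp_files()` in `file_utils.py`) and simply deletes files in one folder whose names end in the two suffixes mentioned above.

```python
import tempfile
from pathlib import Path

SIDECAR_SUFFIXES = (".info.json", ".meta")


def remove_sidecar_files(folder: Path) -> int:
    """Delete leftover sidecar files in `folder` and return how many were removed."""
    removed = 0
    # Materialize the listing first so deletion does not disturb iteration.
    for entry in list(folder.iterdir()):
        if entry.is_file() and entry.name.endswith(SIDECAR_SUFFIXES):
            entry.unlink()
            removed += 1
    return removed


with tempfile.TemporaryDirectory() as tmp_dir:
    folder = Path(tmp_dir)
    for name in ("Song.mp4", "Song.info.json", "Song.meta"):
        (folder / name).write_text("placeholder")
    removed_count = remove_sidecar_files(folder)
    remaining = sorted(p.name for p in folder.iterdir())
print(removed_count, remaining)
```
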
## 🛠️ Configuration
- All options are in `data/config.json` (format, resolution, metadata, etc.)
- You can edit this file or use CLI flags to override

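The interplay of defaults, `config.json` values, and CLI overrides can be illustrated with a small dataclass-based sketch. The field names (`format`, `resolution`, `add_metadata`) and the `load_settings` function are assumptions chosen for the example; the actual keys in `data/config.json` and the API of `ConfigManager` may differ.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class DownloadSettings:
    # Hypothetical defaults, used when config.json omits a key.
    format: str = "mp4"
    resolution: str = "720p"
    add_metadata: bool = True


def load_settings(
    raw: Dict[str, Any], cli_resolution: Optional[str] = None
) -> DownloadSettings:
    """Merge file values over defaults, then apply a CLI override if present."""
    known = {f.name for f in fields(DownloadSettings)}
    settings = DownloadSettings(**{k: v for k, v in raw.items() if k in known})
    if cli_resolution:
        settings = replace(settings, resolution=cli_resolution)
    return settings


from_file = load_settings({"resolution": "1080p", "unknown_key": 1})
overridden = load_settings({"resolution": "1080p"}, cli_resolution="720p")
print(from_file, overridden)
```

Ignoring unknown keys keeps older or hand-edited config files loadable, at the cost of silently skipping typos; a stricter variant could report them instead.
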
## 📋 Command Reference File

**`commands.txt`** contains a comprehensive list of all CLI commands with explanations. This file is designed for easy copy/paste usage and includes:
- All basic download commands
- Songlist operations
- Latest-per-channel downloads
- Cache and tracking management
- Reset and cleanup operations
- Advanced combinations
- Common workflows
- Troubleshooting commands

> **🔄 Maintenance Note**: The `commands.txt` file should be kept up to date with any CLI changes. When adding new command-line options or modifying existing ones, update this file to reflect all available commands and their usage.

## 🔧 Refactoring Improvements (v3.3)
The codebase has been comprehensively refactored to improve maintainability and reduce code duplication. Recent improvements have enhanced reliability, performance, and code organization:

### **New Utility Modules (v3.3)**
- **`file_utils.py`**: Centralized file operations, filename sanitization, and file validation
  - `sanitize_filename()`: Create safe filenames from artist/title
  - `generate_possible_filenames()`: Generate filename patterns for different modes
  - `check_file_exists_with_patterns()`: Check for existing files using multiple patterns
  - `is_valid_mp4_file()`: Validate MP4 files with header checking
  - `cleanup_temp_files()`: Remove temporary yt-dlp files
  - `ensure_directory_exists()`: Safe directory creation

- **`song_validator.py`**: Centralized song validation logic
  - `SongValidator` class: Unified logic for checking if songs should be downloaded
  - `should_skip_song()`: Comprehensive validation with multiple criteria
  - `mark_song_failed()`: Consistent failure tracking
  - `handle_download_failure()`: Standardized error handling

- **Enhanced `config_manager.py`**: Robust configuration management with dataclasses
  - `ConfigManager` class: Type-safe configuration loading and caching
  - `DownloadSettings`, `FolderStructure`, `LoggingConfig` dataclasses
  - Configuration validation and merging with defaults
  - Dynamic resolution updates

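To give a feel for the kind of helpers `file_utils.py` centralizes, here is a minimal sketch of filename sanitization and MP4 header checking. The names `make_safe_filename` and `looks_like_mp4` are deliberately different from the module's own functions, whose signatures are not documented here. The header check relies on the fact that MP4 files normally begin with an `ftyp` box, so bytes 4-8 usually read `ftyp`; it is a quick plausibility test, not full validation.

```python
import re
import tempfile
from pathlib import Path

# Characters that Windows does not allow in filenames.
_INVALID_CHARS = re.compile(r'[<>:"/\\|?*]')


def make_safe_filename(artist: str, title: str, extension: str = ".mp4") -> str:
    """Build 'Artist - Title.mp4' with invalid characters removed."""
    name = _INVALID_CHARS.sub("", f"{artist} - {title}")
    name = re.sub(r"\s+", " ", name).strip(" .")
    return f"{name}{extension}"


def looks_like_mp4(path: Path) -> bool:
    """Return True if the file starts with an 'ftyp' box, as MP4 files usually do."""
    try:
        with path.open("rb") as fh:
            header = fh.read(12)
    except OSError:
        return False
    return len(header) >= 8 and header[4:8] == b"ftyp"


with tempfile.TemporaryDirectory() as tmp_dir:
    good = Path(tmp_dir) / "good.mp4"
    good.write_bytes(b"\x00\x00\x00\x18ftypisom" + b"\x00" * 16)
    bad = Path(tmp_dir) / "bad.mp4"
    bad.write_bytes(b"<html>not a video</html>")
    good_ok, bad_ok = looks_like_mp4(good), looks_like_mp4(bad)

safe_name = make_safe_filename("AC/DC", "Back in Black?")
print(safe_name, good_ok, bad_ok)
```
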
### **Benefits Achieved**
- **Eliminated Code Duplication**: ~150 lines of duplicate code removed across modules
- **Centralized File Operations**: Single source of truth for filename handling and file validation
- **Unified Song Validation**: Consistent logic for checking if songs should be downloaded
- **Enhanced Type Safety**: Comprehensive type hints across all new modules
- **Improved Configuration Management**: Structured configuration with validation and caching
- **Better Error Handling**: Consistent patterns via centralized utilities
- **Enhanced Maintainability**: Changes to file operations or song validation only require updates in one place
- **Improved Testability**: Modular components can be tested independently
- **Better Developer Experience**: Clear function signatures and comprehensive documentation

### **New Parallel Download System (v3.4)**
- **Parallel downloader module:** `parallel_downloader.py` provides thread-safe concurrent download management
- **Configurable concurrency:** Use `--parallel --workers N` to enable parallel downloads with N workers (1-10)
- **Thread-safe operations:** All tracking, caching, and progress operations are thread-safe
- **Real-time progress tracking:** Shows active downloads, completion status, and overall progress
- **Automatic retry mechanism:** Failed downloads are automatically retried with reduced concurrency
- **Backward compatibility:** Sequential downloads remain the default when `--parallel` is not used
- **Performance improvements:** Significantly faster downloads for large batches (3-5x speedup with 3-5 workers)
- **Integrated with all modes:** Works with both songlist-across-channels and latest-per-channel download modes

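The general shape of a worker-pool downloader with thread-safe progress reporting is sketched below using `concurrent.futures`. It is an illustration only: the fields of `DownloadTask` and `DownloadResult`, the `run_parallel` function, and the fake `download` callable are assumptions, and the retry logic described above is left out for brevity.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DownloadTask:
    video_id: str
    title: str


@dataclass
class DownloadResult:
    task: DownloadTask
    success: bool
    error: str = ""


def run_parallel(
    tasks: List[DownloadTask],
    download: Callable[[DownloadTask], bool],
    workers: int = 3,
) -> List[DownloadResult]:
    """Run `download` for each task on a thread pool and collect the results."""
    workers = max(1, min(workers, 10))  # clamp to the documented 1-10 range
    results: List[DownloadResult] = []
    lock = threading.Lock()

    def run_one(task: DownloadTask) -> None:
        try:
            result = DownloadResult(task, download(task))
        except Exception as exc:  # a failing task must not stop the others
            result = DownloadResult(task, False, str(exc))
        with lock:  # protect the shared list and keep progress lines intact
            results.append(result)
            status = "ok" if result.success else "failed"
            print(f"[{len(results)}/{len(tasks)}] {task.title}: {status}")

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(run_one, tasks))  # wait for every task to finish
    return results


def fake_download(task: DownloadTask) -> bool:
    """Stand-in for a real download; fails for one specific ID."""
    if task.video_id == "bad":
        raise RuntimeError("simulated network error")
    return True


tasks = [DownloadTask("a1", "Song A"), DownloadTask("bad", "Song B"), DownloadTask("c3", "Song C")]
results = run_parallel(tasks, fake_download, workers=3)
```

Results arrive in completion order rather than submission order, so callers that care about ordering should sort by task afterwards.
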
### **Previous Improvements (v3.2)**
- **Centralized yt-dlp Command Generation**: Standardized command building and execution across all download operations
- **Enhanced Error Handling**: Structured exception hierarchy with consistent error messages and formatting
- **Abstracted Download Pipeline**: Reusable download → verify → tag → track process for consistent processing
- **Download plan pre-scan:** Before downloading, the tool scans all channels for songlist matches, builds a download plan, and prints stats (matches, unmatched, per-channel breakdown). The plan is cached for 1 day and reused unless `--force-download-plan` is set.
- **Latest-per-channel plan:** Download the latest N videos from each channel, with a per-channel plan and robust resume. Each channel is removed from the plan as it completes. The plan cache is deleted when all channels are done.
- **Fast mode with early exit:** When a limit is set, the tool scans channels and songs in order, downloads immediately when a match is found, and stops as soon as the limit is reached with successful downloads. This is much faster for small limits than the full pre-scan approach.
- **Deduplication across channels:** Tracks unique song keys (artist + normalized title) to ensure the same song is not downloaded from multiple channels, even if it appears in more than one channel's video list.
- **Fuzzy matching:** Uses string similarity algorithms to find approximate matches between songlist entries and video titles, tolerating minor differences, typos, or extra words like "Karaoke" or "Official Video".
- **Default channel file:** For songlist-only and latest-per-channel modes, if no `--file` is specified, the tool automatically uses `data/channels.txt` as the default channel list, so the file path does not need to be repeated.
- **Robust interruption handling:** Progress is saved after each download, and files are checked for existence before downloading to prevent re-downloads if the process is interrupted.
- **Optimized scanning algorithm:** High-performance channel scanning with O(n×m) complexity, pre-processed song lookups using sets and dictionaries, and early termination for faster matching of large songlists and channels.
- **Enhanced cache management:** Improved channel cache key handling for better cache hit rates and reduced YouTube API calls.
- **Robust download plan execution:** Fixed index management in download plan execution to prevent errors during interrupted downloads.

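The deduplication idea from the list above (a unique key built from artist and normalized title, tracked in a set) can be sketched as follows. The normalization rule and the names `song_key` and `dedupe_across_channels` are assumptions for illustration; the tool's own normalization may be stricter or looser.

```python
import re
from typing import Iterable, List, Set, Tuple

Candidate = Tuple[str, str, str]  # (channel, artist, title)


def song_key(artist: str, title: str) -> str:
    """Build a comparison key from lowercased, punctuation-free artist and title."""
    def norm(text: str) -> str:
        return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

    return f"{norm(artist)}|{norm(title)}"


def dedupe_across_channels(candidates: Iterable[Candidate]) -> List[Candidate]:
    """Keep the first occurrence of each song, regardless of channel."""
    seen: Set[str] = set()
    unique: List[Candidate] = []
    for channel, artist, title in candidates:
        key = song_key(artist, title)
        if key in seen:
            continue  # already planned from an earlier channel
        seen.add(key)
        unique.append((channel, artist, title))
    return unique


candidates = [
    ("ChannelA", "Taylor Swift", "Cruel Summer"),
    ("ChannelB", "taylor swift", "Cruel Summer!"),
    ("ChannelB", "Billie Eilish", "Happier Than Ever"),
]
unique = dedupe_across_channels(candidates)
print(unique)
```

Because set membership checks take constant time on average, each candidate is handled without rescanning the songs already planned.
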
## 🐞 Troubleshooting
- Ensure `yt-dlp.exe` is in the `downloader/` folder
- Check `logs/` for error details
- Use `python -m karaoke_downloader.check_resolution` to verify video quality
- If you see errors about ffmpeg/ffprobe, install [ffmpeg](https://ffmpeg.org/download.html) and ensure it is in your PATH
- For best fuzzy matching, install rapidfuzz: `pip install rapidfuzz` (otherwise falls back to slower, less accurate difflib)

---

**Happy Karaoke! 🎤**