# 🎤 Karaoke Video Downloader

A Python-based cross-platform CLI tool to download karaoke videos from YouTube channels/playlists using `yt-dlp`, with advanced tracking, songlist prioritization, and flexible configuration. Supports Windows and macOS with automatic platform detection.

## ✨ Features
- 🎵 **Channel & Playlist Downloads**: Download all videos from a YouTube channel or playlist
- 📂 **Organized Storage**: Each channel gets its own folder in `downloads/`
- 📝 **Robust Tracking**: Tracks all downloads, statuses, and formats in JSON
- 🏆 **Songlist Prioritization**: Prioritize or restrict downloads to a custom songlist
- 🔄 **Batch Saving & Caching**: Efficient, minimizes API calls
- 🏷️ **ID3 Tagging**: Adds artist/title metadata to MP4 files
- 🧹 **Automatic Cleanup**: Removes extra yt-dlp files
- 📈 **Real-Time Progress**: Detailed console and log output
- 🧹 **Reset/Clear Channel**: Reset all tracking and files for a channel, or clear channel cache via CLI
- 🗂️ **Latest-per-channel download**: Download the latest N videos from each channel in a single batch, with server deduplication, fuzzy matching support, per-channel download plan, robust resume, and unique plan cache. Use `--latest-per-channel` and `--limit N`.
- 🧩 **Enhanced Fuzzy Matching**: Advanced fuzzy string matching for songlist-to-video matching with improved video title parsing (handles multiple title formats like "Title Karaoke | Artist Karaoke Version")
- ⚡ **Fast Mode with Early Exit**: When a limit is set, scans channels and songs in order, downloads immediately when a match is found, and stops as soon as the limit is reached with successful downloads
- 🔄 **Deduplication Across Channels**: Ensures the same song is not downloaded from multiple channels, even if it appears in more than one channel's video list
- 📋 **Default Channel File**: Automatically uses `data/channels.txt` as the default channel list for songlist modes (no need to specify `--file` every time)
- 🛡️ **Robust Interruption Handling**: Progress is saved after each download, preventing re-downloads if the process is interrupted
- ⚡ **Optimized Scanning**: High-performance channel scanning with O(n×m) complexity, pre-processed lookups, and early termination for faster matching
- 🏷️ **Server Duplicates Tracking**: Automatically checks against local songs.json file and marks duplicates for future skipping, preventing re-downloads of songs already on the server
- ⚡ **Parallel Downloads**: Enable concurrent downloads with `--parallel --workers N` for significantly faster batch downloads (3-5x speedup)
- 📊 **Unmatched Songs Reports**: Generate detailed reports of songs that couldn't be found in any channel with `--generate-unmatched-report`
- 🛡️ **Duplicate File Prevention**: Automatically detects and prevents duplicate files with `(2)`, `(3)` suffixes, with cleanup utility for existing duplicates
- 🏷️ **Consistent Metadata**: Filename and ID3 tag use identical artist/title format for clear file identification
- 🍎 **macOS Support**: Automatic platform detection and setup with native macOS binaries and FFmpeg integration

## 🏗️ Architecture
The codebase has been comprehensively refactored into a modular architecture with centralized utilities for improved maintainability, error handling, and code reuse:

### Core Modules:
- **`downloader.py`**: Main orchestrator and CLI interface
- **`video_downloader.py`**: Core video download execution and orchestration
- **`tracking_manager.py`**: Download tracking and status management
- **`download_planner.py`**: Download plan building and channel scanning
- **`cache_manager.py`**: Cache operations and file I/O management
- **`channel_manager.py`**: Channel and file management operations
- **`songlist_manager.py`**: Songlist operations and tracking
- **`server_manager.py`**: Server song availability checking
- **`fuzzy_matcher.py`**: Fuzzy matching logic and similarity functions

### Utility Modules (v3.2):
- **`youtube_utils.py`**: Centralized YouTube operations and yt-dlp command generation
- **`error_utils.py`**: Standardized error handling and formatting
- **`download_pipeline.py`**: Abstracted download → verify → tag → track pipeline
- **`id3_utils.py`**: ID3 tagging utilities
- **`config_manager.py`**: Configuration management
- **`resolution_cli.py`**: Resolution checking utilities
- **`tracking_cli.py`**: Tracking management CLI

### New Utility Modules (v3.3):
- **`file_utils.py`**: Centralized file operations, filename sanitization, and file validation
- **`song_validator.py`**: Centralized song validation logic for checking if songs should be downloaded

### **Unified Download Workflow (v3.4.5)**
- **`execute_unified_download_workflow()`**: Centralized download execution that all modes use
- **`_execute_sequential_downloads()`**: Sequential download execution using DownloadPipeline
- **`_execute_parallel_downloads()`**: Parallel download execution using ParallelDownloader

### **Benefits of Enhanced Modular Architecture:**
- **Single Responsibility**: Each module has a focused purpose
- **Centralized Utilities**: Common operations (file operations, song validation, yt-dlp commands, error handling) are centralized
- **Reduced Duplication**: Eliminated ~150 lines of code duplication across modules
- **Testability**: Individual components can be tested separately
- **Maintainability**: Easier to find and fix issues
- **Reusability**: Components can be used independently
- **Robustness**: Better error handling and interruption recovery
- **Consistency**: Standardized error messages and processing pipelines
- **Type Safety**: Comprehensive type hints across all new modules
- **Unified Execution**: All download modes use the same execution pipeline for consistency

## 🔧 Development Guidelines

### **Adding New Download Modes**
When adding new download modes, follow the unified workflow pattern to ensure consistency:

#### **1. Build Download Plan (Mode-Specific)**
```python
# Template only: replace `...` with the mode's own parameters. The names
# videos_to_download, artist, title, filename, channel_name, force_download,
# cache_file, and limit stand for values produced by the mode-specific logic.
def download_new_mode(self, ...):
    # Build download plan with standard structure
    download_plan = []
    for video in videos_to_download:
        download_plan.append({
            "video_id": video["id"],
            "artist": artist,
            "title": title,
            "filename": filename,
            "channel_name": channel_name,
            "video_title": video["title"],
            "force_download": force_download
        })

    # Use unified execution workflow
    downloaded_count, success = self.execute_unified_download_workflow(
        download_plan=download_plan,
        cache_file=cache_file,
        limit=limit,
        show_progress=True,
    )

    return success
```

#### **2. Key Principles**
- **NEVER implement custom download execution logic** - always use `execute_unified_download_workflow()`
- **Focus on download plan building** - that's where mode-specific logic belongs
- **Use the standard download plan structure** for consistency
- **Implement cache file handling** for progress tracking and resume functionality
- **Test with both sequential and parallel modes** to ensure compatibility

#### **3. Benefits of Unified Architecture**
- **Consistency**: All modes behave identically for execution, progress tracking, and error handling
- **Automatic Features**: New modes automatically get parallel downloads, progress tracking, and cache management
- **Maintainability**: Changes to download execution only need to be made in one place
- **Reliability**: Eliminates broken pipelines and inconsistent behavior between modes

## 🔧 Recent Improvements (v3.4.1)
### **Enhanced Fuzzy Matching**
- **Improved title parsing**: Enhanced `extract_artist_title` function to handle multiple video title formats
- **Better matching accuracy**: Can now parse titles like "Hold On Loosely Karaoke | 38 Special Karaoke Version"
- **Consistent parsing**: All modules now use the same parsing logic from `fuzzy_matcher.py`
- **Reduced false negatives**: Songs that previously couldn't be matched due to title format differences now have a higher chance of being found

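As a rough illustration of the title parsing and matching described above, the sketch below handles the "Title Karaoke | Artist Karaoke Version" format and the common "Artist - Title" format, then scores two strings against each other. It is not the project's actual `extract_artist_title` from `fuzzy_matcher.py`: the helper names `parse_video_title` and `similarity`, the regular expression, and the scoring choice are assumptions made for illustration. It uses `rapidfuzz` when installed and falls back to `difflib`.

```python
import re
from difflib import SequenceMatcher
from typing import Tuple

try:  # rapidfuzz is optional; difflib is the fallback
    from rapidfuzz import fuzz
except ImportError:
    fuzz = None

# Trailing "Karaoke" / "(Karaoke Version)" markers (hypothetical pattern)
_KARAOKE_SUFFIX = re.compile(r"\s*\(?karaoke(\s+version)?\)?\s*$", re.IGNORECASE)


def _strip_karaoke(text: str) -> str:
    """Drop a trailing karaoke marker and surrounding whitespace."""
    return _KARAOKE_SUFFIX.sub("", text).strip()


def parse_video_title(video_title: str) -> Tuple[str, str]:
    """Return (artist, title); artist is "" when it cannot be determined."""
    if "|" in video_title:
        # "Title Karaoke | Artist Karaoke Version"
        title_part, artist_part = video_title.split("|", 1)
        return _strip_karaoke(artist_part), _strip_karaoke(title_part)
    if " - " in video_title:
        # "Artist - Title (Karaoke Version)"
        artist_part, title_part = video_title.split(" - ", 1)
        return artist_part.strip(), _strip_karaoke(title_part)
    return "", _strip_karaoke(video_title)


def similarity(a: str, b: str) -> float:
    """Score two strings from 0 to 100, case-insensitively."""
    a, b = a.lower(), b.lower()
    if fuzz is not None:
        return float(fuzz.ratio(a, b))
    return SequenceMatcher(None, a, b).ratio() * 100
```

With a threshold such as `--fuzzy-threshold 85`, a songlist entry would count as matched when `similarity(song_title, parsed_title) >= 85`.
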
### **Fixed Import Conflicts**
- **Resolved import conflicts**: Updated modules to use the enhanced `extract_artist_title` from `fuzzy_matcher.py`
- **Consistent behavior**: All parts of the system use the same parsing logic
- **Cleaner codebase**: Eliminated duplicate code and import conflicts

### **Fixed --limit Parameter**
- **Correct limit application**: The `--limit` parameter now properly limits the scanning phase, not just downloads
- **Improved performance**: When using `--limit N`, only the first N songs are scanned, significantly reducing processing time
- **Accurate logging**: Logging messages now show the correct counts for songs that will actually be processed when using `--limit`

### **Code Quality Improvements**
- **Eliminated duplicate functions**: Removed duplicate `extract_artist_title` implementations
- **Fixed import conflicts**: Resolved inconsistencies between different parsing implementations
- **Single source of truth**: All title parsing logic is now centralized in `fuzzy_matcher.py`

## 🔧 Recent Improvements (v3.4.5)
### **Unified Download Workflow Architecture**
- **Unified execution pipeline**: All download modes now use the same execution workflow, eliminating inconsistencies and broken pipelines
- **Consistent behavior**: All modes (`--channel-focus`, `--all-videos`, `--songlist-only`, `--latest-per-channel`) use identical download execution, progress tracking, and error handling
- **Centralized download logic**: Single `execute_unified_download_workflow()` method handles all download execution
- **Automatic parallel support**: All download modes automatically support `--parallel --workers N` without additional implementation
- **Unified cache management**: Consistent progress tracking and resume functionality across all modes

### **What Was Fixed**
- **Broken Pipeline**: Previously, different modes used different execution paths, leading to inconsistencies
- **Missing Method**: Added missing `download_latest_per_channel()` method that was referenced in CLI but not implemented
- **Code Duplication**: Eliminated duplicate download execution logic across different modes
- **Inconsistent Behavior**: All modes now have identical progress tracking, error handling, and cache management

### **Benefits**
- ✅ **Consistency**: All modes behave identically for execution, progress tracking, and error handling
- ✅ **Maintainability**: Changes to download execution only need to be made in one place
- ✅ **Reliability**: Eliminates broken pipelines and inconsistent behavior between modes
- ✅ **Extensibility**: New modes automatically get all existing features (parallel downloads, progress tracking, etc.)
- ✅ **Testing**: Easier to test since all modes use the same execution logic

## 🛡️ Duplicate File Prevention & Filename Consistency (v3.4.2)
### **Duplicate File Prevention**
- **Enhanced file existence checking**: Now detects files with `(2)`, `(3)`, etc. suffixes that yt-dlp creates
- **Automatic duplicate prevention**: Skips downloads when files already exist (including duplicates)
- **Updated yt-dlp configuration**: Set `"nooverwrites": false` to prevent yt-dlp from creating duplicate files
- **Cleanup utility**: `data/cleanup_duplicate_files.py` helps identify and remove existing duplicate files

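The existence check described above could look roughly like the following sketch. It is not the project's `check_file_exists_with_patterns()`; the function name `find_existing_copies` and the exact suffix pattern (a space followed by a number in parentheses) are assumptions for illustration.

```python
import re
from pathlib import Path
from typing import List


def find_existing_copies(folder: Path, filename: str) -> List[Path]:
    """Return files named `filename` or its "(2)", "(3)", ... variants."""
    stem, suffix = Path(filename).stem, Path(filename).suffix
    # Matches "<stem>.mp4" as well as "<stem> (2).mp4", "<stem> (3).mp4", ...
    pattern = re.compile(re.escape(stem) + r"( \(\d+\))?" + re.escape(suffix))
    if not folder.is_dir():
        return []
    return sorted(
        p for p in folder.iterdir() if p.is_file() and pattern.fullmatch(p.name)
    )
```

A download would be skipped whenever the returned list is non-empty, and any entry other than the plain filename is a candidate for cleanup.
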
### **Filename vs ID3 Tag Consistency**
- **Consistent metadata**: Filename and ID3 tag now use identical artist/title format
- **Removed extra suffixes**: No more "(Karaoke Version)" in ID3 tags that don't match filenames
- **Unified parsing**: Both filename generation and ID3 tagging use the same artist/title extraction

### **Benefits**
- ✅ **No more duplicate files** with `(2)`, `(3)` suffixes
- ✅ **Consistent metadata** between filename and ID3 tags
- ✅ **Efficient disk usage** by preventing unnecessary downloads
- ✅ **Clear file identification** with consistent naming

### **Clean Up Existing Duplicates**
```bash
# Run the cleanup utility to find and remove existing duplicates
python data/cleanup_duplicate_files.py

# Choose option 1 for dry run (recommended first)
# Choose option 2 to actually delete duplicates
```

## 📋 Requirements
- **Windows 10/11 or macOS 10.14+**
- **Python 3.7+**
- **yt-dlp binary** (platform-specific, see setup instructions below)
- **mutagen** (for ID3 tagging, optional)
- **ffmpeg/ffprobe** (for video validation, optional but recommended)
- **rapidfuzz** (for fuzzy matching, optional, falls back to difflib)

## 🍎 macOS Setup

### Automatic Setup (Recommended)
Run the macOS setup script to automatically set up yt-dlp and FFmpeg:

```bash
python3 setup_macos.py
```

This script will:
- Detect your macOS version
- Offer installation options for yt-dlp (pip or binary download)
- Install FFmpeg via Homebrew
- Test the installation

### Manual Setup
If you prefer to set up manually:

#### Option 1: Install yt-dlp via pip
```bash
pip3 install yt-dlp
```

#### Option 2: Download yt-dlp binary
```bash
mkdir -p downloader
curl -L -o downloader/yt-dlp_macos https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp_macos
chmod +x downloader/yt-dlp_macos
```

#### Install FFmpeg
```bash
brew install ffmpeg
```

### Test Installation
```bash
python3 src/tests/test_macos.py
```

## 🚀 Quick Start

> **💡 Pro Tip**: For a complete list of all available commands, see `commands.txt` - you can copy/paste any command directly into your terminal!

### Download a Channel
```bash
python download_karaoke.py https://www.youtube.com/@SingKingKaraoke/videos
```

### Download ALL Videos from a Channel (Not Just Songlist Matches)
```bash
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos
```

### Download ALL Videos with Parallel Processing
```bash
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --parallel --workers 10
```

### Download ALL Videos with Limit
```bash
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --limit 100
```

### Download Only Songlist Songs (Fast Mode)
```bash
python download_karaoke.py --songlist-only --limit 5
```

### Download with Parallel Processing
```bash
python download_karaoke.py --parallel --songlist-only --limit 10
```

### Focus on Specific Playlists by Title
```bash
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100"
```

### Focus on Specific Playlists from Custom File
```bash
python download_karaoke.py --songlist-focus "CCKaraoke" --songlist-file "data/my_custom_songlist.json"
```

### Force Download from Channels (Bypass All Existing File Checks)
```bash
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --force
```

### Download with Fuzzy Matching
```bash
python download_karaoke.py --songlist-only --limit 10 --fuzzy-match --fuzzy-threshold 85
```

### Test Download Plan (Dry Run)
```bash
python download_karaoke.py --songlist-only --limit 5 --dry-run
```

### Test Channel Download Plan (Dry Run)
```bash
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --limit 10 --dry-run
```

### Download Latest N Videos Per Channel
```bash
python download_karaoke.py --latest-per-channel --limit 5
```

### Download Latest N Videos Per Channel (with fuzzy matching)
```bash
python download_karaoke.py --latest-per-channel --limit 5 --fuzzy-match --fuzzy-threshold 85
```

### Prioritize Songlist in Download Queue
```bash
python download_karaoke.py --songlist-priority
```

### Show Songlist Download Progress
```bash
python download_karaoke.py --songlist-status
```

### Limit Number of Downloads
```bash
python download_karaoke.py --limit 5
```

### Override Resolution
```bash
python download_karaoke.py --resolution 1080p
```

### **Reset/Start Over for a Channel**
```bash
python download_karaoke.py --reset-channel SingKingKaraoke
```

### **Reset Channel and Songlist Songs**
```bash
python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist
```

### **Clear Channel Cache**
```bash
python download_karaoke.py --clear-cache SingKingKaraoke
python download_karaoke.py --clear-cache all
```

## 🧠 Songlist Integration
- Place your prioritized song list in `data/songList.json` (see example format below).
- The tool will match and prioritize these songs across all available channel videos.
- Use `--songlist-only` to download only these songs, or `--songlist-priority` to prioritize them in the queue.
- Use `--songlist-focus` to download only songs from specific playlists by title (e.g., `--songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100"`).
- Download progress for the songlist is tracked globally in `data/songlist_tracking.json`.

#### Example `data/songList.json`
```json
[
  {
    "title": "2025 - Apple Top 50",
    "songs": [
      { "artist": "Kendrick Lamar & SZA", "title": "luther", "position": 1 },
      { "artist": "Kendrick Lamar", "title": "Not Like Us", "position": 2 }
    ]
  },
  {
    "title": "2024 - Billboard Hot 100",
    "songs": [
      { "artist": "Taylor Swift", "title": "Cruel Summer", "position": 1 },
      { "artist": "Billie Eilish", "title": "Happier Than Ever", "position": 2 }
    ]
  }
]
```

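To make the songlist format concrete, here is a minimal sketch of reading it and narrowing it to the playlists named with `--songlist-focus`. The function names `load_songlist` and `select_songs` are hypothetical, and the exact-title comparison and cross-playlist de-duplication are assumptions; the project's `songlist_manager.py` may behave differently.

```python
import json
from typing import Dict, List, Optional


def load_songlist(path: str = "data/songList.json") -> List[Dict]:
    """Read the list of playlists from the songlist file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def select_songs(
    playlists: List[Dict], focus_titles: Optional[List[str]] = None
) -> List[Dict]:
    """Flatten playlists into songs, optionally keeping only the named playlists."""
    wanted = set(focus_titles) if focus_titles else None
    seen = set()
    songs = []
    for playlist in playlists:
        if wanted is not None and playlist["title"] not in wanted:
            continue
        for song in sorted(playlist["songs"], key=lambda s: s.get("position", 0)):
            key = (song["artist"].lower(), song["title"].lower())
            if key in seen:  # assumption: a song listed twice is only wanted once
                continue
            seen.add(key)
            songs.append(song)
    return songs
```

Calling `select_songs(load_songlist(), ["2025 - Apple Top 50"])` on the example file would yield only the two Kendrick Lamar entries, in position order.
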
## 🛠️ Tracking & Caching
- **data/karaoke_tracking.json**: Tracks all downloads, statuses, and formats
- **data/songlist_tracking.json**: Tracks global songlist download progress
- **data/server_duplicates_tracking.json**: Tracks songs found to be duplicates on the server for future skipping
- **data/channel_cache.json**: Caches channel video lists for performance

## 📂 Folder Structure
```
KaroakeVideoDownloader/
├── commands.txt                # Complete CLI commands reference (copy/paste ready)
├── karaoke_downloader/         # All core Python code and utilities
│   ├── downloader.py           # Main orchestrator and CLI interface
│   ├── cli.py                  # CLI entry point
│   ├── video_downloader.py     # Core video download execution and orchestration
│   ├── tracking_manager.py     # Download tracking and status management
│   ├── download_planner.py     # Download plan building and channel scanning
│   ├── cache_manager.py        # Cache operations and file I/O management
│   ├── channel_manager.py      # Channel and file management operations
│   ├── songlist_manager.py     # Songlist operations and tracking
│   ├── server_manager.py       # Server song availability checking
│   ├── fuzzy_matcher.py        # Fuzzy matching logic and similarity functions
│   ├── youtube_utils.py        # Centralized YouTube operations and yt-dlp commands
│   ├── error_utils.py          # Standardized error handling and formatting
│   ├── download_pipeline.py    # Abstracted download → verify → tag → track pipeline
│   ├── id3_utils.py            # ID3 tagging utilities
│   ├── config_manager.py       # Configuration management with dataclasses
│   ├── file_utils.py           # Centralized file operations and filename handling
│   ├── song_validator.py       # Centralized song validation logic
│   ├── check_resolution.py     # Resolution checker utility
│   ├── resolution_cli.py       # Resolution config CLI
│   └── tracking_cli.py         # Tracking management CLI
├── data/                       # All config, tracking, cache, and songlist files
│   ├── config.json
│   ├── karaoke_tracking.json
│   ├── songlist_tracking.json
│   ├── channel_cache.json
│   ├── channels.txt
│   └── songList.json
├── downloads/                  # All video output
│   └── [ChannelName]/          # Per-channel folders
├── logs/                       # Download logs
├── downloader/yt-dlp.exe       # yt-dlp binary (Windows)
├── downloader/yt-dlp_macos     # yt-dlp binary (macOS)
├── src/tests/                  # Test scripts
│   ├── test_macos.py           # macOS setup and functionality tests
│   └── test_platform.py        # Platform detection tests
├── download_karaoke.py         # Main entry point (thin wrapper)
├── README.md
├── PRD.md
├── requirements.txt
└── download_karaoke.bat        # (optional Windows launcher)
```

## 🚦 CLI Options

> **📋 Complete Command Reference**: See `commands.txt` for all available commands with examples - perfect for copy/paste!

### Key Options:
- `--file <data/channels.txt>`: Download from a list of channels (optional, defaults to `data/channels.txt` for songlist modes)
- `--songlist-priority`: Prioritize songlist songs in download queue
- `--songlist-only`: Download only songs from the songlist
- `--songlist-focus <PLAYLIST_TITLE1> <PLAYLIST_TITLE2>...`: Focus on specific playlists by title (e.g., `--songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100"`)
- `--songlist-file <FILE_PATH>`: Custom songlist file path to use with `--songlist-focus` (default: `data/songList.json`)
- `--songlist-status`: Show songlist download progress
- `--limit <N>`: Limit number of downloads (enables fast mode with early exit)
- `--resolution <720p|1080p|...>`: Override resolution
- `--status`: Show download/tracking status
- `--reset-channel <CHANNEL_NAME>`: **Reset all tracking and files for a channel**
- `--reset-songlist`: **When used with `--reset-channel`, also reset songlist songs for this channel**
- `--clear-cache <CHANNEL_ID|all>`: **Clear channel video cache for a specific channel or all**
- `--clear-server-duplicates`: **Clear server duplicates tracking (allows re-checking songs against server)**
- `--latest-per-channel`: **Download the latest N videos from each channel (use with `--limit`)**
- `--fuzzy-match`: Enable fuzzy matching for songlist-to-video matching (uses rapidfuzz if available)
- `--fuzzy-threshold <N>`: Fuzzy match threshold (0-100, default 85)
- `--parallel`: Enable parallel downloads for improved speed (defaults to 3 workers)
- `--workers <N>`: Number of parallel download workers (1-10, default: 3, only used with `--parallel`)
- `--generate-songlist <DIR1> <DIR2>...`: **Generate song list from MP4 files with ID3 tags in specified directories**
- `--no-append-songlist`: **Create a new song list instead of appending when using `--generate-songlist`**
- `--force`: **Force download from channels, bypassing all existing file checks and re-downloading if necessary**
- `--channel-focus <CHANNEL_NAME>`: **Download from a specific channel by name (e.g., 'SingKingKaraoke')**
- `--all-videos`: **Download all videos from channel (not just songlist matches), skipping existing files**
- `--dry-run`: **Build download plan and show what would be downloaded without actually downloading anything**

## 📝 Example Usage

> **💡 For complete examples**: See `commands.txt` for all command variations with explanations!

```bash
# Fast mode with fuzzy matching (no need to specify --file)
python download_karaoke.py --songlist-only --limit 10 --fuzzy-match --fuzzy-threshold 85

# Parallel downloads for faster processing
python download_karaoke.py --parallel --songlist-only --limit 10

# Latest videos per channel with parallel downloads
python download_karaoke.py --parallel --latest-per-channel --limit 5

# Traditional full scan (no limit)
python download_karaoke.py --songlist-only

# Focused fuzzy matching (target specific playlists with flexible matching)
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --fuzzy-match --fuzzy-threshold 80 --limit 10

# Focus on specific playlists from a custom file
python download_karaoke.py --songlist-focus "CCKaraoke" --songlist-file "data/my_custom_songlist.json" --limit 10

# Force download with fuzzy matching (bypass all existing file checks)
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --force --fuzzy-match --fuzzy-threshold 80 --limit 10

# Channel-specific operations
python download_karaoke.py --reset-channel SingKingKaraoke
python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist
python download_karaoke.py --clear-cache all
python download_karaoke.py --clear-server-duplicates

# Download ALL videos from a specific channel
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --parallel --workers 10
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --limit 100

# Song list generation from MP4 files
python download_karaoke.py --generate-songlist /path/to/mp4/directory
python download_karaoke.py --generate-songlist /path/to/dir1 /path/to/dir2 --no-append-songlist

# Generate report of songs that couldn't be found
python download_karaoke.py --generate-unmatched-report
python download_karaoke.py --generate-unmatched-report --fuzzy-match --fuzzy-threshold 85
```

## 🏷️ ID3 Tagging
- Adds artist/title/album/genre to MP4 files using mutagen (if installed)

## 📋 Song List Generation
- **Generate song lists from existing MP4 files**: Use `--generate-songlist` to create song lists from directories containing MP4 files with ID3 tags
- **Automatic ID3 extraction**: Extracts artist and title from MP4 files' ID3 tags
- **Directory-based organization**: Each directory becomes a playlist with the directory name as the title
- **Position tracking**: Songs are numbered starting from 1 based on file order
- **Append or replace**: Choose to append to existing song list or create a new one with `--no-append-songlist`
- **Multiple directories**: Process multiple directories in a single command

## 🧹 Cleanup
- Removes `.info.json` and `.meta` files after download

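The cleanup step can be pictured with a short sketch like the one below. It is only an illustration of removing the two side-file types named above from a channel folder; the function name `remove_leftover_files` is hypothetical and this is not the project's `cleanup_temp_files()`.

```python
from pathlib import Path
from typing import List

# Side-file suffixes named in the cleanup step
LEFTOVER_SUFFIXES = (".info.json", ".meta")


def remove_leftover_files(folder: Path) -> List[str]:
    """Delete yt-dlp side files in `folder` and return the removed file names."""
    removed = []
    for path in sorted(folder.iterdir()):
        if path.is_file() and path.name.endswith(LEFTOVER_SUFFIXES):
            path.unlink()
            removed.append(path.name)
    return removed
```

Returning the removed names keeps the step easy to log and to check in a dry run.
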
## 🛠️ Configuration
- All options are in `data/config.json` (format, resolution, metadata, etc.)
- You can edit this file or use CLI flags to override

## 📋 Command Reference File

**`commands.txt`** contains a comprehensive list of all CLI commands with explanations. This file is designed for easy copy/paste usage and includes:
- All basic download commands
- Songlist operations
- Latest-per-channel downloads
- Cache and tracking management
- Reset and cleanup operations
- Advanced combinations
- Common workflows
- Troubleshooting commands

> **🔄 Maintenance Note**: The `commands.txt` file should be kept up to date with any CLI changes. When adding new command-line options or modifying existing ones, update this file to reflect all available commands and their usage.

## 📚 Documentation Standards

### **Documentation Location**
- **All changes, refactoring, and improvements should be documented in the PRD.md and README.md files**
- **Do NOT create separate .md files for documenting changes, refactoring, or improvements**
- **Use the existing sections in PRD.md and README.md to track all project evolution**

### **Where to Document Changes**
- **PRD.md**: Technical details, architecture changes, bug fixes, and implementation specifics
- **README.md**: User-facing features, usage instructions, and high-level improvements
- **CHANGELOG.md**: Version-specific release notes and change summaries

### **Documentation Requirements**
- **All new features must be documented in both PRD.md and README.md**
- **All refactoring efforts must be documented in the appropriate sections**
- **All bug fixes must be documented with technical details**
- **Version numbers and dates should be clearly marked**
- **Benefits and improvements should be explicitly stated**

### **Maintenance Responsibility**
- **Keep PRD.md and README.md synchronized with code changes**
- **Update documentation immediately when implementing new features**
- **Remove outdated information and consolidate related changes**
- **Ensure all CLI options and features are documented in both files**

## 🔧 Refactoring Improvements (v3.3)
The codebase has been comprehensively refactored to improve maintainability and reduce code duplication. Recent improvements have enhanced reliability, performance, and code organization:

### **New Utility Modules (v3.3)**
- **`file_utils.py`**: Centralized file operations, filename sanitization, and file validation
  - `sanitize_filename()`: Create safe filenames from artist/title
  - `generate_possible_filenames()`: Generate filename patterns for different modes
  - `check_file_exists_with_patterns()`: Check for existing files using multiple patterns
  - `is_valid_mp4_file()`: Validate MP4 files with header checking
  - `cleanup_temp_files()`: Remove temporary yt-dlp files
  - `ensure_directory_exists()`: Safe directory creation

- **`song_validator.py`**: Centralized song validation logic
  - `SongValidator` class: Unified logic for checking if songs should be downloaded
  - `should_skip_song()`: Comprehensive validation with multiple criteria
  - `mark_song_failed()`: Consistent failure tracking
  - `handle_download_failure()`: Standardized error handling

- **Enhanced `config_manager.py`**: Robust configuration management with dataclasses
  - `ConfigManager` class: Type-safe configuration loading and caching
  - `DownloadSettings`, `FolderStructure`, `LoggingConfig` dataclasses
  - Configuration validation and merging with defaults
  - Dynamic resolution updates

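The "merging with defaults" idea behind the dataclass-based configuration can be sketched as follows. Only the class name `DownloadSettings` comes from the list above; its fields, their default values, and the helper `load_download_settings` are assumptions for illustration and do not reflect the actual schema of `data/config.json`.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass
class DownloadSettings:
    # Illustrative fields only; the project's real settings may differ
    format: str = "mp4"
    resolution: str = "720p"
    write_metadata: bool = False


def load_download_settings(raw: Dict[str, Any]) -> DownloadSettings:
    """Overlay user-supplied values on the defaults, ignoring unknown keys."""
    known = {f.name for f in fields(DownloadSettings)}
    return DownloadSettings(**{k: v for k, v in raw.items() if k in known})
```

Because every field has a default, a partial or empty config still produces a complete settings object, and a CLI override such as `--resolution 1080p` can be applied by passing `{"resolution": "1080p"}`.
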
### **Benefits Achieved**
- **Eliminated Code Duplication**: ~150 lines of duplicate code removed across modules
- **Centralized File Operations**: Single source of truth for filename handling and file validation
- **Unified Song Validation**: Consistent logic for checking if songs should be downloaded
- **Enhanced Type Safety**: Comprehensive type hints across all new modules
- **Improved Configuration Management**: Structured configuration with validation and caching
- **Better Error Handling**: Consistent patterns via centralized utilities
- **Enhanced Maintainability**: Changes to file operations or song validation only require updates in one place
- **Improved Testability**: Modular components can be tested independently
- **Better Developer Experience**: Clear function signatures and comprehensive documentation

### **New Parallel Download System (v3.4)**
- **Parallel downloader module:** `parallel_downloader.py` provides thread-safe concurrent download management
- **Configurable concurrency:** Use `--parallel` to enable parallel downloads with 3 workers by default, or `--parallel --workers N` for custom worker count (1-10)
- **Thread-safe operations:** All tracking, caching, and progress operations are thread-safe
- **Real-time progress tracking:** Shows active downloads, completion status, and overall progress
- **Automatic retry mechanism:** Failed downloads are automatically retried with reduced concurrency
- **Backward compatibility:** Sequential downloads remain the default when `--parallel` is not used
- **Performance improvements:** Significantly faster downloads for large batches (3-5x speedup with 3-5 workers)
- **Integrated with all modes:** Works with both songlist-across-channels and latest-per-channel download modes

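
The general shape of a thread-safe worker pool like the one described above can be sketched with the standard library. This is not the code in `parallel_downloader.py`; the class and function names are hypothetical, and only the default of 3 workers and the 1-10 range are taken from this README. The retry pass with reduced concurrency is left out to keep the example short.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional

DEFAULT_WORKERS = 3   # default stated in this README
MAX_WORKERS = 10      # upper bound stated in this README


def clamp_workers(requested: Optional[int] = None) -> int:
    """Return a worker count within the supported 1-10 range."""
    if requested is None:
        return DEFAULT_WORKERS
    return max(1, min(MAX_WORKERS, requested))


class ProgressTracker:
    """Hypothetical tracker: a lock guards the shared results dict."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.results: Dict[str, bool] = {}

    def record(self, video_id: str, ok: bool) -> None:
        with self._lock:
            self.results[video_id] = ok


def download_all(
    video_ids: Iterable[str],
    download_one: Callable[[str], bool],
    workers: Optional[int] = None,
) -> ProgressTracker:
    """Run download_one for every ID on a thread pool and record each outcome."""
    tracker = ProgressTracker()

    def _job(video_id: str) -> None:
        # Runs inside a worker thread, so recording must go through the lock.
        try:
            ok = bool(download_one(video_id))
        except Exception:
            ok = False
        tracker.record(video_id, ok)

    with ThreadPoolExecutor(max_workers=clamp_workers(workers)) as pool:
        list(pool.map(_job, video_ids))  # wait for all jobs to finish
    return tracker
```

In this sketch `download_one` stands in for whatever function performs a single yt-dlp download; a failure in one job is recorded and does not stop the others.
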
### **Previous Improvements (v3.2)**
- **Centralized yt-dlp Command Generation**: Standardized command building and execution across all download operations
- **Enhanced Error Handling**: Structured exception hierarchy with consistent error messages and formatting
- **Abstracted Download Pipeline**: Reusable download → verify → tag → track process for consistent processing
- **Download plan pre-scan:** Before downloading, the tool scans all channels for songlist matches, builds a download plan, and prints stats (matches, unmatched, per-channel breakdown). The plan is cached for 1 day and reused unless `--force-download-plan` is set.
- **Latest-per-channel plan:** Download the latest N videos from each channel, with a per-channel plan and robust resume. Each channel is removed from the plan as it completes. Plan cache is deleted when all channels are done.
- **Fast mode with early exit:** When a limit is set, the tool scans channels and songs in order, downloads immediately when a match is found, and stops as soon as the limit is reached with successful downloads. This provides much faster performance for small limits compared to the full pre-scan approach.
- **Deduplication across channels:** Tracks unique song keys (artist + normalized title) to ensure the same song is not downloaded from multiple channels, even if it appears in more than one channel's video list.
- **Fuzzy matching:** Uses string similarity algorithms to find approximate matches between songlist entries and video titles, tolerating minor differences, typos, or extra words like "Karaoke" or "Official Video".
- **Default channel file:** For songlist-only and latest-per-channel modes, if no `--file` is specified, automatically uses `data/channels.txt` as the default channel list, reducing the need to specify the file path repeatedly.
- **Robust interruption handling:** Progress is saved after each download, and files are checked for existence before downloading to prevent re-downloads if the process is interrupted.
- **Optimized scanning algorithm:** High-performance channel scanning with O(n×m) complexity, pre-processed song lookups using sets and dictionaries, and early termination for faster matching of large songlists and channels.
- **Enhanced cache management:** Improved channel cache key handling for better cache hit rates and reduced YouTube API calls.
- **Robust download plan execution:** Fixed index management in download plan execution to prevent errors during interrupted downloads.

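
As an illustration of the cross-channel deduplication and early-exit behaviour described above, the sketch below builds a song key from artist and normalized title and stops once a limit is reached. The normalization rules and all function names are assumptions made for this example; the real tool also counts only successful downloads toward the limit, which this sketch does not model.

```python
import re
from typing import Iterable, List, Set, Tuple

Video = Tuple[str, str]  # (artist, title)


def _normalize(text: str) -> str:
    """Lowercase, drop bracketed suffixes and filler words (assumed rules)."""
    text = text.lower()
    text = re.sub(r"\(.*?\)|\[.*?\]", " ", text)           # "(Karaoke Version)" etc.
    text = re.sub(r"\bkaraoke\b|\bversion\b", " ", text)   # common filler words
    text = re.sub(r"[^a-z0-9]+", " ", text)                # punctuation to spaces
    return text.strip()


def song_key(artist: str, title: str) -> str:
    """Unique key for a song, independent of the channel it came from."""
    return f"{_normalize(artist)}|{_normalize(title)}"


def select_downloads(
    channels: Iterable[Tuple[str, List[Video]]], limit: int
) -> List[Tuple[str, str, str]]:
    """Scan channels in order, skip songs already seen, stop at the limit."""
    seen: Set[str] = set()
    picked: List[Tuple[str, str, str]] = []
    for channel, videos in channels:
        for artist, title in videos:
            key = song_key(artist, title)
            if key in seen:
                continue  # same song already taken from an earlier channel
            seen.add(key)
            picked.append((channel, artist, title))
            if len(picked) >= limit:
                return picked  # early exit: no need to scan further
    return picked
```
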
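
The one-day plan cache mentioned above could work along these lines. This is a sketch under assumptions: the cache is taken to be a JSON file whose age is judged by its modification time, and the function names and the `force` parameter (standing in for `--force-download-plan`) are hypothetical.

```python
import json
import time
from pathlib import Path
from typing import Optional

PLAN_MAX_AGE_SECONDS = 24 * 60 * 60  # "cached for 1 day" per this README


def save_plan(path: Path, plan: list) -> None:
    """Write the download plan as JSON, creating the folder if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(plan), encoding="utf-8")


def load_cached_plan(
    path: Path, force: bool = False, now: Optional[float] = None
) -> Optional[list]:
    """Return the cached plan, or None if it is missing, stale, corrupt, or bypassed."""
    if force or not path.is_file():
        return None
    now = time.time() if now is None else now
    if now - path.stat().st_mtime > PLAN_MAX_AGE_SECONDS:
        return None  # older than one day: caller should rebuild the plan
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None  # unreadable cache is treated as a cache miss
```
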
## 🐞 Troubleshooting
- **Windows**: Ensure `yt-dlp.exe` is in the `downloader/` folder
- **macOS**: Run `python3 setup_macos.py` to set up yt-dlp and FFmpeg
- Check `logs/` for error details
- Use `python -m karaoke_downloader.check_resolution` to verify video quality
- If you see errors about ffmpeg/ffprobe, install [ffmpeg](https://ffmpeg.org/download.html) and ensure it is in your PATH
- For best fuzzy matching, install rapidfuzz: `pip install rapidfuzz` (otherwise the tool falls back to the slower, less accurate `difflib`)

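
The rapidfuzz-or-difflib fallback mentioned in the last item can be pictured with a small sketch like the one below. It uses `rapidfuzz.fuzz.ratio` when the package is installed and `difflib.SequenceMatcher` otherwise; the function names and the threshold of 85 are assumptions for illustration, not the tool's actual matching code.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional

try:
    from rapidfuzz import fuzz  # optional dependency

    def similarity(a: str, b: str) -> float:
        """Similarity score from 0 to 100 using rapidfuzz."""
        return float(fuzz.ratio(a.lower(), b.lower()))

except ImportError:

    def similarity(a: str, b: str) -> float:
        """Similarity score from 0 to 100 using the standard library."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100


def best_match(
    query: str, candidates: Iterable[str], threshold: float = 85.0
) -> Optional[str]:
    """Return the closest candidate at or above the threshold, else None."""
    best, best_score = None, threshold
    for candidate in candidates:
        score = similarity(query, candidate)
        if score >= best_score:
            best, best_score = candidate, score
    return best
```
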
---

**Happy Karaoke! 🎤**