# 🎤 Karaoke Video Downloader
A Python-based cross-platform CLI tool to download karaoke videos from YouTube channels/playlists using `yt-dlp`, with advanced tracking, songlist prioritization, and flexible configuration. Supports Windows and macOS with automatic platform detection.
## ✨ Features
- 🎵 **Channel & Playlist Downloads**: Download all videos from a YouTube channel or playlist
- 📂 **Organized Storage**: Each channel gets its own folder in `downloads/`
- 📝 **Robust Tracking**: Tracks all downloads, statuses, and formats in JSON
- 🏆 **Songlist Prioritization**: Prioritize or restrict downloads to a custom songlist
- 🔄 **Batch Saving & Caching**: Efficient, minimizes API calls
- 🏷️ **ID3 Tagging**: Adds artist/title metadata to MP4 files
- 🧹 **Automatic Cleanup**: Removes extra yt-dlp files
- 📈 **Real-Time Progress**: Detailed console and log output
- 🧹 **Reset/Clear Channel**: Reset all tracking and files for a channel, or clear channel cache via CLI
- 🗂️ **Latest-per-channel download**: Download the latest N videos from each channel in a single batch, with server deduplication, fuzzy matching support, per-channel download plan, robust resume, and unique plan cache. Use `--latest-per-channel` and `--limit N`.
- 🧩 **Enhanced Fuzzy Matching**: Advanced fuzzy string matching for songlist-to-video matching with improved video title parsing (handles multiple title formats like "Title Karaoke | Artist Karaoke Version")
- **Fast Mode with Early Exit**: When a limit is set, scans channels and songs in order, downloads immediately when a match is found, and stops as soon as the limit is reached with successful downloads
- 🔄 **Deduplication Across Channels**: Ensures the same song is not downloaded from multiple channels, even if it appears in more than one channel's video list
- 📋 **Default Channel File**: Automatically uses `data/channels.txt` as the default channel list for songlist modes (no need to specify `--file` every time)
- 🛡️ **Robust Interruption Handling**: Progress is saved after each download, preventing re-downloads if the process is interrupted
- **Optimized Scanning**: High-performance channel scanning with O(n×m) complexity, pre-processed lookups, and early termination for faster matching
- 🏷️ **Server Duplicates Tracking**: Automatically checks against local songs.json file and marks duplicates for future skipping, preventing re-downloads of songs already on the server
- **Parallel Downloads**: Enable concurrent downloads with `--parallel --workers N` for significantly faster batch downloads (3-5x speedup)
- 📊 **Unmatched Songs Reports**: Generate detailed reports of songs that couldn't be found in any channel with `--generate-unmatched-report`
- 🛡️ **Duplicate File Prevention**: Automatically detects and prevents duplicate files with `(2)`, `(3)` suffixes, with cleanup utility for existing duplicates
- 🏷️ **Consistent Metadata**: Filename and ID3 tag use identical artist/title format for clear file identification
- 🍎 **macOS Support**: Automatic platform detection and setup with native macOS binaries and FFmpeg integration
## 🏗️ Architecture
The codebase has been comprehensively refactored into a modular architecture with centralized utilities for improved maintainability, error handling, and code reuse:
### **Configurable Data Directory (v3.4.7)**
- **Centralized Data Path Management**: `data_path_manager.py` provides unified data directory path management
- **Configurable Location**: Data directory path can be set in `config/config.json` under `folder_structure.data_dir`
- **Backward Compatibility**: Defaults to "data" directory if not configured
- **Cross-Project Integration**: Enables the karaoke downloader to be used as a component in other projects with different data directory structures
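
The lookup described above can be sketched as follows. This is a minimal illustration, not the actual code in `data_path_manager.py`: the function name `resolve_data_dir` is hypothetical, and only the `folder_structure.data_dir` key and the `"data"` default are taken from this README.

```python
import json
from pathlib import Path


def resolve_data_dir(config_path="config/config.json"):
    """Return the configured data directory, or "data" when it is not set.

    Hypothetical helper for illustration; not the project's real function.
    """
    path = Path(config_path)
    config = {}
    if path.is_file():
        with path.open(encoding="utf-8") as fh:
            config = json.load(fh)
    # Fall back to "data" when the file, the section, or the key is missing.
    data_dir = config.get("folder_structure", {}).get("data_dir") or "data"
    return Path(data_dir)
```

A host project could then point `folder_structure.data_dir` at its own directory without touching the downloader code.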
### Core Modules:
- **`downloader.py`**: Main orchestrator and CLI interface
- **`video_downloader.py`**: Core video download execution and orchestration
- **`tracking_manager.py`**: Download tracking and status management
- **`download_planner.py`**: Download plan building and channel scanning
- **`cache_manager.py`**: Cache operations and file I/O management
- **`channel_manager.py`**: Channel and file management operations
- **`songlist_manager.py`**: Songlist operations and tracking
- **`server_manager.py`**: Server song availability checking
- **`fuzzy_matcher.py`**: Fuzzy matching logic and similarity functions
### Utility Modules (v3.2):
- **`youtube_utils.py`**: Centralized YouTube operations and yt-dlp command generation
- **`error_utils.py`**: Standardized error handling and formatting
- **`download_pipeline.py`**: Abstracted download → verify → tag → track pipeline
- **`id3_utils.py`**: ID3 tagging utilities
- **`config_manager.py`**: Configuration management
- **`resolution_cli.py`**: Resolution checking utilities
- **`tracking_cli.py`**: Tracking management CLI
### New Utility Modules (v3.3):
- **`file_utils.py`**: Centralized file operations, filename sanitization, and file validation
- **`song_validator.py`**: Centralized song validation logic for checking if songs should be downloaded
### New Utility Modules (v3.4.7):
- **`data_path_manager.py`**: Centralized data directory path management and file path resolution
### **Unified Download Workflow (v3.4.5)**
- **`execute_unified_download_workflow()`**: Centralized download execution that all modes use
- **`_execute_sequential_downloads()`**: Sequential download execution using DownloadPipeline
- **`_execute_parallel_downloads()`**: Parallel download execution using ParallelDownloader
### **Benefits of Enhanced Modular Architecture:**
- **Single Responsibility**: Each module has a focused purpose
- **Centralized Utilities**: Common operations (file operations, song validation, yt-dlp commands, error handling) are centralized
- **Reduced Duplication**: Eliminated ~150 lines of code duplication across modules
- **Testability**: Individual components can be tested separately
- **Maintainability**: Easier to find and fix issues
- **Reusability**: Components can be used independently
- **Robustness**: Better error handling and interruption recovery
- **Consistency**: Standardized error messages and processing pipelines
- **Type Safety**: Comprehensive type hints across all new modules
- **Unified Execution**: All download modes use the same execution pipeline for consistency
## 🔧 Development Guidelines
### **Adding New Download Modes**
When adding new download modes, follow the unified workflow pattern to ensure consistency:
#### **1. Build Download Plan (Mode-Specific)**
```python
def download_new_mode(self, ...):
    # Build download plan with standard structure
    download_plan = []
    for video in videos_to_download:
        download_plan.append({
            "video_id": video["id"],
            "artist": artist,
            "title": title,
            "filename": filename,
            "channel_name": channel_name,
            "video_title": video["title"],
            "force_download": force_download
        })
    # Use unified execution workflow
    downloaded_count, success = self.execute_unified_download_workflow(
        download_plan=download_plan,
        cache_file=cache_file,
        limit=limit,
        show_progress=True,
    )
    return success
```
#### **2. Key Principles**
- **NEVER implement custom download execution logic** - always use `execute_unified_download_workflow()`
- **Focus on download plan building** - that's where mode-specific logic belongs
- **Use the standard download plan structure** for consistency
- **Implement cache file handling** for progress tracking and resume functionality
- **Test with both sequential and parallel modes** to ensure compatibility
#### **3. Benefits of Unified Architecture**
- **Consistency**: All modes behave identically for execution, progress tracking, and error handling
- **Automatic Features**: New modes automatically get parallel downloads, progress tracking, and cache management
- **Maintainability**: Changes to download execution only need to be made in one place
- **Reliability**: Eliminates broken pipelines and inconsistent behavior between modes
## 🔧 Recent Improvements (v3.4.1)
### **Enhanced Fuzzy Matching**
- **Improved title parsing**: Enhanced `extract_artist_title` function to handle multiple video title formats
- **Better matching accuracy**: Can now parse titles like "Hold On Loosely Karaoke | 38 Special Karaoke Version"
- **Consistent parsing**: All modules now use the same parsing logic from `fuzzy_matcher.py`
- **Reduced false negatives**: Songs that previously couldn't be matched due to title format differences now have a higher chance of being found
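
To make the title format above concrete, here is a small sketch of parsing the `Title Karaoke | Artist Karaoke Version` layout. The helper `parse_pipe_title` and its regular expression are assumptions made for illustration; the real `extract_artist_title` in `fuzzy_matcher.py` handles more formats and may work differently.

```python
import re

# Trailing karaoke markers to strip, e.g. "Karaoke" or "Karaoke Version".
_KARAOKE_SUFFIX = re.compile(r"\s*karaoke(\s+version)?\s*$", re.IGNORECASE)


def parse_pipe_title(video_title):
    """Parse 'Title Karaoke | Artist Karaoke Version' into (artist, title).

    Returns None when the title does not follow that layout.
    Hypothetical helper; not the project's actual parser.
    """
    if "|" not in video_title:
        return None
    # The title comes before the pipe, the artist after it.
    title_part, artist_part = (p.strip() for p in video_title.split("|", 1))
    title = _KARAOKE_SUFFIX.sub("", title_part).strip()
    artist = _KARAOKE_SUFFIX.sub("", artist_part).strip()
    if not title or not artist:
        return None
    return artist, title
```

With the example title quoted above, this sketch yields the artist `38 Special` and the title `Hold On Loosely`.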
### **Fixed Import Conflicts**
- **Resolved import conflicts**: Updated modules to use the enhanced `extract_artist_title` from `fuzzy_matcher.py`
- **Consistent behavior**: All parts of the system use the same parsing logic
- **Cleaner codebase**: Eliminated duplicate code and import conflicts
### **Fixed --limit Parameter**
- **Correct limit application**: The `--limit` parameter now properly limits the scanning phase, not just downloads
- **Improved performance**: When using `--limit N`, only the first N songs are scanned, significantly reducing processing time
- **Accurate logging**: Logging messages now show the correct counts for songs that will actually be processed when using `--limit`
### **Code Quality Improvements**
- **Eliminated duplicate functions**: Removed duplicate `extract_artist_title` implementations
- **Fixed import conflicts**: Resolved inconsistencies between different parsing implementations
- **Single source of truth**: All title parsing logic is now centralized in `fuzzy_matcher.py`
## 🔧 Recent Improvements (v3.4.5)
### **Unified Download Workflow Architecture**
- **Unified execution pipeline**: All download modes now use the same execution workflow, eliminating inconsistencies and broken pipelines
- **Consistent behavior**: All modes (--channel-focus, --all-videos, --songlist-only, --latest-per-channel) use identical download execution, progress tracking, and error handling
- **Centralized download logic**: Single `execute_unified_download_workflow()` method handles all download execution
- **Automatic parallel support**: All download modes automatically support `--parallel --workers N` without additional implementation
- **Unified cache management**: Consistent progress tracking and resume functionality across all modes
### **What Was Fixed**
- **Broken Pipeline**: Previously, different modes used different execution paths, leading to inconsistencies
- **Missing Method**: Added missing `download_latest_per_channel()` method that was referenced in CLI but not implemented
- **Code Duplication**: Eliminated duplicate download execution logic across different modes
- **Inconsistent Behavior**: All modes now have identical progress tracking, error handling, and cache management
### **Benefits**
- **Consistency**: All modes behave identically for execution, progress tracking, and error handling
- **Maintainability**: Changes to download execution only need to be made in one place
- **Reliability**: Eliminates broken pipelines and inconsistent behavior between modes
- **Extensibility**: New modes automatically get all existing features (parallel downloads, progress tracking, etc.)
- **Testing**: Easier to test since all modes use the same execution logic
## 🛡️ Duplicate File Prevention & Filename Consistency (v3.4.2)
### **Duplicate File Prevention**
- **Enhanced file existence checking**: Now detects files with `(2)`, `(3)`, etc. suffixes that yt-dlp creates
- **Automatic duplicate prevention**: Skips downloads when files already exist (including duplicates)
- **Updated yt-dlp configuration**: Set `"nooverwrites": false` to prevent yt-dlp from creating duplicate files
- **Cleanup utility**: `data/cleanup_duplicate_files.py` helps identify and remove existing duplicate files
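
The suffix check described above could look roughly like this. It is a sketch under stated assumptions: `find_existing_copies` is a hypothetical name, and the pattern assumes copies are named `<stem> (N).<ext>`, as in the `(2)`, `(3)` examples. The project's own check in `file_utils.py` may use different rules.

```python
import re
from pathlib import Path

# Matches a stem ending in " (2)", " (3)", ... and captures the part before it.
_NUMBERED_COPY = re.compile(r"^(?P<base>.+) \((?P<n>\d+)\)$")


def find_existing_copies(directory, filename):
    """Return names in `directory` that are `filename` or a numbered copy of it."""
    target = Path(filename)
    found = []
    for candidate in Path(directory).iterdir():
        if not candidate.is_file():
            continue
        if candidate.suffix.lower() != target.suffix.lower():
            continue
        match = _NUMBERED_COPY.match(candidate.stem)
        base = match.group("base") if match else candidate.stem
        # Accept the exact file, or a copy whose base name equals the target stem.
        if candidate.stem == target.stem or base == target.stem:
            found.append(candidate.name)
    return sorted(found)
```

A download would be skipped whenever this returns a non-empty list for the planned filename.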
### **Filename vs ID3 Tag Consistency**
- **Consistent metadata**: Filename and ID3 tag now use identical artist/title format
- **Removed extra suffixes**: No more "(Karaoke Version)" in ID3 tags that don't match filenames
- **Unified parsing**: Both filename generation and ID3 tagging use the same artist/title extraction
### **Benefits**
- **No more duplicate files** with `(2)`, `(3)` suffixes
- **Consistent metadata** between filename and ID3 tags
- **Efficient disk usage** by preventing unnecessary downloads
- **Clear file identification** with consistent naming
### **Clean Up Existing Duplicates**
```bash
# Run the cleanup utility to find and remove existing duplicates
python data/cleanup_duplicate_files.py
# Choose option 1 for dry run (recommended first)
# Choose option 2 to actually delete duplicates
```
## 📋 Requirements
- **Windows 10/11 or macOS 10.14+**
- **Python 3.7+**
- **yt-dlp binary** (platform-specific, see setup instructions below)
- **mutagen** (for ID3 tagging, optional)
- **ffmpeg/ffprobe** (for video validation, optional but recommended)
- **rapidfuzz** (for fuzzy matching, optional, falls back to difflib)
## 🍎 macOS Setup
### Automatic Setup (Recommended)
Run the macOS setup script to automatically set up yt-dlp and FFmpeg:
```bash
python3 setup_macos.py
```
This script will:
- Detect your macOS version
- Offer installation options for yt-dlp (pip or binary download)
- Install FFmpeg via Homebrew
- Test the installation
### Manual Setup
If you prefer to set up manually:
#### Option 1: Install yt-dlp via pip
```bash
pip3 install yt-dlp
```
#### Option 2: Download yt-dlp binary
```bash
mkdir -p downloader
curl -L -o downloader/yt-dlp_macos https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp_macos
chmod +x downloader/yt-dlp_macos
```
#### Install FFmpeg
```bash
brew install ffmpeg
```
### Test Installation
```bash
python3 src/tests/test_macos.py
```
## 🚀 Quick Start
> **💡 Pro Tip**: For a complete list of all available commands, see `commands.txt` - you can copy/paste any command directly into your terminal!
### Download a Channel
```bash
python download_karaoke.py https://www.youtube.com/@SingKingKaraoke/videos
```
### Download ALL Videos from a Channel (Not Just Songlist Matches)
```bash
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos
```
### Download ALL Videos with Parallel Processing
```bash
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --parallel --workers 10
```
### Download ALL Videos with Limit
```bash
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --limit 100
```
### Download Only Songlist Songs (Fast Mode)
```bash
python download_karaoke.py --songlist-only --limit 5
```
### Download with Parallel Processing
```bash
python download_karaoke.py --parallel --songlist-only --limit 10
```
### Focus on Specific Playlists by Title
```bash
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100"
```
### Focus on Specific Playlists from Custom File
```bash
python download_karaoke.py --songlist-focus "CCKaraoke" --songlist-file "data/my_custom_songlist.json"
```
### Force Download from Channels (Bypass All Existing File Checks)
```bash
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --force
```
### Download with Fuzzy Matching
```bash
python download_karaoke.py --songlist-only --limit 10 --fuzzy-match --fuzzy-threshold 85
```
### Test Download Plan (Dry Run)
```bash
python download_karaoke.py --songlist-only --limit 5 --dry-run
```
### Test Channel Download Plan (Dry Run)
```bash
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --limit 10 --dry-run
```
### Download Latest N Videos Per Channel
```bash
python download_karaoke.py --latest-per-channel --limit 5
```
### Download Latest N Videos Per Channel (with fuzzy matching)
```bash
python download_karaoke.py --latest-per-channel --limit 5 --fuzzy-match --fuzzy-threshold 85
```
### Prioritize Songlist in Download Queue
```bash
python download_karaoke.py --songlist-priority
```
### Show Songlist Download Progress
```bash
python download_karaoke.py --songlist-status
```
### Limit Number of Downloads
```bash
python download_karaoke.py --limit 5
```
### Override Resolution
```bash
python download_karaoke.py --resolution 1080p
```
### **Reset/Start Over for a Channel**
```bash
python download_karaoke.py --reset-channel SingKingKaraoke
```
### **Reset Channel and Songlist Songs**
```bash
python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist
```
### **Clear Channel Cache**
```bash
python download_karaoke.py --clear-cache SingKingKaraoke
python download_karaoke.py --clear-cache all
```
## 🧠 Songlist Integration
- Place your prioritized song list in `data/songList.json` (see example format below).
- The tool will match and prioritize these songs across all available channel videos.
- Use `--songlist-only` to download only these songs, or `--songlist-priority` to prioritize them in the queue.
- Use `--songlist-focus` to download only songs from specific playlists by title (e.g., `--songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100"`).
- Download progress for the songlist is tracked globally in `data/songlist_tracking.json`.
#### Example `data/songList.json`
```json
[
{
"title": "2025 - Apple Top 50",
"songs": [
{ "artist": "Kendrick Lamar & SZA", "title": "luther", "position": 1 },
{ "artist": "Kendrick Lamar", "title": "Not Like Us", "position": 2 }
]
},
{
"title": "2024 - Billboard Hot 100",
"songs": [
{ "artist": "Taylor Swift", "title": "Cruel Summer", "position": 1 },
{ "artist": "Billie Eilish", "title": "Happier Than Ever", "position": 2 }
]
}
]
```
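
As a rough illustration of how a file in this format can be filtered by playlist title, in the way `--songlist-focus` is described, consider the sketch below. The function `select_playlists` is hypothetical, and exact, case-sensitive title matching is an assumption; the tool's own matching rules are not specified here.

```python
import json


def select_playlists(songlist, focus_titles):
    """Return the songs from playlists whose title is in `focus_titles`."""
    wanted = set(focus_titles)
    songs = []
    for playlist in songlist:
        if playlist.get("title") in wanted:
            songs.extend(playlist.get("songs", []))
    return songs


# Inline sample in the same shape as data/songList.json.
sample = json.loads("""
[
  {"title": "2025 - Apple Top 50",
   "songs": [{"artist": "Kendrick Lamar & SZA", "title": "luther", "position": 1},
             {"artist": "Kendrick Lamar", "title": "Not Like Us", "position": 2}]},
  {"title": "2024 - Billboard Hot 100",
   "songs": [{"artist": "Taylor Swift", "title": "Cruel Summer", "position": 1}]}
]
""")

focused = select_playlists(sample, ["2025 - Apple Top 50"])
```

Here `focused` contains only the two songs from the "2025 - Apple Top 50" playlist.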
## 🛠️ Tracking & Caching
- **data/karaoke_tracking.json**: Tracks all downloads, statuses, and formats
- **data/songlist_tracking.json**: Tracks global songlist download progress
- **data/server_duplicates_tracking.json**: Tracks songs found to be duplicates on the server for future skipping
- **data/channel_cache.json**: Caches channel video lists for performance
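
Because progress is saved after each download, a tracking file should never be left half-written if the process is interrupted. One common way to achieve that is an atomic write, sketched below. This is an assumption about a reasonable approach, not a description of how `tracking_manager.py` actually writes its files; the name `save_json_atomically` is hypothetical.

```python
import json
import os
import tempfile


def save_json_atomically(path, data):
    """Write `data` as JSON to `path` without leaving a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # Write to a temporary file in the same directory first.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(data, fh, indent=2, ensure_ascii=False)
        # os.replace swaps the file in a single step on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Calling such a helper after every completed download keeps the last good state on disk even if the next download is interrupted.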
## 📂 Folder Structure
```
KaraokeVideoDownloader/
├── commands.txt # Complete CLI commands reference (copy/paste ready)
├── karaoke_downloader/ # All core Python code and utilities
│ ├── downloader.py # Main orchestrator and CLI interface
│ ├── cli.py # CLI entry point
│ ├── video_downloader.py # Core video download execution and orchestration
│ ├── tracking_manager.py # Download tracking and status management
│ ├── download_planner.py # Download plan building and channel scanning
│ ├── cache_manager.py # Cache operations and file I/O management
│ ├── channel_manager.py # Channel and file management operations
│ ├── songlist_manager.py # Songlist operations and tracking
│ ├── server_manager.py # Server song availability checking
│ ├── fuzzy_matcher.py # Fuzzy matching logic and similarity functions
│ ├── youtube_utils.py # Centralized YouTube operations and yt-dlp commands
│ ├── error_utils.py # Standardized error handling and formatting
│ ├── download_pipeline.py # Abstracted download → verify → tag → track pipeline
│ ├── id3_utils.py # ID3 tagging utilities
│ ├── config_manager.py # Configuration management with dataclasses
│ ├── file_utils.py # Centralized file operations and filename handling
│ ├── song_validator.py # Centralized song validation logic
│ ├── check_resolution.py # Resolution checker utility
│ ├── resolution_cli.py # Resolution config CLI
│ └── tracking_cli.py # Tracking management CLI
├── config/ # Configuration files
│ └── config.json # Main configuration file
├── data/ # All tracking, cache, and songlist files
│ ├── karaoke_tracking.json
│ ├── songlist_tracking.json
│ ├── channel_cache.json
│ ├── channels.json # Channel configuration with parsing rules
│ └── songList.json
├── utilities/ # Utility scripts and tools
│ ├── add_manual_video.py # Manual video management
│ ├── build_cache_from_raw.py # Cache building utility
│ ├── cleanup_duplicate_files.py # File cleanup utilities
│ ├── cleanup_recent_tracking.py # Tracking cleanup utilities
│ ├── deduplicate_songlist_tracking.py # Data deduplication
│ ├── fix_artist_name_format.py # Data cleanup utilities
│ ├── fix_artist_name_format_simple.py
│ ├── fix_code_quality.py # Development tools
│ ├── reset_and_redownload.py # Maintenance utilities
│ └── songlist_report.py # Reporting utilities
├── downloads/ # All video output
│ └── [ChannelName]/ # Per-channel folders
├── logs/ # Download logs
├── downloader/yt-dlp.exe # yt-dlp binary (Windows)
├── downloader/yt-dlp_macos # yt-dlp binary (macOS)
├── src/tests/ # Test scripts
│ ├── test_macos.py # macOS setup and functionality tests
│ └── test_platform.py # Platform detection tests
├── download_karaoke.py # Main entry point (thin wrapper)
├── README.md
├── PRD.md
├── requirements.txt
└── download_karaoke.bat # (optional Windows launcher)
```
## 🚦 CLI Options
> **📋 Complete Command Reference**: See `commands.txt` for all available commands with examples - perfect for copy/paste!
### Key Options:
- `--file <data/channels.txt>`: Download from a list of channels (optional, defaults to data/channels.txt for songlist modes)
- `--songlist-priority`: Prioritize songlist songs in download queue
- `--songlist-only`: Download only songs from the songlist
- `--songlist-focus <PLAYLIST_TITLE1> <PLAYLIST_TITLE2>...`: Focus on specific playlists by title (e.g., `--songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100"`)
- `--songlist-file <FILE_PATH>`: Custom songlist file path to use with --songlist-focus (default: data/songList.json)
- `--songlist-status`: Show songlist download progress
- `--limit <N>`: Limit number of downloads (enables fast mode with early exit)
- `--resolution <720p|1080p|...>`: Override resolution
- `--status`: Show download/tracking status
- `--reset-channel <CHANNEL_NAME>`: **Reset all tracking and files for a channel**
- `--reset-songlist`: **When used with --reset-channel, also reset songlist songs for this channel**
- `--clear-cache <CHANNEL_ID|all>`: **Clear channel video cache for a specific channel or all**
- `--clear-server-duplicates`: **Clear server duplicates tracking (allows re-checking songs against server)**
- `--latest-per-channel`: **Download the latest N videos from each channel (use with --limit)**
- `--fuzzy-match`: Enable fuzzy matching for songlist-to-video matching (uses rapidfuzz if available)
- `--fuzzy-threshold <N>`: Fuzzy match threshold (0-100, default 85)
- `--parallel`: Enable parallel downloads for improved speed (defaults to 3 workers)
- `--workers <N>`: Number of parallel download workers (1-10, default: 3, only used with --parallel)
- `--generate-songlist <DIR1> <DIR2>...`: **Generate song list from MP4 files with ID3 tags in specified directories**
- `--no-append-songlist`: **Create a new song list instead of appending when using --generate-songlist**
- `--force`: **Force download from channels, bypassing all existing file checks and re-downloading if necessary**
- `--channel-focus <CHANNEL_NAME>`: **Download from a specific channel by name (e.g., 'SingKingKaraoke')**
- `--all-videos`: **Download all videos from channel (not just songlist matches), skipping existing files**
- `--dry-run`: **Build download plan and show what would be downloaded without actually downloading anything**
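
To show what a 0-100 fuzzy threshold means in practice, the sketch below scores two strings with rapidfuzz when it is installed and falls back to `difflib` otherwise. The helper names `similarity` and `is_fuzzy_match` are hypothetical, and the choice of `fuzz.ratio` as the scorer is an assumption; `fuzzy_matcher.py` may use a different scorer or extra normalization.

```python
from difflib import SequenceMatcher

try:
    from rapidfuzz import fuzz  # optional dependency

    def similarity(a, b):
        """Score two strings from 0 to 100 using rapidfuzz."""
        return fuzz.ratio(a.lower(), b.lower())
except ImportError:
    def similarity(a, b):
        """Score two strings from 0 to 100 using the standard library."""
        # SequenceMatcher.ratio() returns 0..1, so scale it to 0..100.
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100


def is_fuzzy_match(song, video_title, threshold=85):
    """Return True when the similarity score reaches the threshold."""
    return similarity(song, video_title) >= threshold
```

Lowering the threshold (for example to 80) accepts looser matches at the cost of more false positives.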
## 📝 Example Usage
> **💡 For complete examples**: See `commands.txt` for all command variations with explanations!
```bash
# Fast mode with fuzzy matching (no need to specify --file)
python download_karaoke.py --songlist-only --limit 10 --fuzzy-match --fuzzy-threshold 85
# Parallel downloads for faster processing
python download_karaoke.py --parallel --songlist-only --limit 10
# Latest videos per channel with parallel downloads
python download_karaoke.py --parallel --latest-per-channel --limit 5
# Traditional full scan (no limit)
python download_karaoke.py --songlist-only
# Focused fuzzy matching (target specific playlists with flexible matching)
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --fuzzy-match --fuzzy-threshold 80 --limit 10
# Focus on specific playlists from a custom file
python download_karaoke.py --songlist-focus "CCKaraoke" --songlist-file "data/my_custom_songlist.json" --limit 10
# Force download with fuzzy matching (bypass all existing file checks)
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --force --fuzzy-match --fuzzy-threshold 80 --limit 10
# Channel-specific operations
python download_karaoke.py --reset-channel SingKingKaraoke
python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist
python download_karaoke.py --clear-cache all
python download_karaoke.py --clear-server-duplicates
# Download ALL videos from a specific channel
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --parallel --workers 10
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --limit 100
# Song list generation from MP4 files
python download_karaoke.py --generate-songlist /path/to/mp4/directory
python download_karaoke.py --generate-songlist /path/to/dir1 /path/to/dir2 --no-append-songlist
# Generate report of songs that couldn't be found
python download_karaoke.py --generate-unmatched-report
python download_karaoke.py --generate-unmatched-report --fuzzy-match --fuzzy-threshold 85
```
## 🏷️ ID3 Tagging
- Adds artist/title/album/genre to MP4 files using mutagen (if installed)
## 📋 Song List Generation
- **Generate song lists from existing MP4 files**: Use `--generate-songlist` to create song lists from directories containing MP4 files with ID3 tags
- **Automatic ID3 extraction**: Extracts artist and title from MP4 files' ID3 tags
- **Directory-based organization**: Each directory becomes a playlist with the directory name as the title
- **Position tracking**: Songs are numbered starting from 1 based on file order
- **Append or replace**: Choose to append to existing song list or create a new one with `--no-append-songlist`
- **Multiple directories**: Process multiple directories in a single command
## 🧹 Cleanup
- Removes `.info.json` and `.meta` files after download
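
A cleanup step of this kind can be sketched in a few lines. The function name `remove_sidecar_files` is hypothetical, and scanning only the top level of one folder is an assumption; the project's `cleanup_temp_files()` may cover more file types or recurse into subfolders.

```python
from pathlib import Path

# Leftover file types named in this README.
TEMP_SUFFIXES = (".info.json", ".meta")


def remove_sidecar_files(directory):
    """Delete leftover sidecar files in `directory`; return the names removed."""
    removed = []
    for path in Path(directory).iterdir():
        if path.is_file() and path.name.endswith(TEMP_SUFFIXES):
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```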
## 🛠️ Configuration
- All options are in `config/config.json` (format, resolution, metadata, etc.)
- You can edit this file or use CLI flags to override
- **Configurable Data Directory**: The data directory path can be configured in `config/config.json` under `folder_structure.data_dir` (default: "data")
## 📋 Command Reference File
**`commands.txt`** contains a comprehensive list of all CLI commands with explanations. This file is designed for easy copy/paste usage and includes:
- All basic download commands
- Songlist operations
- Latest-per-channel downloads
- Cache and tracking management
- Reset and cleanup operations
- Advanced combinations
- Common workflows
- Troubleshooting commands
> **🔄 Maintenance Note**: The `commands.txt` file should be kept up to date with any CLI changes. When adding new command-line options or modifying existing ones, update this file to reflect all available commands and their usage.
## 📚 Documentation Standards
### **Documentation Location**
- **All changes, refactoring, and improvements should be documented in the PRD.md and README.md files**
- **Do NOT create separate .md files for documenting changes, refactoring, or improvements**
- **Use the existing sections in PRD.md and README.md to track all project evolution**
### **Where to Document Changes**
- **PRD.md**: Technical details, architecture changes, bug fixes, and implementation specifics
- **README.md**: User-facing features, usage instructions, and high-level improvements
- **CHANGELOG.md**: Version-specific release notes and change summaries
### **Documentation Requirements**
- **All new features must be documented in both PRD.md and README.md**
- **All refactoring efforts must be documented in the appropriate sections**
- **All bug fixes must be documented with technical details**
- **Version numbers and dates should be clearly marked**
- **Benefits and improvements should be explicitly stated**
### **Maintenance Responsibility**
- **Keep PRD.md and README.md synchronized with code changes**
- **Update documentation immediately when implementing new features**
- **Remove outdated information and consolidate related changes**
- **Ensure all CLI options and features are documented in both files**
## 🔧 Refactoring Improvements (v3.3)
The codebase has been comprehensively refactored to improve maintainability and reduce code duplication. Recent improvements have enhanced reliability, performance, and code organization:
### **New Utility Modules (v3.3)**
- **`file_utils.py`**: Centralized file operations, filename sanitization, and file validation
  - `sanitize_filename()`: Create safe filenames from artist/title
  - `generate_possible_filenames()`: Generate filename patterns for different modes
  - `check_file_exists_with_patterns()`: Check for existing files using multiple patterns
  - `is_valid_mp4_file()`: Validate MP4 files with header checking
  - `cleanup_temp_files()`: Remove temporary yt-dlp files
  - `ensure_directory_exists()`: Safe directory creation
- **`song_validator.py`**: Centralized song validation logic
  - `SongValidator` class: Unified logic for checking if songs should be downloaded
  - `should_skip_song()`: Comprehensive validation with multiple criteria
  - `mark_song_failed()`: Consistent failure tracking
  - `handle_download_failure()`: Standardized error handling
- **Enhanced `config_manager.py`**: Robust configuration management with dataclasses
  - `ConfigManager` class: Type-safe configuration loading and caching
  - `DownloadSettings`, `FolderStructure`, `LoggingConfig` dataclasses
  - Configuration validation and merging with defaults
  - Dynamic resolution updates
### **Benefits Achieved**
- **Eliminated Code Duplication**: ~150 lines of duplicate code removed across modules
- **Centralized File Operations**: Single source of truth for filename handling and file validation
- **Unified Song Validation**: Consistent logic for checking if songs should be downloaded
- **Enhanced Type Safety**: Comprehensive type hints across all new modules
- **Improved Configuration Management**: Structured configuration with validation and caching
- **Better Error Handling**: Consistent patterns via centralized utilities
- **Enhanced Maintainability**: Changes to file operations or song validation only require updates in one place
- **Improved Testability**: Modular components can be tested independently
- **Better Developer Experience**: Clear function signatures and comprehensive documentation
### **New Parallel Download System (v3.4)**
- **Parallel downloader module:** `parallel_downloader.py` provides thread-safe concurrent download management
- **Configurable concurrency:** Use `--parallel` to enable parallel downloads with 3 workers by default, or `--parallel --workers N` for custom worker count (1-10)
- **Thread-safe operations:** All tracking, caching, and progress operations are thread-safe
- **Real-time progress tracking:** Shows active downloads, completion status, and overall progress
- **Automatic retry mechanism:** Failed downloads are automatically retried with reduced concurrency
- **Backward compatibility:** Sequential downloads remain the default when `--parallel` is not used
- **Performance improvements:** Significantly faster downloads for large batches (3-5x speedup with 3-5 workers)
- **Integrated with all modes:** Works with both songlist-across-channels and latest-per-channel download modes
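
The worker-pool idea behind these points can be sketched with the standard library alone. This is not the code in `parallel_downloader.py`: the function `run_parallel` and its `download_one` callback are hypothetical, and retries are left out. Only the default of 3 workers and the 1-10 range come from this README.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_parallel(items, download_one, workers=3):
    """Run `download_one(item)` on a thread pool; return {item: succeeded}."""
    workers = max(1, min(10, workers))  # clamp to the documented 1-10 range
    results = {}
    lock = threading.Lock()  # worker threads share the `results` dict

    def worker(item):
        try:
            ok = bool(download_one(item))
        except Exception:
            ok = False  # one failed download should not stop the batch
        with lock:
            results[item] = ok

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consuming the iterator waits until every task has finished.
        list(pool.map(worker, items))
    return results
```

Items must be hashable (for example video IDs), and the returned mapping can be used to decide which downloads to retry.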
### **Previous Improvements (v3.2)**
- **Centralized yt-dlp Command Generation**: Standardized command building and execution across all download operations
- **Enhanced Error Handling**: Structured exception hierarchy with consistent error messages and formatting
- **Abstracted Download Pipeline**: Reusable download → verify → tag → track process for consistent processing
- **Download plan pre-scan:** Before downloading, the tool scans all channels for songlist matches, builds a download plan, and prints stats (matches, unmatched, per-channel breakdown). The plan is cached for 1 day and reused unless `--force-download-plan` is set.
- **Latest-per-channel plan:** Download the latest N videos from each channel, with a per-channel plan and robust resume. Each channel is removed from the plan as it completes. Plan cache is deleted when all channels are done.
- **Fast mode with early exit:** When a limit is set, the tool scans channels and songs in order, downloads immediately when a match is found, and stops as soon as the limit is reached with successful downloads. This provides much faster performance for small limits compared to the full pre-scan approach.
- **Deduplication across channels:** Tracks unique song keys (artist + normalized title) to ensure the same song is not downloaded from multiple channels, even if it appears in more than one channel's video list.
- **Fuzzy matching:** Uses string similarity algorithms to find approximate matches between songlist entries and video titles, tolerating minor differences, typos, or extra words like "Karaoke" or "Official Video".
- **Default channel file:** For songlist-only and latest-per-channel modes, if no --file is specified, automatically uses data/channels.txt as the default channel list, reducing the need to specify the file path repeatedly.
- **Robust interruption handling:** Progress is saved after each download, and files are checked for existence before downloading to prevent re-downloads if the process is interrupted.
- **Optimized scanning algorithm:** High-performance channel scanning with O(n×m) complexity, pre-processed song lookups using sets and dictionaries, and early termination for faster matching of large songlists and channels.
- **Enhanced cache management:** Improved channel cache key handling for better cache hit rates and reduced YouTube API calls.
- **Robust download plan execution:** Fixed index management in download plan execution to prevent errors during interrupted downloads.
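
Several of the points above (set-based lookups, deduplication by song key, and early exit once a limit is reached) can be combined in one short sketch. All names here are hypothetical, and the normalization (lowercasing and collapsing whitespace on both artist and title) is an assumption. The sketch also only builds a plan, whereas fast mode in the tool downloads as soon as a match is found.

```python
def normalize(text):
    """Lowercase and collapse whitespace so near-identical keys compare equal."""
    return " ".join(text.lower().split())


def plan_with_limit(channels, wanted, limit=None):
    """Pick wanted songs from channels in scan order, at most once each.

    channels: list of (channel_name, [(artist, title), ...]) in scan order.
    wanted:   iterable of (artist, title) pairs from the songlist.
    limit:    stop after this many picks; None means no limit.
    """
    # Pre-process the songlist into a set for constant-time membership tests.
    wanted_keys = {(normalize(a), normalize(t)) for a, t in wanted}
    seen = set()  # song keys already planned from an earlier channel
    plan = []
    for channel_name, videos in channels:
        for artist, title in videos:
            key = (normalize(artist), normalize(title))
            if key not in wanted_keys or key in seen:
                continue
            seen.add(key)
            plan.append((channel_name, artist, title))
            if limit is not None and len(plan) >= limit:
                return plan  # early exit: no need to scan further
    return plan
```

A song that appears in two channels is taken from whichever channel is scanned first, so the order of the channel list acts as a priority order.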
## 🐞 Troubleshooting
- **Windows**: Ensure `yt-dlp.exe` is in the `downloader/` folder
- **macOS**: Run `python3 setup_macos.py` to set up yt-dlp and FFmpeg
- Check `logs/` for error details
- Use `python -m karaoke_downloader.check_resolution` to verify video quality
- If you see errors about ffmpeg/ffprobe, install [ffmpeg](https://ffmpeg.org/download.html) and ensure it is in your PATH
- For best fuzzy matching, install rapidfuzz: `pip install rapidfuzz` (otherwise falls back to slower, less accurate difflib)
---
**Happy Karaoke! 🎤**