# 🎤 Karaoke Video Downloader PRD (v3.5)
## ✅ Overview
A Python-based cross-platform CLI tool to download karaoke videos from YouTube channels/playlists using `yt-dlp`, with advanced tracking, songlist prioritization, and flexible configuration. Supports Windows, macOS, and Linux with automatic platform detection and optimized caching. The codebase has been comprehensively refactored into a modular architecture with centralized utilities for improved maintainability, error handling, and code reuse.
---
## 🏗️ Architecture
The codebase has been refactored into focused modules with centralized utilities:
### Core Modules:
- **`downloader.py`**: Main orchestrator and CLI interface
- **`video_downloader.py`**: Core video download execution and orchestration
- **`tracking_manager.py`**: Download tracking and status management
- **`download_planner.py`**: Download plan building and channel scanning
- **`cache_manager.py`**: Cache operations and file I/O management
- **`channel_manager.py`**: Channel and file management operations
- **`songlist_manager.py`**: Songlist operations and tracking
- **`server_manager.py`**: Server song availability checking
- **`fuzzy_matcher.py`**: Fuzzy matching logic and similarity functions
### Utility Modules (v3.2):
- **`youtube_utils.py`**: Centralized YouTube operations and yt-dlp command generation
- **`error_utils.py`**: Standardized error handling and formatting
- **`download_pipeline.py`**: Abstracted download → verify → tag → track pipeline
- **`id3_utils.py`**: ID3 tagging utilities
- **`config_manager.py`**: Configuration management
- **`resolution_cli.py`**: Resolution checking utilities
- **`tracking_cli.py`**: Tracking management CLI
### New Utility Modules (v3.3):
- **`file_utils.py`**: Centralized file operations, filename sanitization, and file validation
- **`song_validator.py`**: Centralized song validation logic for checking if songs should be downloaded
### Benefits of Enhanced Modular Architecture:
- **Single Responsibility**: Each module has a focused purpose
- **Centralized Utilities**: Common operations (file operations, song validation, yt-dlp commands, error handling) are centralized
- **Reduced Duplication**: Eliminated ~150 lines of code duplication across modules
- **Testability**: Individual components can be tested separately
- **Maintainability**: Easier to find and fix issues
- **Reusability**: Components can be used independently
- **Robustness**: Better error handling and interruption recovery
- **Consistency**: Standardized error messages and processing pipelines
- **Type Safety**: Comprehensive type hints across all new modules
---
## 📋 Goals
- Download karaoke videos from YouTube channels or playlists.
- Organize downloads by channel (or playlist) in subfolders.
- Avoid re-downloading the same videos (robust tracking).
- Prioritize and track a custom songlist across channels.
- Allow flexible, user-friendly configuration.
- Provide robust interruption handling and progress recovery.
---
## 🧑‍💻 Target Users
- Karaoke DJs, home karaoke users, event hosts, or anyone needing offline karaoke video libraries.
- Users comfortable with command-line tools.
---
## ⚙️ Platform & Stack
- **Platform:** Windows, macOS, Linux
- **Interface:** Command-line (CLI)
- **Tech Stack:** Python 3.7+, yt-dlp (platform-specific binary), mutagen (for ID3 tagging)
---
## 📥 Input
- YouTube channel or playlist URLs (e.g. `https://www.youtube.com/@SingKingKaraoke/videos`)
- Optional: `data/channels.txt` file with multiple channel URLs (one per line) - **now defaults to this file if not specified**
- Optional: `data/songList.json` for prioritized song downloads
### Example Usage
```bash
python download_karaoke.py https://www.youtube.com/@SingKingKaraoke/videos
python download_karaoke.py --songlist-only --limit 5
python download_karaoke.py --latest-per-channel --limit 3
python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist
python download_karaoke.py --clear-cache SingKingKaraoke
```
---
## 📤 Output
- MP4 files in `downloads/<ChannelName>/` subfolders
- All videos tracked in `data/karaoke_tracking.json`
- Songlist progress tracked in `data/songlist_tracking.json`
- Logs in `logs/`
---
## 🛠️ Features
- ✅ Channel-based downloads (with per-channel folders)
- ✅ Robust JSON tracking (downloaded, partial, failed, etc.)
- ✅ Batch saving and channel video caching for performance
- ✅ Configurable download resolution and yt-dlp options (`data/config.json`)
- ✅ Songlist integration: prioritize and track custom songlists
- ✅ Songlist-only mode: download only songs from the songlist
- ✅ Songlist focus mode: download only songs from specific playlists by title
- ✅ Global songlist tracking to avoid duplicates across channels
- ✅ ID3 tagging for artist/title in MP4 files (mutagen)
- ✅ Real-time progress and detailed logging
- ✅ Automatic cleanup of extra yt-dlp files
- ✅ **Reset/clear channel tracking and files via CLI**
- ✅ **Clear channel cache via CLI**
- ✅ **Download plan pre-scan and caching**: Before downloading, the tool pre-scans all channels for songlist matches, builds a download plan, and prints stats. The plan is cached for 1 day in `data/download_plan_cache.json` for fast resuming and reliability. Use `--force-download-plan` to force a refresh.
- ✅ **Latest-per-channel download**: Download the latest N videos from each channel in a single batch, with a per-channel download plan, robust resume, and unique plan cache. Use `--latest-per-channel` and `--limit N`.
- ✅ **Fast mode with early exit**: When a limit is set, the tool scans channels and songs in order, downloads immediately when a match is found, and stops as soon as the limit is reached with successful downloads. If a download fails, it continues scanning until the limit is satisfied or all channels are exhausted.
- ✅ **Deduplication across channels**: Ensures the same song (by artist + normalized title) is not downloaded more than once, even if it appears in multiple channels. Tracks unique keys and skips duplicates.
- ✅ **Fuzzy matching**: Optionally use fuzzy string matching for songlist-to-video matching with a configurable threshold (0-100, default 85). Uses rapidfuzz if available, falls back to difflib.
- ✅ **Default channel file**: If no `--file` is specified for songlist-only or latest-per-channel modes, automatically uses `data/channels.txt` as the default channel list.
- ✅ **Robust interruption handling**: Progress is saved after each download, and files are checked for existence before downloading to prevent re-downloads if the process is interrupted.
- ✅ **Optimized scanning performance**: High-performance channel scanning with O(n×m) complexity, pre-processed lookups, and early termination for faster matching of large songlists and channels.
- ✅ **Centralized yt-dlp command generation**: Standardized command building and execution across all download operations
- ✅ **Enhanced error handling**: Structured exception hierarchy with consistent error messages and formatting
- ✅ **Abstracted download pipeline**: Reusable download → verify → tag → track process for consistent processing
- ✅ **Reduced code duplication**: Eliminated duplicate code across modules through centralized utilities
- ✅ **Centralized file operations**: Single source of truth for filename sanitization, file validation, and path operations
- ✅ **Centralized song validation**: Unified logic for checking if songs should be downloaded across all modules
- ✅ **Enhanced configuration management**: Structured configuration with dataclasses, type safety, and validation
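
The fuzzy matching feature can be illustrated with a minimal sketch. It assumes the behavior described in the list: prefer `rapidfuzz` when installed, fall back to `difflib`, and compare a 0-100 score against a threshold that defaults to 85. The function names `similarity`, `normalize`, and `is_fuzzy_match` are hypothetical and are not taken from `fuzzy_matcher.py`.

```python
from difflib import SequenceMatcher

try:
    # rapidfuzz is optional; fuzz.ratio returns a score from 0 to 100.
    from rapidfuzz import fuzz

    def similarity(a: str, b: str) -> float:
        return fuzz.ratio(a, b)

except ImportError:

    def similarity(a: str, b: str) -> float:
        # difflib returns 0.0-1.0, so scale it to the same 0-100 range.
        return SequenceMatcher(None, a, b).ratio() * 100


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed, simplified normalization)."""
    return " ".join(text.lower().split())


def is_fuzzy_match(song: str, video_title: str, threshold: float = 85) -> bool:
    """Return True when the two strings are at least `threshold` percent similar."""
    return similarity(normalize(song), normalize(video_title)) >= threshold
```

Both backends score identical strings at 100, so exact matches always pass. Scores for near matches can differ slightly between the two libraries, which is worth keeping in mind when tuning `--fuzzy-threshold`.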
---
## 📂 Folder Structure
```
KaraokeVideoDownloader/
├── karaoke_downloader/ # All core Python code and utilities
│ ├── downloader.py # Main orchestrator and CLI interface
│ ├── cli.py # CLI entry point
│ ├── video_downloader.py # Core video download execution and orchestration
│ ├── tracking_manager.py # Download tracking and status management
│ ├── download_planner.py # Download plan building and channel scanning
│ ├── cache_manager.py # Cache operations and file I/O management
│ ├── channel_manager.py # Channel and file management operations
│ ├── songlist_manager.py # Songlist operations and tracking
│ ├── server_manager.py # Server song availability checking
│ ├── fuzzy_matcher.py # Fuzzy matching logic and similarity functions
│ ├── youtube_utils.py # Centralized YouTube operations and yt-dlp commands
│ ├── error_utils.py # Standardized error handling and formatting
│ ├── download_pipeline.py # Abstracted download → verify → tag → track pipeline
│ ├── id3_utils.py # ID3 tagging utilities
│ ├── config_manager.py # Configuration management with dataclasses
│ ├── file_utils.py # Centralized file operations and filename handling
│ ├── song_validator.py # Centralized song validation logic
│ ├── check_resolution.py # Resolution checker utility
│ ├── resolution_cli.py # Resolution config CLI
│ └── tracking_cli.py # Tracking management CLI
├── data/ # All config, tracking, cache, and songlist files
│ ├── config.json
│ ├── karaoke_tracking.json
│ ├── songlist_tracking.json
│ ├── channel_cache.json
│ ├── channels.txt
│ └── songList.json
├── downloads/ # All video output
│ └── [ChannelName]/ # Per-channel folders
├── logs/ # Download logs
├── downloader/yt-dlp.exe # yt-dlp binary (Windows)
├── downloader/yt-dlp_macos # yt-dlp binary (macOS)
├── downloader/yt-dlp # yt-dlp binary (Linux)
├── tests/ # Diagnostic and test scripts
│ └── test_installation.py
├── download_karaoke.py # Main entry point (thin wrapper)
├── README.md
├── PRD.md
├── requirements.txt
└── download_karaoke.bat # (optional Windows launcher)
```
---
## 🚦 CLI Options (Summary)
- `--file <data/channels.txt>`: Download from a list of channels (optional, defaults to data/channels.txt for songlist modes)
- `--songlist-priority`: Prioritize songlist songs in download queue
- `--songlist-only`: Download only songs from the songlist
- `--songlist-focus <PLAYLIST_TITLE1> <PLAYLIST_TITLE2>...`: Focus on specific playlists by title (e.g., `--songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100"`)
- `--songlist-status`: Show songlist download progress
- `--limit <N>`: Limit number of downloads (enables fast mode with early exit)
- `--resolution <720p|1080p|...>`: Override resolution
- `--status`: Show download/tracking status
- `--reset-channel <CHANNEL_NAME>`: **Reset all tracking and files for a channel**
- `--reset-songlist`: **When used with --reset-channel, also reset songlist songs for this channel**
- `--clear-cache <CHANNEL_ID|all>`: **Clear channel video cache for a specific channel or all**
- `--force-download-plan`: **Force refresh the download plan cache (re-scan all channels for matches)**
- `--latest-per-channel`: **Download the latest N videos from each channel (use with --limit)**
- `--fuzzy-match`: **Enable fuzzy matching for songlist-to-video matching (uses rapidfuzz if available)**
- `--fuzzy-threshold <N>`: **Fuzzy match threshold (0-100, default 85)**
- `--parallel`: **Enable parallel downloads for improved speed**
- `--workers <N>`: **Number of parallel download workers (1-10, default: 3)**
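
As a rough illustration of how a subset of these options could be declared, here is a sketch using `argparse`. It covers only some of the flags, and the helper names `build_parser` and `resolve_channel_file` are hypothetical; the actual parser in `cli.py` may be organized differently.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Declare a subset of the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="download_karaoke.py")
    parser.add_argument("url", nargs="?", help="Channel or playlist URL")
    parser.add_argument("--file", default=None, help="File with channel URLs")
    parser.add_argument("--songlist-only", action="store_true")
    parser.add_argument("--latest-per-channel", action="store_true")
    parser.add_argument("--limit", type=int, default=None, metavar="N")
    parser.add_argument("--fuzzy-match", action="store_true")
    parser.add_argument("--fuzzy-threshold", type=int, default=85, metavar="N")
    parser.add_argument("--parallel", action="store_true")
    parser.add_argument("--workers", type=int, default=3, metavar="N",
                        choices=range(1, 11))
    return parser


def resolve_channel_file(args: argparse.Namespace) -> Optional[str]:
    """Fall back to data/channels.txt for the songlist and latest-per-channel modes."""
    if args.file is None and (args.songlist_only or args.latest_per_channel):
        return "data/channels.txt"
    return args.file
```

For example, `build_parser().parse_args(["--songlist-only", "--limit", "5"])` yields a namespace for which `resolve_channel_file` returns `data/channels.txt`.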
---
## 🧠 Logic Highlights
- **Tracking:** All downloads, statuses, and formats are tracked in JSON files for reliability and deduplication.
- **Songlist:** Loads and normalizes `data/songList.json`, matches against available videos, and prioritizes or restricts downloads accordingly.
- **Batch/Caching:** Channel video lists are cached to minimize API calls; tracking is batch-saved for performance.
- **ID3 Tagging:** Artist/title extracted from video title and embedded in MP4 files.
- **Cleanup:** Extra files from yt-dlp (e.g., `.info.json`) are automatically removed after download.
- **Reset/Clear:** Use `--reset-channel` to reset all tracking and files for a channel (optionally including songlist songs with `--reset-songlist`). Use `--clear-cache` to clear cached video lists for a channel or all channels.
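
To make the ID3 tagging step more concrete, the sketch below shows one way artist and title could be extracted from a video title. It assumes titles follow an `Artist - Title (Karaoke Version)` pattern, which is an assumption about channel naming rather than something this document specifies. The name `extract_artist_title` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Trailing "(... karaoke ...)" or "[... karaoke ...]" suffix, case-insensitive.
_KARAOKE_SUFFIX = re.compile(
    r"\s*[\(\[][^\)\]]*karaoke[^\)\]]*[\)\]]\s*$", re.IGNORECASE
)


def extract_artist_title(video_title: str) -> Optional[Tuple[str, str]]:
    """Split 'Artist - Title (Karaoke Version)' into (artist, title).

    Returns None when the title does not contain an ' - ' separator.
    """
    cleaned = _KARAOKE_SUFFIX.sub("", video_title).strip()
    artist, separator, title = cleaned.partition(" - ")
    if not separator or not artist.strip() or not title.strip():
        return None
    return artist.strip(), title.strip()
```

The resulting pair could then be written to the MP4 with mutagen. Titles that do not match the pattern return `None`, so a caller can skip tagging instead of writing wrong metadata.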
---
## 🔧 Refactoring Improvements (v3.3)
The codebase has been comprehensively refactored to improve maintainability and reduce code duplication. Recent improvements have enhanced reliability, performance, and code organization:
### **New Utility Modules (v3.3)**
- **`file_utils.py`**: Centralized file operations, filename sanitization, and file validation
  - `sanitize_filename()`: Create safe filenames from artist/title
  - `generate_possible_filenames()`: Generate filename patterns for different modes
  - `check_file_exists_with_patterns()`: Check for existing files using multiple patterns
  - `is_valid_mp4_file()`: Validate MP4 files with header checking
  - `cleanup_temp_files()`: Remove temporary yt-dlp files
  - `ensure_directory_exists()`: Safe directory creation
- **`song_validator.py`**: Centralized song validation logic
  - `SongValidator` class: Unified logic for checking if songs should be downloaded
  - `should_skip_song()`: Comprehensive validation with multiple criteria
  - `mark_song_failed()`: Consistent failure tracking
  - `handle_download_failure()`: Standardized error handling
- **Enhanced `config_manager.py`**: Robust configuration management with dataclasses
  - `ConfigManager` class: Type-safe configuration loading and caching
  - `DownloadSettings`, `FolderStructure`, `LoggingConfig` dataclasses
  - Configuration validation and merging with defaults
  - Dynamic resolution updates
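
The two `file_utils.py` helpers that do the most work could look roughly like the sketch below. The function names come from the list above, but the signatures, the `max_length` and `min_size` parameters, and the exact rules are assumptions made for illustration. The header check relies on the fact that MP4 files normally begin with an `ftyp` box, whose type sits at bytes 4-8.

```python
import re
from pathlib import Path

# Characters that are not allowed in Windows filenames, plus control characters.
_INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def sanitize_filename(artist: str, title: str, max_length: int = 200) -> str:
    """Build a safe 'Artist - Title.mp4' filename (assumed naming scheme)."""
    name = _INVALID_CHARS.sub("", f"{artist} - {title}")
    name = " ".join(name.split())  # collapse runs of whitespace
    # Windows rejects names ending in a dot or space, so trim after truncating.
    name = name[:max_length].rstrip(". ")
    return name + ".mp4"


def is_valid_mp4_file(path: Path, min_size: int = 1024) -> bool:
    """Cheap sanity check: file exists, is not tiny, and starts with an ftyp box."""
    if not path.is_file() or path.stat().st_size < min_size:
        return False
    with path.open("rb") as handle:
        header = handle.read(12)
    return header[4:8] == b"ftyp"
```

A header check like this catches truncated or HTML-error downloads cheaply, but it does not prove the video is playable.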
### **Benefits Achieved**
- **Eliminated Code Duplication**: ~150 lines of duplicate code removed across modules
- **Centralized File Operations**: Single source of truth for filename handling and file validation
- **Unified Song Validation**: Consistent logic for checking if songs should be downloaded
- **Enhanced Type Safety**: Comprehensive type hints across all new modules
- **Improved Configuration Management**: Structured configuration with validation and caching
- **Better Error Handling**: Consistent patterns via centralized utilities
- **Enhanced Maintainability**: Changes to file operations or song validation only require updates in one place
- **Improved Testability**: Modular components can be tested independently
- **Better Developer Experience**: Clear function signatures and comprehensive documentation
### **Previous Improvements (v3.2)**
- **Centralized yt-dlp Command Generation**: Standardized command building and execution across all download operations
- **Enhanced Error Handling**: Structured exception hierarchy with consistent error messages and formatting
- **Abstracted Download Pipeline**: Reusable download → verify → tag → track process for consistent processing
- **Download plan pre-scan:** Before downloading, the tool scans all channels for songlist matches, builds a download plan, and prints stats (matches, unmatched, per-channel breakdown). The plan is cached for 1 day and reused unless `--force-download-plan` is set.
- **Latest-per-channel plan:** Download the latest N videos from each channel, with a per-channel plan and robust resume. Each channel is removed from the plan as it completes. Plan cache is deleted when all channels are done.
- **Fast mode with early exit:** When a limit is set, the tool scans channels and songs in order, downloads immediately when a match is found, and stops as soon as the limit is reached with successful downloads. This provides much faster performance for small limits compared to the full pre-scan approach.
- **Deduplication across channels:** Tracks unique song keys (artist + normalized title) to ensure the same song is not downloaded from multiple channels, even if it appears in more than one channel's video list.
- **Fuzzy matching:** Uses string similarity algorithms to find approximate matches between songlist entries and video titles, tolerating minor differences, typos, or extra words like "Karaoke" or "Official Video".
- **Default channel file:** For songlist-only and latest-per-channel modes, if no `--file` is specified, automatically uses `data/channels.txt` as the default channel list, reducing the need to specify the file path repeatedly.
- **Robust interruption handling:** Progress is saved after each download, and files are checked for existence before downloading to prevent re-downloads if the process is interrupted.
- **Optimized scanning algorithm:** High-performance channel scanning with O(n×m) complexity, pre-processed song lookups using sets and dictionaries, and early termination for faster matching of large songlists and channels.
- **Enhanced cache management:** Improved channel cache key handling for better cache hit rates and reduced YouTube API calls.
- **Robust download plan execution:** Fixed index management in download plan execution to prevent errors during interrupted downloads.
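
Fast mode and cross-channel deduplication interact, so a combined sketch may help. It assumes each scanned video is a dict with `artist`, `title`, and `channel` keys and that the download step is a callable returning `True` on success. Those shapes, and the names `song_key` and `download_until_limit`, are hypothetical. The normalization is also a simplification that only handles ASCII text.

```python
import re
from typing import Callable, Dict, Iterable, List, Set

Video = Dict[str, str]  # assumed shape: {"artist": ..., "title": ..., "channel": ...}


def song_key(artist: str, title: str) -> str:
    """Build a dedup key from artist + normalized title."""

    def norm(text: str) -> str:
        # Drop bracketed suffixes such as "(Karaoke Version)", then punctuation.
        text = re.sub(r"\(.*?\)|\[.*?\]", " ", text.lower())
        return re.sub(r"[^a-z0-9]+", " ", text).strip()

    return f"{norm(artist)}|{norm(title)}"


def download_until_limit(
    videos: Iterable[Video],
    wanted: Set[str],
    limit: int,
    download: Callable[[Video], bool],
) -> List[Video]:
    """Scan in order, download matches immediately, stop once `limit` succeed."""
    completed: List[Video] = []
    seen: Set[str] = set()
    for video in videos:
        if len(completed) >= limit:
            break  # early exit: the limit is satisfied
        key = song_key(video["artist"], video["title"])
        if key not in wanted or key in seen:
            continue  # not on the songlist, or already downloaded elsewhere
        if download(video):
            seen.add(key)
            completed.append(video)
        # On failure the key stays unseen, so a later channel can still supply it.
    return completed
```

Only successful downloads are recorded in `seen`, which is what lets a failed download on one channel be retried from another channel later in the scan.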
### **New Parallel Download System (v3.4)**
- **Parallel downloader module:** `parallel_downloader.py` provides thread-safe concurrent download management
- **Configurable concurrency:** Use `--parallel --workers N` to enable parallel downloads with N workers (1-10)
- **Thread-safe operations:** All tracking, caching, and progress operations are thread-safe
- **Real-time progress tracking:** Shows active downloads, completion status, and overall progress
- **Automatic retry mechanism:** Failed downloads are automatically retried with reduced concurrency
- **Backward compatibility:** Sequential downloads remain the default when `--parallel` is not used
- **Performance improvements:** Significantly faster downloads for large batches (3-5x speedup with 3-5 workers)
- **Integrated with all modes:** Works with both songlist-across-channels and latest-per-channel download modes
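
A worker pool with a lock around shared state is one common way to get thread-safe concurrent downloads. The sketch below uses `concurrent.futures.ThreadPoolExecutor` and is not taken from `parallel_downloader.py`. The `download` callable and the `download_in_parallel` name are hypothetical, and the automatic retry step is omitted for brevity.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def download_in_parallel(
    items: Iterable[str],
    download: Callable[[str], bool],
    workers: int = 3,
) -> Dict[str, str]:
    """Run `download` for each item on a thread pool and record the outcome."""
    if not 1 <= workers <= 10:
        raise ValueError("workers must be between 1 and 10")

    status: Dict[str, str] = {}
    lock = threading.Lock()  # guards the shared status dict

    def run(item: str) -> None:
        try:
            ok = download(item)
        except Exception:
            ok = False  # treat any error as a failed download
        with lock:
            status[item] = "downloaded" if ok else "failed"

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run, item) for item in items]
        for future in as_completed(futures):
            future.result()  # surface unexpected errors from the worker itself
    return status
```

Threads suit this workload because each download mostly waits on the network or on an external yt-dlp process, so the interpreter lock is not the bottleneck.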
### **Cross-Platform Support (v3.5)**
- **Platform detection:** Automatic detection of Windows, macOS, and Linux systems
- **Flexible yt-dlp integration:** Supports both binary files and pip-installed yt-dlp modules
- **Platform-specific configuration:** Automatic selection of appropriate yt-dlp binary/command for each platform
- **Setup automation:** `setup_platform.py` script for easy platform-specific setup
- **Command parsing:** Intelligent parsing of yt-dlp commands (file paths vs. module commands)
- **Enhanced documentation:** Platform-specific setup instructions and troubleshooting
- **Backward compatibility:** Maintains full compatibility with existing Windows installations
- **FFmpeg integration:** Automatic FFmpeg installation and configuration for optimal video processing
- **Optimized caching:** Enhanced channel video caching with format compatibility and instant video list loading
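
Platform detection and binary selection can be sketched as follows. The binary names come from the folder structure in this document, and `python -m yt_dlp` is the usual way to invoke a pip-installed yt-dlp. The function name `resolve_ytdlp_command` and the fallback order are assumptions, not the project's actual logic.

```python
import platform
import sys
from pathlib import Path
from typing import List, Optional

# Binary names as listed under downloader/ in the folder structure.
YTDLP_BINARIES = {
    "Windows": "yt-dlp.exe",
    "Darwin": "yt-dlp_macos",
    "Linux": "yt-dlp",
}


def resolve_ytdlp_command(
    system: Optional[str] = None,
    base_dir: Path = Path("downloader"),
) -> List[str]:
    """Return the argv prefix for yt-dlp on this platform.

    Prefers the bundled binary when it exists and otherwise falls back to the
    pip-installed module (assumed fallback order).
    """
    system = system or platform.system()
    binary_name = YTDLP_BINARIES.get(system)
    if binary_name is not None:
        candidate = base_dir / binary_name
        if candidate.is_file():
            return [str(candidate)]
    return [sys.executable, "-m", "yt_dlp"]
```

Returning a list rather than a string keeps the result directly usable with `subprocess.run` and avoids shell quoting issues with paths that contain spaces.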
---
## 🚀 Future Enhancements
- [ ] Web UI for easier management
- [ ] More advanced song matching (multi-language)
- [ ] Download scheduling and retry logic
- [ ] More granular status reporting
- [x] **Parallel downloads for improved speed** ✅ **COMPLETED**
- [x] **Cross-platform support (Windows, macOS, Linux)** ✅ **COMPLETED**
- [ ] Unit tests for all modules
- [ ] Integration tests for end-to-end workflows
- [ ] Plugin system for custom file operations
- [ ] Advanced configuration UI
- [ ] Real-time download progress visualization