# 🎤 Karaoke Video Downloader PRD (v3.2)
## ✅ Overview
A Python-based Windows CLI tool to download karaoke videos from YouTube channels/playlists using `yt-dlp.exe`, with advanced tracking, songlist prioritization, and flexible configuration. The codebase has been comprehensively refactored into a modular architecture with centralized utilities for improved maintainability, error handling, and code reuse.
---
## 🏗️ Architecture
The codebase has been refactored into focused modules with centralized utilities:
### Core Modules:
- **`downloader.py`**: Main orchestrator and CLI interface
- **`video_downloader.py`**: Core video download execution and orchestration
- **`tracking_manager.py`**: Download tracking and status management
- **`download_planner.py`**: Download plan building and channel scanning
- **`cache_manager.py`**: Cache operations and file I/O management
- **`channel_manager.py`**: Channel and file management operations
- **`songlist_manager.py`**: Songlist operations and tracking
- **`server_manager.py`**: Server song availability checking
- **`fuzzy_matcher.py`**: Fuzzy matching logic and similarity functions
### New Utility Modules (v3.2):
- **`youtube_utils.py`**: Centralized YouTube operations and yt-dlp command generation
- **`error_utils.py`**: Standardized error handling and formatting
- **`download_pipeline.py`**: Abstracted download → verify → tag → track pipeline
- **`id3_utils.py`**: ID3 tagging utilities
- **`config_manager.py`**: Configuration management
- **`resolution_cli.py`**: Resolution checking utilities
- **`tracking_cli.py`**: Tracking management CLI
### Benefits of Enhanced Modular Architecture:
- **Single Responsibility**: Each module has a focused purpose
- **Centralized Utilities**: Common operations (yt-dlp commands, error handling) are centralized
- **Reduced Duplication**: Eliminated code duplication across modules
- **Testability**: Individual components can be tested separately
- **Maintainability**: Easier to find and fix issues
- **Reusability**: Components can be used independently
- **Robustness**: Better error handling and interruption recovery
- **Consistency**: Standardized error messages and processing pipelines
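
To make the download → verify → tag → track flow concrete, here is a minimal sketch. It is not the code in `download_pipeline.py`: the class names `PipelineError`, `VerificationError`, and `DownloadPipeline`, and the injected `download` and `tag` callables, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict


class PipelineError(Exception):
    """Hypothetical base class for pipeline failures."""


class VerificationError(PipelineError):
    """Raised when the downloaded file is missing or empty."""


@dataclass
class DownloadPipeline:
    # Steps are injected so the sketch does not depend on yt-dlp or mutagen.
    download: Callable[[str, Path], None]
    tag: Callable[[Path, str, str], None]
    tracked: Dict[str, str] = field(default_factory=dict)

    def run(self, video_id: str, artist: str, title: str, target: Path) -> bool:
        """Run download -> verify -> tag, then record the outcome (track)."""
        try:
            self.download(video_id, target)
            # Verify: the file must exist and contain data.
            if not target.exists() or target.stat().st_size == 0:
                raise VerificationError(f"{target} is missing or empty")
            self.tag(target, artist, title)
        except PipelineError as exc:
            self.tracked[video_id] = f"failed: {exc}"
            return False
        self.tracked[video_id] = "downloaded"
        return True
```

Injecting the steps keeps each stage testable on its own, which is the kind of separation the benefits above describe.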
---
## 📋 Goals
- Download karaoke videos from YouTube channels or playlists.
- Organize downloads by channel (or playlist) in subfolders.
- Avoid re-downloading the same videos (robust tracking).
- Prioritize and track a custom songlist across channels.
- Allow flexible, user-friendly configuration.
- Provide robust interruption handling and progress recovery.
---
## 🧑‍💻 Target Users
- Karaoke DJs, home karaoke users, event hosts, or anyone needing offline karaoke video libraries.
- Users comfortable with command-line tools.
---
## ⚙️ Platform & Stack
- **Platform:** Windows
- **Interface:** Command-line (CLI)
- **Tech Stack:** Python 3.7+, yt-dlp.exe, mutagen (for ID3 tagging)
---
## 📥 Input
- YouTube channel or playlist URLs (e.g. `https://www.youtube.com/@SingKingKaraoke/videos`)
- Optional: `data/channels.txt` file with multiple channel URLs (one per line); used by default in songlist-only and latest-per-channel modes when `--file` is not given
- Optional: `data/songList.json` for prioritized song downloads
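
As an illustration of the one-URL-per-line channel file, the sketch below reads it into a list. The function name `load_channel_urls` is hypothetical, and skipping blank lines and `#` comments is an assumption rather than documented behavior.

```python
from pathlib import Path
from typing import List, Optional

DEFAULT_CHANNEL_FILE = Path("data/channels.txt")  # default path named in this document


def load_channel_urls(path: Optional[Path] = None) -> List[str]:
    """Return channel URLs from a text file, one per line.

    Blank lines and lines starting with '#' are skipped (an assumption of this sketch).
    """
    source = path or DEFAULT_CHANNEL_FILE
    urls = []
    for line in source.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls
```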
### Example Usage
```bash
python download_karaoke.py https://www.youtube.com/@SingKingKaraoke/videos
python download_karaoke.py --songlist-only --limit 5
python download_karaoke.py --latest-per-channel --limit 3
python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist
python download_karaoke.py --clear-cache SingKingKaraoke
```
---
## 📤 Output
- MP4 files in `downloads/<ChannelName>/` subfolders
- All videos tracked in `data/karaoke_tracking.json`
- Songlist progress tracked in `data/songlist_tracking.json`
- Logs in `logs/`
---
## 🛠️ Features
- ✅ Channel-based downloads (with per-channel folders)
- ✅ Robust JSON tracking (downloaded, partial, failed, etc.)
- ✅ Batch saving and channel video caching for performance
- ✅ Configurable download resolution and yt-dlp options (`data/config.json`)
- ✅ Songlist integration: prioritize and track custom songlists
- ✅ Songlist-only mode: download only songs from the songlist
- ✅ Global songlist tracking to avoid duplicates across channels
- ✅ ID3 tagging for artist/title in MP4 files (mutagen)
- ✅ Real-time progress and detailed logging
- ✅ Automatic cleanup of extra yt-dlp files
- **Reset/clear channel tracking and files via CLI**
- **Clear channel cache via CLI**
- **Download plan pre-scan and caching**: Before downloading, the tool pre-scans all channels for songlist matches, builds a download plan, and prints stats. The plan is cached for 1 day in `data/download_plan_cache.json` for fast resuming and reliability. Use `--force-download-plan` to force a refresh.
- **Latest-per-channel download**: Download the latest N videos from each channel in a single batch, with a per-channel download plan, robust resume, and unique plan cache. Use `--latest-per-channel` and `--limit N`.
- **Fast mode with early exit**: When a limit is set, the tool scans channels and songs in order, downloads immediately when a match is found, and stops as soon as the limit is reached with successful downloads. If a download fails, it continues scanning until the limit is satisfied or all channels are exhausted.
- **Deduplication across channels**: Ensures the same song (by artist + normalized title) is not downloaded more than once, even if it appears in multiple channels. Tracks unique keys and skips duplicates.
- **Fuzzy matching**: Optionally use fuzzy string matching for songlist-to-video matching with configurable threshold (0-100, default 85). Uses rapidfuzz if available, falls back to difflib.
- **Default channel file**: If no `--file` is specified for songlist-only or latest-per-channel modes, automatically uses `data/channels.txt` as the default channel list.
- **Robust interruption handling**: Progress is saved after each download, and files are checked for existence before downloading to prevent re-downloads if the process is interrupted.
- **Optimized scanning performance**: High-performance channel scanning with O(n×m) complexity, pre-processed lookups, and early termination for faster matching of large songlists and channels.
- **Centralized yt-dlp command generation**: Standardized command building and execution across all download operations
- **Enhanced error handling**: Structured exception hierarchy with consistent error messages and formatting
- **Abstracted download pipeline**: Reusable download → verify → tag → track process for consistent processing
- **Reduced code duplication**: Eliminated duplicate code across modules through centralized utilities
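
The fuzzy-matching fallback described above could look roughly like the following. This is a sketch, not the contents of `fuzzy_matcher.py`: the function names `similarity` and `is_fuzzy_match` are hypothetical, and the choice of `fuzz.ratio` as the scorer is an assumption. Both branches return a score on the same 0-100 scale so the documented default threshold of 85 applies either way.

```python
from difflib import SequenceMatcher

try:
    from rapidfuzz import fuzz  # optional dependency
except ImportError:
    fuzz = None  # fall back to the standard library


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score for two strings, ignoring case."""
    a, b = a.lower().strip(), b.lower().strip()
    if fuzz is not None:
        return float(fuzz.ratio(a, b))
    # difflib returns 0.0-1.0, so scale it to match rapidfuzz.
    return SequenceMatcher(None, a, b).ratio() * 100


def is_fuzzy_match(song: str, video_title: str, threshold: float = 85) -> bool:
    """True when the two strings are at least `threshold` percent similar."""
    return similarity(song, video_title) >= threshold
```

For example, `is_fuzzy_match("Adele - Helo", "Adele - Hello")` passes at the default threshold, while two unrelated titles do not.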
---
## 📂 Folder Structure
```
KaraokeVideoDownloader/
├── karaoke_downloader/ # All core Python code and utilities
│ ├── downloader.py # Main orchestrator and CLI interface
│ ├── cli.py # CLI entry point
│ ├── video_downloader.py # Core video download execution and orchestration
│ ├── tracking_manager.py # Download tracking and status management
│ ├── download_planner.py # Download plan building and channel scanning
│ ├── cache_manager.py # Cache operations and file I/O management
│ ├── channel_manager.py # Channel and file management operations
│ ├── songlist_manager.py # Songlist operations and tracking
│ ├── server_manager.py # Server song availability checking
│ ├── fuzzy_matcher.py # Fuzzy matching logic and similarity functions
│ ├── youtube_utils.py # Centralized YouTube operations and yt-dlp commands
│ ├── error_utils.py # Standardized error handling and formatting
│ ├── download_pipeline.py # Abstracted download → verify → tag → track pipeline
│ ├── id3_utils.py # ID3 tagging utilities
│ ├── config_manager.py # Configuration management
│ ├── check_resolution.py # Resolution checker utility
│ ├── resolution_cli.py # Resolution config CLI
│ └── tracking_cli.py # Tracking management CLI
├── data/ # All config, tracking, cache, and songlist files
│ ├── config.json
│ ├── karaoke_tracking.json
│ ├── songlist_tracking.json
│ ├── channel_cache.json
│ ├── channels.txt
│ └── songList.json
├── downloads/ # All video output
│ └── [ChannelName]/ # Per-channel folders
├── logs/ # Download logs
├── downloader/yt-dlp.exe # yt-dlp binary
├── tests/ # Diagnostic and test scripts
│ └── test_installation.py
├── download_karaoke.py # Main entry point (thin wrapper)
├── README.md
├── PRD.md
├── requirements.txt
└── download_karaoke.bat # (optional Windows launcher)
```
---
## 🚦 CLI Options (Summary)
- `--file <data/channels.txt>`: Download from a list of channels (optional, defaults to data/channels.txt for songlist modes)
- `--songlist-priority`: Prioritize songlist songs in download queue
- `--songlist-only`: Download only songs from the songlist
- `--songlist-status`: Show songlist download progress
- `--limit <N>`: Limit number of downloads (enables fast mode with early exit)
- `--resolution <720p|1080p|...>`: Override resolution
- `--status`: Show download/tracking status
- `--reset-channel <CHANNEL_NAME>`: **Reset all tracking and files for a channel**
- `--reset-songlist`: **When used with --reset-channel, also reset songlist songs for this channel**
- `--clear-cache <CHANNEL_ID|all>`: **Clear channel video cache for a specific channel or all**
- `--force-download-plan`: **Force refresh the download plan cache (re-scan all channels for matches)**
- `--latest-per-channel`: **Download the latest N videos from each channel (use with --limit)**
- `--fuzzy-match`: **Enable fuzzy matching for songlist-to-video matching (uses rapidfuzz if available)**
- `--fuzzy-threshold <N>`: **Fuzzy match threshold (0-100, default 85)**
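
For orientation, the options above map naturally onto `argparse`. The sketch below is an assumption about how such a parser might be declared, not the project's actual `cli.py`; the helper `resolve_channel_file` is hypothetical and only illustrates the documented default of `data/channels.txt` for the songlist-only and latest-per-channel modes.

```python
import argparse
from typing import Optional

DEFAULT_CHANNEL_FILE = "data/channels.txt"


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags listed in the CLI options summary."""
    parser = argparse.ArgumentParser(prog="download_karaoke.py")
    parser.add_argument("url", nargs="?", help="YouTube channel or playlist URL")
    parser.add_argument("--file", help="Text file with one channel URL per line")
    parser.add_argument("--songlist-priority", action="store_true")
    parser.add_argument("--songlist-only", action="store_true")
    parser.add_argument("--songlist-status", action="store_true")
    parser.add_argument("--limit", type=int, help="Maximum number of downloads")
    parser.add_argument("--resolution", help="For example 720p or 1080p")
    parser.add_argument("--status", action="store_true")
    parser.add_argument("--reset-channel", metavar="CHANNEL_NAME")
    parser.add_argument("--reset-songlist", action="store_true")
    parser.add_argument("--clear-cache", metavar="CHANNEL_ID|all")
    parser.add_argument("--force-download-plan", action="store_true")
    parser.add_argument("--latest-per-channel", action="store_true")
    parser.add_argument("--fuzzy-match", action="store_true")
    parser.add_argument("--fuzzy-threshold", type=int, default=85)
    return parser


def resolve_channel_file(args: argparse.Namespace) -> Optional[str]:
    """Apply the default channel file for the two songlist-style modes."""
    if args.file:
        return args.file
    if args.songlist_only or args.latest_per_channel:
        return DEFAULT_CHANNEL_FILE
    return None
```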
---
## 🧠 Logic Highlights
- **Tracking:** All downloads, statuses, and formats are tracked in JSON files for reliability and deduplication.
- **Songlist:** Loads and normalizes `data/songList.json`, matches against available videos, and prioritizes or restricts downloads accordingly.
- **Batch/Caching:** Channel video lists are cached to minimize API calls; tracking is batch-saved for performance.
- **ID3 Tagging:** Artist/title extracted from video title and embedded in MP4 files.
- **Cleanup:** Extra files from yt-dlp (e.g., `.info.json`) are automatically removed after download.
- **Reset/Clear:** Use `--reset-channel` to reset all tracking and files for a channel (optionally including songlist songs with `--reset-songlist`). Use `--clear-cache` to clear cached video lists for a channel or all channels.
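
As a sketch of the artist/title extraction step, the helper below assumes video titles follow an `Artist - Title (Karaoke Version)` pattern. That naming convention, the suffix regex, and the function name `parse_artist_title` are all assumptions for illustration; the actual parsing rules are not specified here.

```python
import re
from typing import Optional, Tuple

# Trailing decorations such as "(Karaoke Version)" or "[Karaoke]".
# This pattern is an assumption, not the project's real rule set.
_SUFFIX = re.compile(r"\s*[\(\[][^\)\]]*karaoke[^\)\]]*[\)\]]\s*$", re.IGNORECASE)


def parse_artist_title(video_title: str) -> Optional[Tuple[str, str]]:
    """Split 'Artist - Title (Karaoke ...)' into (artist, title), or return None."""
    cleaned = _SUFFIX.sub("", video_title).strip()
    artist, sep, title = cleaned.partition(" - ")
    if not sep or not artist.strip() or not title.strip():
        return None  # title does not follow the assumed pattern
    return artist.strip(), title.strip()
```

Returning `None` for titles that do not fit lets the caller skip tagging instead of writing a wrong artist.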
---
## 🔧 Refactoring Improvements (v3.2)
The codebase has been comprehensively refactored to improve maintainability and reduce code duplication:
### **Centralized Utilities**
- **`youtube_utils.py`**: Centralized yt-dlp command generation and YouTube operations
- **`error_utils.py`**: Standardized error handling with structured exception hierarchy
- **`download_pipeline.py`**: Abstracted download pipeline for consistent processing
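
To show what centralizing yt-dlp command generation can mean in practice, here is a minimal sketch of a single builder function. The name `build_download_command`, the format selector, and the output template are assumptions, not the contents of `youtube_utils.py`; the flags used (`-f`, `--merge-output-format`, `-o`, `--no-playlist`) are standard yt-dlp options.

```python
from pathlib import Path
from typing import List

YT_DLP = Path("downloader/yt-dlp.exe")  # binary location listed in the folder structure


def build_download_command(url: str, output_dir: Path, resolution: str = "720p") -> List[str]:
    """Build one yt-dlp argument list so every caller uses the same options."""
    height = int(resolution.rstrip("p"))  # "720p" -> 720
    return [
        str(YT_DLP),
        "-f", f"bestvideo[height<={height}]+bestaudio/best[height<={height}]",
        "--merge-output-format", "mp4",
        "-o", str(output_dir / "%(title)s.%(ext)s"),
        "--no-playlist",
        url,
    ]
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems with URLs and titles.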
### **Benefits Achieved**
- **Reduced Duplication**: Eliminated ~50 lines of duplicated yt-dlp command generation
- **Improved Maintainability**: Changes to yt-dlp configuration only require updates in one place
- **Enhanced Error Handling**: Consistent error messages and better debugging context
- **Better Code Organization**: Clear separation of concerns and logical module structure
- **Increased Testability**: Modular components can be tested independently
- **Download plan pre-scan:** Before downloading, the tool scans all channels for songlist matches, builds a download plan, and prints stats (matches, unmatched, per-channel breakdown). The plan is cached for 1 day and reused unless `--force-download-plan` is set.
- **Latest-per-channel plan:** Download the latest N videos from each channel, with a per-channel plan and robust resume. Each channel is removed from the plan as it completes. Plan cache is deleted when all channels are done.
- **Fast mode with early exit:** When a limit is set, the tool scans channels and songs in order, downloads immediately when a match is found, and stops as soon as the limit is reached with successful downloads. This provides much faster performance for small limits compared to the full pre-scan approach.
- **Deduplication across channels:** Tracks unique song keys (artist + normalized title) to ensure the same song is not downloaded from multiple channels, even if it appears in more than one channel's video list.
- **Fuzzy matching:** Uses string similarity algorithms to find approximate matches between songlist entries and video titles, tolerating minor differences, typos, or extra words like "Karaoke" or "Official Video".
- **Default channel file:** For songlist-only and latest-per-channel modes, if no `--file` is specified, automatically uses `data/channels.txt` as the default channel list, reducing the need to specify the file path repeatedly.
- **Robust interruption handling:** Progress is saved after each download, and files are checked for existence before downloading to prevent re-downloads if the process is interrupted.
- **Optimized scanning algorithm:** High-performance channel scanning with O(n×m) complexity, pre-processed song lookups using sets and dictionaries, and early termination for faster matching of large songlists and channels.
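
The fast mode, deduplication, and set-based lookup described in these bullets can be combined in one loop. The sketch below is an assumption about the general shape, not the project's implementation: `song_key`, `fast_scan`, the normalization rule, and the `download` callback signature are all hypothetical.

```python
import re
from typing import Callable, Dict, List, Set, Tuple

Song = Tuple[str, str]  # (artist, title)


def song_key(artist: str, title: str) -> str:
    """Normalized 'artist|title' key; the normalization rule is an assumption."""
    def norm(text: str) -> str:
        return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return f"{norm(artist)}|{norm(title)}"


def fast_scan(
    channels: Dict[str, List[Song]],
    wanted: List[Song],
    download: Callable[[str, str, str], bool],
    limit: int,
) -> List[str]:
    """Scan channels in order, download matches immediately, stop at `limit` successes."""
    wanted_keys: Set[str] = {song_key(a, t) for a, t in wanted}  # pre-processed lookup
    done: Set[str] = set()
    downloaded: List[str] = []
    for channel, videos in channels.items():
        for artist, title in videos:
            key = song_key(artist, title)
            if key not in wanted_keys or key in done:
                continue  # not on the songlist, or already downloaded elsewhere
            if download(channel, artist, title):
                done.add(key)
                downloaded.append(key)
                if len(downloaded) >= limit:
                    return downloaded  # early exit
            # A failed download leaves the key eligible in later channels.
    return downloaded
```

Because a key is only marked done after a successful download, a song that fails on one channel can still be picked up from another.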
---
## 🚀 Future Enhancements
- [ ] Web UI for easier management
- [ ] More advanced song matching (multi-language)
- [ ] Download scheduling and retry logic
- [ ] More granular status reporting
- [ ] Parallel downloads for improved speed
- [ ] Unit tests for all modules
- [ ] Integration tests for end-to-end workflows