KaraokeVideoDownloader/README.md

260 lines
12 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# 🎤 Karaoke Video Downloader
A Python-based Windows CLI tool to download karaoke videos from YouTube channels/playlists using `yt-dlp.exe`, with advanced tracking, songlist prioritization, and flexible configuration.
## ✨ Features
- 🎵 **Channel & Playlist Downloads**: Download all videos from a YouTube channel or playlist
- 📂 **Organized Storage**: Each channel gets its own folder in `downloads/`
- 📝 **Robust Tracking**: Tracks all downloads, statuses, and formats in JSON
- 🏆 **Songlist Prioritization**: Prioritize or restrict downloads to a custom songlist
- 🔄 **Batch Saving & Caching**: Efficient, minimizes API calls
- 🏷️ **ID3 Tagging**: Adds artist/title metadata to MP4 files
- 🧹 **Automatic Cleanup**: Removes extra yt-dlp files
- 📈 **Real-Time Progress**: Detailed console and log output
- 🧹 **Reset/Clear Channel**: Reset all tracking and files for a channel, or clear channel cache via CLI
- 🗂️ **Latest-per-channel download**: Download the latest N videos from each channel in a single batch, with server deduplication, fuzzy matching support, per-channel download plan, robust resume, and unique plan cache. Use --latest-per-channel and --limit N.
- 🧩 **Fuzzy Matching**: Optionally use fuzzy string matching for songlist-to-video matching (with --fuzzy-match, requires rapidfuzz for best results)
-**Fast Mode with Early Exit**: When a limit is set, scans channels and songs in order, downloads immediately when a match is found, and stops as soon as the limit is reached with successful downloads
- 🔄 **Deduplication Across Channels**: Ensures the same song is not downloaded from multiple channels, even if it appears in more than one channel's video list
- 📋 **Default Channel File**: Automatically uses data/channels.txt as the default channel list for songlist modes (no need to specify --file every time)
- 🛡️ **Robust Interruption Handling**: Progress is saved after each download, preventing re-downloads if the process is interrupted
-**Optimized Scanning**: High-performance channel scanning with O(n×m) complexity, pre-processed lookups, and early termination for faster matching
- 🏷️ **Server Duplicates Tracking**: Automatically checks against local songs.json file and marks duplicates for future skipping, preventing re-downloads of songs already on the server
## 🏗️ Architecture
The codebase has been comprehensively refactored into a modular architecture with centralized utilities for improved maintainability, error handling, and code reuse:
### Core Modules:
- **`downloader.py`**: Main orchestrator and CLI interface
- **`video_downloader.py`**: Core video download execution and orchestration
- **`tracking_manager.py`**: Download tracking and status management
- **`download_planner.py`**: Download plan building and channel scanning
- **`cache_manager.py`**: Cache operations and file I/O management
- **`channel_manager.py`**: Channel and file management operations
- **`songlist_manager.py`**: Songlist operations and tracking
- **`server_manager.py`**: Server song availability checking
- **`fuzzy_matcher.py`**: Fuzzy matching logic and similarity functions
### Utility Modules:
- **`youtube_utils.py`**: Centralized YouTube operations and yt-dlp command generation
- **`error_utils.py`**: Standardized error handling and formatting
- **`download_pipeline.py`**: Abstracted download → verify → tag → track pipeline
- **`id3_utils.py`**: ID3 tagging utilities
- **`config_manager.py`**: Configuration management
- **`resolution_cli.py`**: Resolution checking utilities
- **`tracking_cli.py`**: Tracking management CLI
### Benefits:
- **Centralized Utilities**: Common operations (yt-dlp commands, error handling) are centralized
- **Reduced Duplication**: Eliminated code duplication across modules
- **Consistency**: Standardized error messages and processing pipelines
- **Maintainability**: Changes isolated to specific modules
- **Testability**: Modular components can be tested independently
## 📋 Requirements
- **Windows 10/11**
- **Python 3.7+**
- **yt-dlp.exe** (in `downloader/`)
- **mutagen** (for ID3 tagging, optional)
- **ffmpeg/ffprobe** (for video validation, optional but recommended)
- **rapidfuzz** (for fuzzy matching, optional, falls back to difflib)
## 🚀 Quick Start
### Download a Channel
```bash
python download_karaoke.py https://www.youtube.com/@SingKingKaraoke/videos
```
### Download Only Songlist Songs (Fast Mode)
```bash
python download_karaoke.py --songlist-only --limit 5
```
### Download with Fuzzy Matching
```bash
python download_karaoke.py --songlist-only --limit 10 --fuzzy-match --fuzzy-threshold 85
```
### Download Latest N Videos Per Channel
```bash
python download_karaoke.py --latest-per-channel --limit 5
```
### Download Latest N Videos Per Channel (with fuzzy matching)
```bash
python download_karaoke.py --latest-per-channel --limit 5 --fuzzy-match --fuzzy-threshold 85
```
### Prioritize Songlist in Download Queue
```bash
python download_karaoke.py --songlist-priority
```
### Show Songlist Download Progress
```bash
python download_karaoke.py --songlist-status
```
### Limit Number of Downloads
```bash
python download_karaoke.py --limit 5
```
### Override Resolution
```bash
python download_karaoke.py --resolution 1080p
```
### **Reset/Start Over for a Channel**
```bash
python download_karaoke.py --reset-channel SingKingKaraoke
```
### **Reset Channel and Songlist Songs**
```bash
python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist
```
### **Clear Channel Cache**
```bash
python download_karaoke.py --clear-cache SingKingKaraoke
python download_karaoke.py --clear-cache all
```
## 🧠 Songlist Integration
- Place your prioritized song list in `data/songList.json` (see example format below).
- The tool will match and prioritize these songs across all available channel videos.
- Use `--songlist-only` to download only these songs, or `--songlist-priority` to prioritize them in the queue.
- Download progress for the songlist is tracked globally in `data/songlist_tracking.json`.
#### Example `data/songList.json`
```json
[
{ "artist": "Taylor Swift", "title": "Cruel Summer" },
{ "artist": "Billie Eilish", "title": "Happier Than Ever" }
]
```
## 🛠️ Tracking & Caching
- **data/karaoke_tracking.json**: Tracks all downloads, statuses, and formats
- **data/songlist_tracking.json**: Tracks global songlist download progress
- **data/server_duplicates_tracking.json**: Tracks songs found to be duplicates on the server for future skipping
- **data/channel_cache.json**: Caches channel video lists for performance
## 📂 Folder Structure
```
KaroakeVideoDownloader/
├── karaoke_downloader/ # All core Python code and utilities
│ ├── downloader.py # Main orchestrator and CLI interface
│ ├── cli.py # CLI entry point
│ ├── video_downloader.py # Core video download execution and orchestration
│ ├── tracking_manager.py # Download tracking and status management
│ ├── download_planner.py # Download plan building and channel scanning
│ ├── cache_manager.py # Cache operations and file I/O management
│ ├── channel_manager.py # Channel and file management operations
│ ├── songlist_manager.py # Songlist operations and tracking
│ ├── server_manager.py # Server song availability checking
│ ├── fuzzy_matcher.py # Fuzzy matching logic and similarity functions
│ ├── youtube_utils.py # Centralized YouTube operations and yt-dlp commands
│ ├── error_utils.py # Standardized error handling and formatting
│ ├── download_pipeline.py # Abstracted download → verify → tag → track pipeline
│ ├── id3_utils.py # ID3 tagging utilities
│ ├── config_manager.py # Configuration management
│ ├── check_resolution.py # Resolution checker utility
│ ├── resolution_cli.py # Resolution config CLI
│ └── tracking_cli.py # Tracking management CLI
├── data/ # All config, tracking, cache, and songlist files
│ ├── config.json
│ ├── karaoke_tracking.json
│ ├── songlist_tracking.json
│ ├── channel_cache.json
│ ├── channels.txt
│ └── songList.json
├── downloads/ # All video output
│ └── [ChannelName]/ # Per-channel folders
├── logs/ # Download logs
├── downloader/yt-dlp.exe # yt-dlp binary
├── tests/ # Diagnostic and test scripts
│ └── test_installation.py
├── download_karaoke.py # Main entry point (thin wrapper)
├── README.md
├── PRD.md
├── requirements.txt
└── download_karaoke.bat # (optional Windows launcher)
```
## 🚦 CLI Options
- `--file <data/channels.txt>`: Download from a list of channels (optional, defaults to data/channels.txt for songlist modes)
- `--songlist-priority`: Prioritize songlist songs in download queue
- `--songlist-only`: Download only songs from the songlist
- `--songlist-status`: Show songlist download progress
- `--limit <N>`: Limit number of downloads (enables fast mode with early exit)
- `--resolution <720p|1080p|...>`: Override resolution
- `--status`: Show download/tracking status
- `--reset-channel <CHANNEL_NAME>`: **Reset all tracking and files for a channel**
- `--reset-songlist`: **When used with --reset-channel, also reset songlist songs for this channel**
- `--clear-cache <CHANNEL_ID|all>`: **Clear channel video cache for a specific channel or all**
- `--clear-server-duplicates`: **Clear server duplicates tracking (allows re-checking songs against server)**
- `--latest-per-channel`: **Download the latest N videos from each channel (use with --limit)**
- `--fuzzy-match`: Enable fuzzy matching for songlist-to-video matching (uses rapidfuzz if available)
- `--fuzzy-threshold <N>`: Fuzzy match threshold (0-100, default 85)
## 📝 Example Usage
```bash
# Fast mode with fuzzy matching (no need to specify --file)
python download_karaoke.py --songlist-only --limit 10 --fuzzy-match --fuzzy-threshold 85
# Latest videos per channel
python download_karaoke.py --latest-per-channel --limit 5
# Traditional full scan (no limit)
python download_karaoke.py --songlist-only
# Channel-specific operations
python download_karaoke.py --reset-channel SingKingKaraoke
python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist
python download_karaoke.py --clear-cache all
python download_karaoke.py --clear-server-duplicates
```
## 🏷️ ID3 Tagging
- Adds artist/title/album/genre to MP4 files using mutagen (if installed)
## 🧹 Cleanup
- Removes `.info.json` and `.meta` files after download
## 🛠️ Configuration
- All options are in `data/config.json` (format, resolution, metadata, etc.)
- You can edit this file or use CLI flags to override
## 🔧 Refactoring Improvements (v3.2)
The codebase has been comprehensively refactored to improve maintainability and reduce code duplication:
### **Key Improvements**
- **Centralized yt-dlp Command Generation**: Standardized command building and execution across all download operations
- **Enhanced Error Handling**: Structured exception hierarchy with consistent error messages and formatting
- **Abstracted Download Pipeline**: Reusable download → verify → tag → track process for consistent processing
- **Reduced Code Duplication**: Eliminated duplicate code across modules through centralized utilities
### **New Utility Modules**
- **`youtube_utils.py`**: Centralized YouTube operations and yt-dlp command generation
- **`error_utils.py`**: Standardized error handling with structured exception hierarchy
- **`download_pipeline.py`**: Abstracted download pipeline for consistent processing
### **Benefits**
- **Improved Maintainability**: Changes to yt-dlp configuration only require updates in one place
- **Better Error Handling**: Consistent error messages and better debugging context
- **Enhanced Testability**: Modular components can be tested independently
- **Reduced Complexity**: Single source of truth for common operations
## 🐞 Troubleshooting
- Ensure `yt-dlp.exe` is in the `downloader/` folder
- Check `logs/` for error details
- Use `python -m karaoke_downloader.check_resolution` to verify video quality
- If you see errors about ffmpeg/ffprobe, install [ffmpeg](https://ffmpeg.org/download.html) and ensure it is in your PATH
- For best fuzzy matching, install rapidfuzz: `pip install rapidfuzz` (otherwise falls back to slower, less accurate difflib)
---
**Happy Karaoke! 🎤**