🎤 Karaoke Video Downloader – PRD (v3.4.3)
✅ Overview
A Python-based Windows CLI tool to download karaoke videos from YouTube channels/playlists using yt-dlp.exe, with advanced tracking, songlist prioritization, and flexible configuration. The codebase has been comprehensively refactored into a modular architecture with centralized utilities for improved maintainability, error handling, and code reuse.
🏗️ Architecture
The codebase has been refactored into focused modules with centralized utilities:
Core Modules:
- downloader.py: Main orchestrator and CLI interface
- video_downloader.py: Core video download execution and orchestration
- tracking_manager.py: Download tracking and status management
- download_planner.py: Download plan building and channel scanning
- cache_manager.py: Cache operations and file I/O management
- channel_manager.py: Channel and file management operations
- songlist_manager.py: Songlist operations and tracking
- server_manager.py: Server song availability checking
- fuzzy_matcher.py: Fuzzy matching logic and similarity functions
Utility Modules (v3.2):
- youtube_utils.py: Centralized YouTube operations and yt-dlp command generation
- error_utils.py: Standardized error handling and formatting
- download_pipeline.py: Abstracted download → verify → tag → track pipeline
- id3_utils.py: ID3 tagging utilities
- config_manager.py: Configuration management
- resolution_cli.py: Resolution checking utilities
- tracking_cli.py: Tracking management CLI
New Utility Modules (v3.3):
- file_utils.py: Centralized file operations, filename sanitization, and file validation
- song_validator.py: Centralized song validation logic for checking if songs should be downloaded
Benefits of Enhanced Modular Architecture:
- Single Responsibility: Each module has a focused purpose
- Centralized Utilities: Common operations (file operations, song validation, yt-dlp commands, error handling) are centralized
- Reduced Duplication: Eliminated ~150 lines of code duplication across modules
- Testability: Individual components can be tested separately
- Maintainability: Easier to find and fix issues
- Reusability: Components can be used independently
- Robustness: Better error handling and interruption recovery
- Consistency: Standardized error messages and processing pipelines
- Type Safety: Comprehensive type hints across all new modules
📋 Goals
- Download karaoke videos from YouTube channels or playlists.
- Organize downloads by channel (or playlist) in subfolders.
- Avoid re-downloading the same videos (robust tracking).
- Prioritize and track a custom songlist across channels.
- Allow flexible, user-friendly configuration.
- Provide robust interruption handling and progress recovery.
🧑💻 Target Users
- Karaoke DJs, home karaoke users, event hosts, or anyone needing offline karaoke video libraries.
- Users comfortable with command-line tools.
⚙️ Platform & Stack
- Platform: Windows
- Interface: Command-line (CLI)
- Tech Stack: Python 3.7+, yt-dlp.exe, mutagen (for ID3 tagging)
📥 Input
- YouTube channel or playlist URLs (e.g. https://www.youtube.com/@SingKingKaraoke/videos)
- Optional: data/channels.txt file with multiple channel URLs (one per line); the tool now defaults to this file if none is specified
- Optional: data/songList.json for prioritized song downloads
Example Usage
python download_karaoke.py https://www.youtube.com/@SingKingKaraoke/videos
python download_karaoke.py --songlist-only --limit 5
python download_karaoke.py --latest-per-channel --limit 3
python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist
python download_karaoke.py --clear-cache SingKingKaraoke
📤 Output
- MP4 files in downloads/<ChannelName>/ subfolders
- All videos tracked in data/karaoke_tracking.json
- Songlist progress tracked in data/songlist_tracking.json
- Logs in logs/
🛠️ Features
- ✅ Channel-based downloads (with per-channel folders)
- ✅ Robust JSON tracking (downloaded, partial, failed, etc.)
- ✅ Batch saving and channel video caching for performance
- ✅ Configurable download resolution and yt-dlp options (data/config.json)
- ✅ Songlist integration: prioritize and track custom songlists
- ✅ Songlist-only mode: download only songs from the songlist
- ✅ Songlist focus mode: download only songs from specific playlists by title
- ✅ Force download mode: bypass all existing file checks and re-download songs regardless of server duplicates or existing files
- ✅ Global songlist tracking to avoid duplicates across channels
- ✅ ID3 tagging for artist/title in MP4 files (mutagen)
- ✅ Real-time progress and detailed logging
- ✅ Automatic cleanup of extra yt-dlp files
- ✅ Reset/clear channel tracking and files via CLI
- ✅ Clear channel cache via CLI
- ✅ Download plan pre-scan and caching: Before downloading, the tool pre-scans all channels for songlist matches, builds a download plan, and prints stats. The plan is cached for 1 day in data/download_plan_cache.json for fast resuming and reliability. Use --force-download-plan to force a refresh.
- ✅ Latest-per-channel download: Download the latest N videos from each channel in a single batch, with a per-channel download plan, robust resume, and unique plan cache. Use --latest-per-channel and --limit N.
- ✅ Fast mode with early exit: When a limit is set, the tool scans channels and songs in order, downloads immediately when a match is found, and stops as soon as the limit is reached with successful downloads. If a download fails, it continues scanning until the limit is satisfied or all channels are exhausted.
- ✅ Deduplication across channels: Ensures the same song (by artist + normalized title) is not downloaded more than once, even if it appears in multiple channels. Tracks unique keys and skips duplicates.
- ✅ Fuzzy matching: Optionally use fuzzy string matching for songlist-to-video matching with configurable threshold (0-100, default 85). Uses rapidfuzz if available, falls back to difflib.
- ✅ Default channel file: If no --file is specified for songlist-only or latest-per-channel modes, automatically uses data/channels.txt as the default channel list.
- ✅ Robust interruption handling: Progress is saved after each download, and files are checked for existence before downloading to prevent re-downloads if the process is interrupted.
- ✅ Optimized scanning performance: High-performance channel scanning with O(n×m) complexity, pre-processed lookups, and early termination for faster matching of large songlists and channels.
- ✅ Centralized yt-dlp command generation: Standardized command building and execution across all download operations
- ✅ Enhanced error handling: Structured exception hierarchy with consistent error messages and formatting
- ✅ Abstracted download pipeline: Reusable download → verify → tag → track process for consistent processing
- ✅ Reduced code duplication: Eliminated duplicate code across modules through centralized utilities
- ✅ Centralized file operations: Single source of truth for filename sanitization, file validation, and path operations
- ✅ Centralized song validation: Unified logic for checking if songs should be downloaded across all modules
- ✅ Enhanced configuration management: Structured configuration with dataclasses, type safety, and validation
- ✅ Manual video collection: Static video collection system for managing individual karaoke videos that don't belong to regular channels. Use --manual to download from data/manual_videos.json.
- ✅ Channel-specific parsing rules: JSON-based configuration for parsing video titles from different YouTube channels, with support for various title formats and cleanup rules.
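The fuzzy matching feature above (threshold 0-100, default 85, rapidfuzz with a difflib fallback) can be sketched as follows. This is a minimal illustration, not the code in fuzzy_matcher.py: the function names similarity, normalize, and is_fuzzy_match are hypothetical, and the normalization step is an assumption.

```python
from difflib import SequenceMatcher

try:
    # rapidfuzz is optional; fuzz.ratio returns a score from 0 to 100.
    from rapidfuzz import fuzz

    def similarity(a: str, b: str) -> float:
        return float(fuzz.ratio(a, b))
except ImportError:
    def similarity(a: str, b: str) -> float:
        # difflib returns 0.0-1.0, so scale it to the same 0-100 range.
        return SequenceMatcher(None, a, b).ratio() * 100.0


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed, minimal normalization)."""
    return " ".join(text.lower().split())


def is_fuzzy_match(song: str, video_title: str, threshold: float = 85.0) -> bool:
    """Return True when the two strings are at least `threshold` percent similar."""
    return similarity(normalize(song), normalize(video_title)) >= threshold
```

The two backends do not produce identical scores for every pair of strings, so a threshold tuned with rapidfuzz installed may behave slightly differently on a machine that falls back to difflib.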
📂 Folder Structure
KaroakeVideoDownloader/
├── karaoke_downloader/ # All core Python code and utilities
│ ├── downloader.py # Main orchestrator and CLI interface
│ ├── cli.py # CLI entry point
│ ├── video_downloader.py # Core video download execution and orchestration
│ ├── tracking_manager.py # Download tracking and status management
│ ├── download_planner.py # Download plan building and channel scanning
│ ├── cache_manager.py # Cache operations and file I/O management
│ ├── channel_manager.py # Channel and file management operations
│ ├── songlist_manager.py # Songlist operations and tracking
│ ├── server_manager.py # Server song availability checking
│ ├── fuzzy_matcher.py # Fuzzy matching logic and similarity functions
│ ├── youtube_utils.py # Centralized YouTube operations and yt-dlp commands
│ ├── error_utils.py # Standardized error handling and formatting
│ ├── download_pipeline.py # Abstracted download → verify → tag → track pipeline
│ ├── id3_utils.py # ID3 tagging utilities
│ ├── config_manager.py # Configuration management with dataclasses
│ ├── file_utils.py # Centralized file operations and filename handling
│ ├── song_validator.py # Centralized song validation logic
│ ├── check_resolution.py # Resolution checker utility
│ ├── resolution_cli.py # Resolution config CLI
│ └── tracking_cli.py # Tracking management CLI
├── data/ # All config, tracking, cache, and songlist files
│ ├── config.json
│ ├── karaoke_tracking.json
│ ├── songlist_tracking.json
│ ├── channel_cache.json
│ ├── channels.json # Channel configuration with parsing rules
│ ├── channels.txt # Legacy channel list (backward compatibility)
│ ├── manual_videos.json # Manual video collection
│ └── songList.json
├── downloads/ # All video output
│ └── [ChannelName]/ # Per-channel folders
├── logs/ # Download logs
├── downloader/yt-dlp.exe # yt-dlp binary
├── tests/ # Diagnostic and test scripts
│ └── test_installation.py
├── download_karaoke.py # Main entry point (thin wrapper)
├── README.md
├── PRD.md
├── requirements.txt
└── download_karaoke.bat # (optional Windows launcher)
🚦 CLI Options (Summary)
- --file <data/channels.txt>: Download from a list of channels (optional, defaults to data/channels.txt for songlist modes)
- --songlist-priority: Prioritize songlist songs in download queue
- --songlist-only: Download only songs from the songlist
- --songlist-focus <PLAYLIST_TITLE1> <PLAYLIST_TITLE2>...: Focus on specific playlists by title (e.g., --songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100")
- --songlist-file <FILE_PATH>: Custom songlist file path to use with --songlist-focus (default: data/songList.json)
- --force: Force download from channels, bypassing all existing file checks and re-downloading if necessary
- --songlist-status: Show songlist download progress
- --limit <N>: Limit number of downloads (enables fast mode with early exit)
- --resolution <720p|1080p|...>: Override resolution
- --status: Show download/tracking status
- --reset-channel <CHANNEL_NAME>: Reset all tracking and files for a channel
- --reset-songlist: When used with --reset-channel, also reset songlist songs for this channel
- --clear-cache <CHANNEL_ID|all>: Clear channel video cache for a specific channel or all
- --force-download-plan: Force refresh the download plan cache (re-scan all channels for matches)
- --latest-per-channel: Download the latest N videos from each channel (use with --limit)
- --fuzzy-match: Enable fuzzy matching for songlist-to-video matching (uses rapidfuzz if available)
- --fuzzy-threshold <N>: Fuzzy match threshold (0-100, default 85)
- --parallel: Enable parallel downloads for improved speed
- --workers <N>: Number of parallel download workers (1-10, default: 3, only used with --parallel)
- --manual: Download from manual videos collection (data/manual_videos.json)
- --channel-focus <CHANNEL_NAME>: Download from a specific channel by name (e.g., 'SingKingKaraoke')
- --all-videos: Download all videos from channel (not just songlist matches), skipping existing files and songs in songs.json
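A subset of these options could be declared with argparse roughly as shown here. This is a sketch only: the flag names and defaults come from the list above, while build_parser and workers_type are hypothetical helpers and the real CLI may be wired differently.

```python
import argparse


def workers_type(value: str) -> int:
    """Accept only worker counts in the documented 1-10 range."""
    count = int(value)
    if not 1 <= count <= 10:
        raise argparse.ArgumentTypeError("workers must be between 1 and 10")
    return count


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="download_karaoke.py")
    parser.add_argument("url", nargs="?", help="YouTube channel or playlist URL")
    parser.add_argument("--file", default="data/channels.txt")
    parser.add_argument("--songlist-only", action="store_true")
    parser.add_argument("--limit", type=int, default=None)
    parser.add_argument("--fuzzy-match", action="store_true")
    parser.add_argument("--fuzzy-threshold", type=int, default=85)
    parser.add_argument("--parallel", action="store_true")
    parser.add_argument("--workers", type=workers_type, default=3)
    return parser
```

For example, build_parser().parse_args(["--songlist-only", "--limit", "5"]) yields a namespace with songlist_only set to True and limit set to 5, while --workers 11 is rejected by argparse with a usage error.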
🧠 Logic Highlights
- Tracking: All downloads, statuses, and formats are tracked in JSON files for reliability and deduplication.
- Songlist: Loads and normalizes data/songList.json, matches against available videos, and prioritizes or restricts downloads accordingly.
- Batch/Caching: Channel video lists are cached to minimize API calls; tracking is batch-saved for performance.
- ID3 Tagging: Artist/title extracted from video title and embedded in MP4 files.
- Cleanup: Extra files from yt-dlp (e.g., .info.json) are automatically removed after download.
- Reset/Clear: Use --reset-channel to reset all tracking and files for a channel (optionally including songlist songs with --reset-songlist). Use --clear-cache to clear cached video lists for a channel or all channels.
- Channel-Specific Parsing: Uses data/channels.json to define parsing rules for each YouTube channel, handling different video title formats (e.g., "Artist - Title", "Artist Title", "Title | Artist", etc.).
- Manual Video Collection: Static video management system using data/manual_videos.json for individual karaoke videos that don't belong to regular channels. Accessible via the --manual parameter.
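Deduplication by "artist + normalized title" can be illustrated with a short sketch. The exact normalization rules used by the tool are not stated in this document, so the ones below (lowercasing, dropping bracketed notes and punctuation) are assumptions, and normalize_title, song_key, and dedupe are hypothetical names.

```python
import re
from typing import Iterable, List, Set, Tuple


def normalize_title(title: str) -> str:
    """Lowercase, drop bracketed notes, strip punctuation, collapse spaces (assumed rules)."""
    text = title.lower()
    text = re.sub(r"[\(\[][^\)\]]*[\)\]]", " ", text)  # e.g. "(Karaoke Version)"
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def song_key(artist: str, title: str) -> str:
    """Build the unique key used to recognise the same song on different channels."""
    return f"{normalize_title(artist)}|{normalize_title(title)}"


def dedupe(candidates: Iterable[Tuple[str, str, str]]) -> List[Tuple[str, str, str]]:
    """Keep the first (channel, artist, title) seen for each unique song key."""
    seen: Set[str] = set()
    unique = []
    for channel, artist, title in candidates:
        key = song_key(artist, title)
        if key in seen:
            continue  # already planned from an earlier channel
        seen.add(key)
        unique.append((channel, artist, title))
    return unique
```

Because the first occurrence wins, the order in which channels are scanned decides which channel a duplicated song is downloaded from.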
🔧 Refactoring Improvements (v3.3)
The codebase has been comprehensively refactored to improve maintainability and reduce code duplication. Recent improvements have enhanced reliability, performance, and code organization:
New Utility Modules (v3.3)
- file_utils.py: Centralized file operations, filename sanitization, and file validation
  - sanitize_filename(): Create safe filenames from artist/title
  - generate_possible_filenames(): Generate filename patterns for different modes
  - check_file_exists_with_patterns(): Check for existing files using multiple patterns
  - is_valid_mp4_file(): Validate MP4 files with header checking
  - cleanup_temp_files(): Remove temporary yt-dlp files
  - ensure_directory_exists(): Safe directory creation
- song_validator.py: Centralized song validation logic
  - SongValidator class: Unified logic for checking if songs should be downloaded
  - should_skip_song(): Comprehensive validation with multiple criteria
  - mark_song_failed(): Consistent failure tracking
  - handle_download_failure(): Standardized error handling
- Enhanced config_manager.py: Robust configuration management with dataclasses
  - ConfigManager class: Type-safe configuration loading and caching
  - DownloadSettings, FolderStructure, LoggingConfig dataclasses
  - Configuration validation and merging with defaults
  - Dynamic resolution updates
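As an illustration of what "header checking" for MP4 files can look like, the sketch below tests for the ftyp box that MP4 (ISO base media) files typically start with. The function name looks_like_mp4 and the minimum-size check are assumptions; the actual is_valid_mp4_file() in file_utils.py may apply different or additional checks.

```python
from pathlib import Path


def looks_like_mp4(path: Path, min_size: int = 12) -> bool:
    """Return True if bytes 4-8 of the file spell 'ftyp', as in a typical MP4 header."""
    try:
        if not path.is_file() or path.stat().st_size < min_size:
            return False
        with path.open("rb") as handle:
            header = handle.read(12)
    except OSError:
        return False
    # The first box is normally: 4-byte size, then the box type 'ftyp'.
    return header[4:8] == b"ftyp"
```

A check like this catches truncated downloads and HTML error pages saved under an .mp4 name, but it does not prove that the rest of the file is playable.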
Benefits Achieved
- Eliminated Code Duplication: ~150 lines of duplicate code removed across modules
- Centralized File Operations: Single source of truth for filename handling and file validation
- Unified Song Validation: Consistent logic for checking if songs should be downloaded
- Enhanced Type Safety: Comprehensive type hints across all new modules
- Improved Configuration Management: Structured configuration with validation and caching
- Better Error Handling: Consistent patterns via centralized utilities
- Enhanced Maintainability: Changes to file operations or song validation only require updates in one place
- Improved Testability: Modular components can be tested independently
- Better Developer Experience: Clear function signatures and comprehensive documentation
Previous Improvements (v3.2)
- Centralized yt-dlp Command Generation: Standardized command building and execution across all download operations
- Enhanced Error Handling: Structured exception hierarchy with consistent error messages and formatting
- Abstracted Download Pipeline: Reusable download → verify → tag → track process for consistent processing
- Download plan pre-scan: Before downloading, the tool scans all channels for songlist matches, builds a download plan, and prints stats (matches, unmatched, per-channel breakdown). The plan is cached for 1 day and reused unless --force-download-plan is set.
- Latest-per-channel plan: Download the latest N videos from each channel, with a per-channel plan and robust resume. Each channel is removed from the plan as it completes. Plan cache is deleted when all channels are done.
- Fast mode with early exit: When a limit is set, the tool scans channels and songs in order, downloads immediately when a match is found, and stops as soon as the limit is reached with successful downloads. This provides much faster performance for small limits compared to the full pre-scan approach.
- Deduplication across channels: Tracks unique song keys (artist + normalized title) to ensure the same song is not downloaded from multiple channels, even if it appears in more than one channel's video list.
- Fuzzy matching: Uses string similarity algorithms to find approximate matches between songlist entries and video titles, tolerating minor differences, typos, or extra words like "Karaoke" or "Official Video".
- Default channel file: For songlist-only and latest-per-channel modes, if no --file is specified, automatically uses data/channels.txt as the default channel list, reducing the need to specify the file path repeatedly.
- Robust interruption handling: Progress is saved after each download, and files are checked for existence before downloading to prevent re-downloads if the process is interrupted.
- Optimized scanning algorithm: High-performance channel scanning with O(n×m) complexity, pre-processed song lookups using sets and dictionaries, and early termination for faster matching of large songlists and channels.
- Enhanced cache management: Improved channel cache key handling for better cache hit rates and reduced YouTube API calls.
- Robust download plan execution: Fixed index management in download plan execution to prevent errors during interrupted downloads.
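The one-day download plan cache described above can be sketched as a JSON file that stores the plan together with its creation time. The payload layout (created_at, plan) and the helper names save_plan and load_plan are assumptions made for illustration; the real format of data/download_plan_cache.json is not specified here.

```python
import json
import time
from pathlib import Path
from typing import Any, Optional

PLAN_MAX_AGE_SECONDS = 24 * 60 * 60  # the plan is cached for 1 day


def save_plan(path: Path, plan: Any, now: Optional[float] = None) -> None:
    """Write the plan with a timestamp so its age can be checked later."""
    payload = {"created_at": time.time() if now is None else now, "plan": plan}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


def load_plan(path: Path, force_refresh: bool = False,
              now: Optional[float] = None) -> Optional[Any]:
    """Return the cached plan, or None if missing, stale, unreadable, or refresh is forced."""
    if force_refresh or not path.exists():
        return None
    try:
        payload = json.loads(path.read_text(encoding="utf-8"))
        created_at = float(payload["created_at"])
        plan = payload["plan"]
    except (OSError, ValueError, KeyError, TypeError):
        return None
    current = time.time() if now is None else now
    if current - created_at > PLAN_MAX_AGE_SECONDS:
        return None
    return plan
```

Here force_refresh plays the role of --force-download-plan: a None result tells the caller to re-scan the channels and save a fresh plan.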
New Parallel Download System (v3.4)
- Parallel downloader module: parallel_downloader.py provides thread-safe concurrent download management
- Configurable concurrency: Use --parallel to enable parallel downloads with 3 workers by default, or --parallel --workers N for custom worker count (1-10)
- Thread-safe operations: All tracking, caching, and progress operations are thread-safe
- Real-time progress tracking: Shows active downloads, completion status, and overall progress
- Automatic retry mechanism: Failed downloads are automatically retried with reduced concurrency
- Backward compatibility: Sequential downloads remain the default when --parallel is not used
- Performance improvements: Significantly faster downloads for large batches (3-5x speedup with 3-5 workers)
- Integrated with all modes: Works with both songlist-across-channels and latest-per-channel download modes
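A worker pool with a lock around shared tracking state is one common way to get the behaviour listed above. The sketch below uses the standard library's ThreadPoolExecutor; download_all and the download_one callback are hypothetical, and it is not known from this document whether parallel_downloader.py is built this way.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List


def download_all(video_ids: List[str], download_one: Callable[[str], bool],
                 workers: int = 3) -> Dict[str, str]:
    """Run download_one for each video on a thread pool and record a status per video."""
    workers = max(1, min(10, workers))  # the documented range is 1-10
    status: Dict[str, str] = {}
    lock = threading.Lock()  # guards the shared status dict

    def run(video_id: str) -> None:
        try:
            ok = download_one(video_id)
        except Exception:
            ok = False  # treat any error as a failed download
        with lock:
            status[video_id] = "downloaded" if ok else "failed"

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run, vid) for vid in video_ids]
        for future in as_completed(futures):
            future.result()  # surface unexpected errors raised inside run()
    return status
```

Entries marked "failed" in the returned dict could then be passed to a second call with a smaller workers value, which mirrors the retry-with-reduced-concurrency idea.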
🚀 Future Enhancements
- Web UI for easier management
- More advanced song matching (multi-language)
- Download scheduling and retry logic
- More granular status reporting
- Parallel downloads for improved speed ✅ COMPLETED
- Enhanced fuzzy matching with improved video title parsing ✅ COMPLETED
- Consolidated extract_artist_title function ✅ COMPLETED
- Duplicate file prevention and filename consistency ✅ COMPLETED
- Unit tests for all modules
- Integration tests for end-to-end workflows
- Plugin system for custom file operations
- Advanced configuration UI
- Real-time download progress visualization
🔧 Recent Bug Fixes & Improvements (v3.4.1)
Enhanced Fuzzy Matching (v3.4.1)
- Improved extract_artist_title function: Enhanced to handle multiple video title formats beyond simple "Artist - Title" patterns
  - "Title Karaoke | Artist Karaoke Version" format: Correctly parses titles like "Hold On Loosely Karaoke | 38 Special Karaoke Version"
  - "Title Artist KARAOKE" format: Handles titles ending with "KARAOKE" and attempts to extract artist information
  - Fallback handling: Returns empty artist and full title for unparseable formats
- Consolidated function usage: Removed duplicate extract_artist_title implementations across modules
  - Single source of truth: All modules now import from fuzzy_matcher.py
  - Consistent parsing: Eliminated inconsistencies between different parsing implementations
  - Better maintainability: Changes to parsing logic only need to be made in one place
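The title formats listed above can be illustrated with a simplified parser. This is a sketch under assumptions, not the extract_artist_title implementation in fuzzy_matcher.py: parse_title and strip_karaoke are hypothetical names, and the "Title Artist KARAOKE" format is left to the fallback because splitting it needs information the title alone does not provide.

```python
import re
from typing import Tuple

_KARAOKE_WORDS = re.compile(r"\s*\b(karaoke version|karaoke)\b\s*$", re.IGNORECASE)


def strip_karaoke(text: str) -> str:
    """Remove a trailing 'Karaoke' or 'Karaoke Version' marker."""
    return _KARAOKE_WORDS.sub("", text).strip()


def parse_title(video_title: str) -> Tuple[str, str]:
    """Return (artist, title); artist is empty when the format is not recognised."""
    if "|" in video_title:
        left, right = (part.strip() for part in video_title.split("|", 1))
        if re.search(r"karaoke", left, re.IGNORECASE):
            # "Title Karaoke | Artist Karaoke Version"
            return strip_karaoke(right), strip_karaoke(left)
    if " - " in video_title:
        # "Artist - Title"
        artist, title = video_title.split(" - ", 1)
        return artist.strip(), strip_karaoke(title)
    # Fallback: empty artist and the full title
    return "", video_title.strip()
```

With this sketch, parse_title("Hold On Loosely Karaoke | 38 Special Karaoke Version") returns ("38 Special", "Hold On Loosely").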
Fixed Import Conflicts
- Resolved import conflict in download_planner.py: Updated to use the enhanced extract_artist_title from fuzzy_matcher.py instead of the simpler version from id3_utils.py
- Updated id3_utils.py: Now imports extract_artist_title from fuzzy_matcher.py for consistency
Enhanced --limit Parameter
- Fixed limit application: The --limit parameter now correctly applies to the scanning phase, not just the download execution
- Improved performance: When using --limit N, only the first N songs are scanned against channels, significantly reducing processing time for large songlists
Benefits of Recent Improvements
- Better matching accuracy: Enhanced fuzzy matching can now handle a wider variety of video title formats commonly found on YouTube karaoke channels
- Reduced false negatives: Songs that previously couldn't be matched due to title format differences now have a higher chance of being found
- Consistent behavior: All parts of the system use the same parsing logic, eliminating edge cases where different modules would parse the same title differently
- Improved performance: The --limit parameter now works as expected, providing faster processing for targeted downloads
- Cleaner codebase: Eliminated duplicate code and import conflicts, making the system more maintainable
🔧 Recent Bug Fixes & Improvements (v3.4.2)
Duplicate File Prevention & Filename Consistency
- Enhanced file existence checking: check_file_exists_with_patterns() now detects files with (2), (3), etc. suffixes that yt-dlp creates
- Automatic duplicate prevention: Download pipeline skips downloads when files already exist (including duplicates)
- Updated yt-dlp configuration: Set "nooverwrites": false to prevent yt-dlp from creating duplicate files with suffixes
- Cleanup utility: data/cleanup_duplicate_files.py provides interactive cleanup of existing duplicate files
- Filename vs ID3 tag consistency: Removed "(Karaoke Version)" suffix from ID3 tags to match filenames exactly
- Unified parsing: Both filename generation and ID3 tagging use the same artist/title extraction logic
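Detecting numbered copies such as "Song (2).mp4" can be done with a small pattern match over the channel folder. The sketch below assumes the copies are named "<stem> (N)<ext>"; find_existing_copies is a hypothetical helper and not the actual check_file_exists_with_patterns().

```python
import re
from pathlib import Path
from typing import List

# Matches a trailing " (2)", " (3)", ... on a file stem (assumed naming).
_DUP_SUFFIX = re.compile(r"^(?P<base>.+?)\s*\((?P<n>\d+)\)$")


def find_existing_copies(folder: Path, stem: str, ext: str = ".mp4") -> List[Path]:
    """Return files named '<stem><ext>' or '<stem> (N)<ext>' in folder, sorted by name."""
    matches: List[Path] = []
    if not folder.is_dir():
        return matches
    for candidate in folder.iterdir():
        if not candidate.is_file() or candidate.suffix.lower() != ext:
            continue
        name = candidate.stem
        m = _DUP_SUFFIX.match(name)
        if name == stem or (m is not None and m.group("base") == stem):
            matches.append(candidate)
    return sorted(matches)
```

A non-empty result means the download can be skipped, and any entries beyond the first are candidates for the cleanup utility.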
Benefits of Duplicate Prevention
- No more duplicate files: Eliminates (2), (3) suffix files that waste disk space
- Consistent metadata: Filename and ID3 tag use identical artist/title format
- Efficient disk usage: Prevents unnecessary downloads of existing files
- Clear file identification: Consistent naming across all file operations
🛠️ Maintenance
Regular Cleanup
- Run the cleanup utility periodically to remove any duplicate files
- Monitor downloads for any new duplicate creation (should be rare with fixes)
Configuration
- Keep "nooverwrites": false in data/config.json
- This prevents yt-dlp from creating duplicate files
Monitoring
- Check logs for "⏭️ Skipping download - file already exists" messages
- These indicate the duplicate prevention is working correctly
🔧 Recent Bug Fixes & Improvements (v3.4.3)
Manual Video Collection System
- New --manual parameter: Simple access to manual video collection via python download_karaoke.py --manual --limit 5
- Static video management: data/manual_videos.json stores individual karaoke videos that don't belong to regular channels
- Helper script: add_manual_video.py provides easy management of manual video entries
- Full integration: Manual videos work with all existing features (songlist matching, fuzzy matching, parallel downloads, etc.)
- No yt-dlp dependency: Manual videos bypass YouTube API calls for video listing, using static data instead
Channel-Specific Parsing Rules
- JSON-based configuration: data/channels.json replaces data/channels.txt with structured channel configuration
- Parsing rules per channel: Each channel can define custom parsing rules for video titles
- Multiple format support: Handles various title formats like "Artist - Title", "Artist Title", "Title | Artist", etc.
- Suffix cleanup: Automatic removal of common karaoke-related suffixes
- Multi-artist support: Parsing for titles with multiple artists separated by specific delimiters
- Backward compatibility: Still supports legacy data/channels.txt format
Benefits of New Features
- Flexible video management: Easy addition of individual karaoke videos without creating new channels
- Accurate parsing: Channel-specific rules ensure correct artist/title extraction for ID3 tags and filenames
- Consistent metadata: Proper parsing prevents filename and ID3 tag inconsistencies
- Easy maintenance: Simple JSON structure for managing both channels and manual videos
- Full feature compatibility: Manual videos work seamlessly with existing download modes and features
📚 Documentation Standards
Documentation Location
- All changes, refactoring, and improvements should be documented in the PRD.md and README.md files
- Do NOT create separate .md files for documenting changes, refactoring, or improvements
- Use the existing sections in PRD.md and README.md to track all project evolution
Where to Document Changes
- PRD.md: Technical details, architecture changes, bug fixes, and implementation specifics
- README.md: User-facing features, usage instructions, and high-level improvements
- CHANGELOG.md: Version-specific release notes and change summaries
Documentation Requirements
- All new features must be documented in both PRD.md and README.md
- All refactoring efforts must be documented in the appropriate sections
- All bug fixes must be documented with technical details
- Version numbers and dates should be clearly marked
- Benefits and improvements should be explicitly stated
Maintenance Responsibility
- Keep PRD.md and README.md synchronized with code changes
- Update documentation immediately when implementing new features
- Remove outdated information and consolidate related changes
- Ensure all CLI options and features are documented in both files
🔧 Recent Bug Fixes & Improvements (v3.4.4)
All Videos Download Mode
- New --all-videos parameter: Download all videos from a channel, not just songlist matches
- Smart MP3/MP4 detection: Automatically detects if you have MP3 versions in songs.json and downloads MP4 video versions
- Existing file skipping: Skips videos that already exist on the filesystem
- Progress tracking: Shows clear progress with "Downloading X/Y videos" format
- Parallel processing support: Works with --parallel --workers N for faster downloads
- Channel focus integration: Works with --channel-focus to target specific channels
- Limit support: Works with --limit N to control download batch size
Smart Songlist Integration
- MP4 version detection: Checks if MP4 version already exists in songs.json before downloading
- MP3 upgrade path: Downloads MP4 video versions when only MP3 versions exist in songlist
- Duplicate prevention: Skips downloads when MP4 versions already exist
- Efficient filtering: Only processes videos that need to be downloaded
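The MP3/MP4 decision above reduces to one question per song: is an MP4 already listed? The sketch below shows that logic under loud assumptions: the layout of songs.json is not described in this document, so the entry fields artist, title, and path are hypothetical, as are the helper names formats_by_song and needs_mp4.

```python
from pathlib import PurePath
from typing import Dict, Iterable, Set


def _key(artist: str, title: str) -> str:
    return f"{artist.strip().lower()}|{title.strip().lower()}"


def formats_by_song(server_songs: Iterable[Dict[str, str]]) -> Dict[str, Set[str]]:
    """Map 'artist|title' to the set of file extensions already listed for that song."""
    index: Dict[str, Set[str]] = {}
    for song in server_songs:
        ext = PurePath(song["path"]).suffix.lower()
        index.setdefault(_key(song["artist"], song["title"]), set()).add(ext)
    return index


def needs_mp4(artist: str, title: str, index: Dict[str, Set[str]]) -> bool:
    """Download unless an MP4 is already listed; an MP3-only entry still needs the video."""
    return ".mp4" not in index.get(_key(artist, title), set())
```

Building the index once and reusing it for every video keeps the per-video check to a single dictionary lookup.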
Benefits of All Videos Mode
- Complete channel downloads: Download entire channels without songlist restrictions
- Automatic format upgrading: Upgrade MP3 collections to MP4 video versions
- Efficient processing: Only downloads videos that don't already exist
- Flexible control: Use with limits, parallel processing, and channel targeting
- Clear progress feedback: Real-time progress tracking for large downloads