🎤 Karaoke Video Downloader
A Python-based Windows CLI tool to download karaoke videos from YouTube channels/playlists using yt-dlp.exe, with advanced tracking, songlist prioritization, and flexible configuration.
✨ Features
- 🎵 Channel & Playlist Downloads: Download all videos from a YouTube channel or playlist
- 📂 Organized Storage: Each channel gets its own folder in downloads/
- 📝 Robust Tracking: Tracks all downloads, statuses, and formats in JSON
- 🏆 Songlist Prioritization: Prioritize or restrict downloads to a custom songlist
- 🔄 Batch Saving & Caching: Efficient, minimizes API calls
- 🏷️ ID3 Tagging: Adds artist/title metadata to MP4 files
- 🧹 Automatic Cleanup: Removes extra yt-dlp files
- 📈 Real-Time Progress: Detailed console and log output
- 🧹 Reset/Clear Channel: Reset all tracking and files for a channel, or clear channel cache via CLI
- 🗂️ Latest-per-channel download: Download the latest N videos from each channel in a single batch, with server deduplication, fuzzy matching support, per-channel download plan, robust resume, and unique plan cache. Use --latest-per-channel and --limit N.
- 🧩 Enhanced Fuzzy Matching: Advanced fuzzy string matching for songlist-to-video matching with improved video title parsing (handles multiple title formats like "Title Karaoke | Artist Karaoke Version")
- ⚡ Fast Mode with Early Exit: When a limit is set, scans channels and songs in order, downloads immediately when a match is found, and stops as soon as the limit is reached with successful downloads
- 🔄 Deduplication Across Channels: Ensures the same song is not downloaded from multiple channels, even if it appears in more than one channel's video list
- 📋 Default Channel File: Automatically uses data/channels.txt as the default channel list for songlist modes (no need to specify --file every time)
- 🛡️ Robust Interruption Handling: Progress is saved after each download, preventing re-downloads if the process is interrupted
- ⚡ Optimized Scanning: High-performance channel scanning with O(n×m) complexity, pre-processed lookups, and early termination for faster matching
- 🏷️ Server Duplicates Tracking: Automatically checks against local songs.json file and marks duplicates for future skipping, preventing re-downloads of songs already on the server
- ⚡ Parallel Downloads: Enable concurrent downloads with --parallel --workers N for significantly faster batch downloads (3-5x speedup)
- 📊 Unmatched Songs Reports: Generate detailed reports of songs that couldn't be found in any channel with --generate-unmatched-report
- 🛡️ Duplicate File Prevention: Automatically detects and prevents duplicate files with (2), (3) suffixes, with cleanup utility for existing duplicates
- 🏷️ Consistent Metadata: Filename and ID3 tag use identical artist/title format for clear file identification
🏗️ Architecture
The codebase has been comprehensively refactored into a modular architecture with centralized utilities for improved maintainability, error handling, and code reuse:
Core Modules:
- downloader.py: Main orchestrator and CLI interface
- video_downloader.py: Core video download execution and orchestration
- tracking_manager.py: Download tracking and status management
- download_planner.py: Download plan building and channel scanning
- cache_manager.py: Cache operations and file I/O management
- channel_manager.py: Channel and file management operations
- songlist_manager.py: Songlist operations and tracking
- server_manager.py: Server song availability checking
- fuzzy_matcher.py: Fuzzy matching logic and similarity functions
Utility Modules (v3.2):
- youtube_utils.py: Centralized YouTube operations and yt-dlp command generation
- error_utils.py: Standardized error handling and formatting
- download_pipeline.py: Abstracted download → verify → tag → track pipeline
- id3_utils.py: ID3 tagging utilities
- config_manager.py: Configuration management
- resolution_cli.py: Resolution checking utilities
- tracking_cli.py: Tracking management CLI
New Utility Modules (v3.3):
- parallel_downloader.py: Parallel download management with thread-safe operations
  - ParallelDownloader class: Manages concurrent downloads with configurable workers
  - DownloadTask and DownloadResult dataclasses: Structured task and result management
  - Thread-safe progress tracking and error handling
  - Automatic retry mechanism for failed downloads
- file_utils.py: Centralized file operations, filename sanitization, and file validation
  - sanitize_filename(): Create safe filenames from artist/title
  - generate_possible_filenames(): Generate filename patterns for different modes
  - check_file_exists_with_patterns(): Check for existing files using multiple patterns
  - is_valid_mp4_file(): Validate MP4 files with header checking
  - cleanup_temp_files(): Remove temporary yt-dlp files
  - ensure_directory_exists(): Safe directory creation
- song_validator.py: Centralized song validation logic
  - SongValidator class: Unified logic for checking if songs should be downloaded
  - should_skip_song(): Comprehensive validation with multiple criteria
  - mark_song_failed(): Consistent failure tracking
  - handle_download_failure(): Standardized error handling
- Enhanced config_manager.py: Robust configuration management with dataclasses
  - ConfigManager class: Type-safe configuration loading and caching
  - DownloadSettings, FolderStructure, LoggingConfig dataclasses
  - Configuration validation and merging with defaults
  - Dynamic resolution updates
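As a rough illustration of what the header checking mentioned for is_valid_mp4_file() could involve, the sketch below looks for the ftyp box that MP4 files normally start with. The project's actual implementation is not shown in this README; the function name looks_like_mp4 and the min_size threshold are assumptions made for this example only.

```python
import os


def looks_like_mp4(path: str, min_size: int = 12) -> bool:
    """Hypothetical header check: MP4 files normally begin with an 'ftyp' box."""
    # Reject missing files and files too small to hold a box header.
    if not os.path.isfile(path) or os.path.getsize(path) < min_size:
        return False
    with open(path, "rb") as fh:
        header = fh.read(12)
    # Bytes 0-3 hold the box size, bytes 4-7 hold the box type.
    return header[4:8] == b"ftyp"
```

A check like this is cheap and catches truncated or empty downloads, but it does not prove the video is playable; that is where ffprobe-based validation would come in.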
Benefits:
- Centralized Utilities: Common operations (file operations, song validation, yt-dlp commands, error handling) are centralized
- Reduced Duplication: Eliminated ~150 lines of code duplication across modules
- Consistency: Standardized error messages and processing pipelines
- Maintainability: Changes isolated to specific modules
- Testability: Modular components can be tested independently
- Type Safety: Comprehensive type hints across all new modules
🔧 Recent Improvements (v3.4.1)
Enhanced Fuzzy Matching
- Improved video title parsing: The extract_artist_title function now handles multiple title formats:
  - "Title Karaoke | Artist Karaoke Version" → Artist: "38 Special", Title: "Hold On Loosely"
  - "Title Artist KARAOKE" → Attempts to extract artist from complex titles
  - "Artist - Title" → Standard format (unchanged)
- Consolidated parsing logic: All modules now use the same extract_artist_title function from fuzzy_matcher.py
- Better matching accuracy: Reduced false negatives for songs with non-standard title formats
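The title formats listed above can be handled along these lines. This is a minimal sketch, not the project's extract_artist_title: the name parse_video_title, the regular expression, and the sample titles are assumptions for illustration, and the harder "Title Artist KARAOKE" case is only stripped of its suffix here rather than split into artist and title.

```python
import re
from typing import Tuple

# Trailing "Karaoke", "Karaoke Version", or "(Karaoke Version)" noise.
_NOISE = re.compile(r"\s*\(?karaoke(\s+version)?\)?\s*$", re.IGNORECASE)


def _strip_karaoke(text: str) -> str:
    return _NOISE.sub("", text).strip()


def parse_video_title(video_title: str) -> Tuple[str, str]:
    """Return (artist, title); artist is "" when it cannot be determined."""
    if "|" in video_title:
        # "Title Karaoke | Artist Karaoke Version"
        left, right = video_title.split("|", 1)
        return _strip_karaoke(right), _strip_karaoke(left)
    if " - " in video_title:
        # "Artist - Title"
        artist, title = video_title.split(" - ", 1)
        return artist.strip(), _strip_karaoke(title)
    # Unknown layout: keep the cleaned title and leave the artist empty.
    return "", _strip_karaoke(video_title)


print(parse_video_title("Hold On Loosely Karaoke | 38 Special Karaoke Version"))
```

Keeping one parser and calling it from both the matcher and the tagger is what makes the filename and ID3 tag agree.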
Fixed --limit Parameter
- Correct limit application: The --limit parameter now properly limits the scanning phase, not just downloads
- Improved performance: When using --limit N, only the first N songs are scanned, significantly reducing processing time
- Accurate logging: Logging messages now show the correct counts for songs that will actually be processed when using --limit
Code Quality Improvements
- Eliminated duplicate functions: Removed duplicate extract_artist_title implementations
- Fixed import conflicts: Resolved inconsistencies between different parsing implementations
- Single source of truth: All title parsing logic is now centralized in fuzzy_matcher.py
🛡️ Duplicate File Prevention & Filename Consistency (v3.4.2)
Duplicate File Prevention
- Enhanced file existence checking: Now detects files with (2), (3), etc. suffixes that yt-dlp creates
- Automatic duplicate prevention: Skips downloads when files already exist (including duplicates)
- Updated yt-dlp configuration: Set "nooverwrites": false to prevent yt-dlp from creating duplicate files
- Cleanup utility: data/cleanup_duplicate_files.py helps identify and remove existing duplicate files
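One way to detect both the base file and its numbered copies is a single anchored pattern. The sketch below is illustrative only: find_existing_copies is a hypothetical helper, and the assumed naming scheme is "Name (2).mp4" with a space before the parenthesis.

```python
import re
from pathlib import Path
from typing import List


def find_existing_copies(folder: Path, stem: str, ext: str = ".mp4") -> List[Path]:
    """Return files named '<stem><ext>' or '<stem> (N)<ext>' inside folder."""
    # Assumed suffix format: a space, then a number in parentheses.
    pattern = re.compile(rf"^{re.escape(stem)}( \(\d+\))?{re.escape(ext)}$")
    return sorted(
        p for p in folder.iterdir() if p.is_file() and pattern.match(p.name)
    )
```

If the returned list is non-empty, the download can be skipped; if it holds more than one entry, the extras are candidates for the cleanup utility.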
Filename vs ID3 Tag Consistency
- Consistent metadata: Filename and ID3 tag now use identical artist/title format
- Removed extra suffixes: No more "(Karaoke Version)" in ID3 tags that don't match filenames
- Unified parsing: Both filename generation and ID3 tagging use the same artist/title extraction
Benefits
- ✅ No more duplicate files with (2), (3) suffixes
- ✅ Consistent metadata between filename and ID3 tags
- ✅ Efficient disk usage by preventing unnecessary downloads
- ✅ Clear file identification with consistent naming
Clean Up Existing Duplicates
# Run the cleanup utility to find and remove existing duplicates
python data/cleanup_duplicate_files.py
# Choose option 1 for dry run (recommended first)
# Choose option 2 to actually delete duplicates
📋 Requirements
- Windows 10/11
- Python 3.7+
- yt-dlp.exe (in downloader/)
- mutagen (for ID3 tagging, optional)
- ffmpeg/ffprobe (for video validation, optional but recommended)
- rapidfuzz (for fuzzy matching, optional, falls back to difflib)
🚀 Quick Start
💡 Pro Tip: For a complete list of all available commands, see commands.txt - you can copy/paste any command directly into your terminal!
Download a Channel
python download_karaoke.py https://www.youtube.com/@SingKingKaraoke/videos
Download Only Songlist Songs (Fast Mode)
python download_karaoke.py --songlist-only --limit 5
Download with Parallel Processing
python download_karaoke.py --parallel --songlist-only --limit 10
Focus on Specific Playlists by Title
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100"
Focus on Specific Playlists from Custom File
python download_karaoke.py --songlist-focus "CCKaraoke" --songlist-file "data/my_custom_songlist.json"
Force Download from Channels (Bypass All Existing File Checks)
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --force
Download with Fuzzy Matching
python download_karaoke.py --songlist-only --limit 10 --fuzzy-match --fuzzy-threshold 85
Download Latest N Videos Per Channel
python download_karaoke.py --latest-per-channel --limit 5
Download Latest N Videos Per Channel (with fuzzy matching)
python download_karaoke.py --latest-per-channel --limit 5 --fuzzy-match --fuzzy-threshold 85
Prioritize Songlist in Download Queue
python download_karaoke.py --songlist-priority
Show Songlist Download Progress
python download_karaoke.py --songlist-status
Limit Number of Downloads
python download_karaoke.py --limit 5
Override Resolution
python download_karaoke.py --resolution 1080p
Reset/Start Over for a Channel
python download_karaoke.py --reset-channel SingKingKaraoke
Reset Channel and Songlist Songs
python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist
Clear Channel Cache
python download_karaoke.py --clear-cache SingKingKaraoke
python download_karaoke.py --clear-cache all
🧠 Songlist Integration
- Place your prioritized song list in data/songList.json (see example format below).
- The tool will match and prioritize these songs across all available channel videos.
- Use --songlist-only to download only these songs, or --songlist-priority to prioritize them in the queue.
- Use --songlist-focus to download only songs from specific playlists by title (e.g., --songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100").
- Download progress for the songlist is tracked globally in data/songlist_tracking.json.
Example data/songList.json
[
{
"title": "2025 - Apple Top 50",
"songs": [
{ "artist": "Kendrick Lamar & SZA", "title": "luther", "position": 1 },
{ "artist": "Kendrick Lamar", "title": "Not Like Us", "position": 2 }
]
},
{
"title": "2024 - Billboard Hot 100",
"songs": [
{ "artist": "Taylor Swift", "title": "Cruel Summer", "position": 1 },
{ "artist": "Billie Eilish", "title": "Happier Than Ever", "position": 2 }
]
}
]
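To show how a file in this shape might be consumed, the sketch below flattens the playlists into one song list and optionally keeps only the playlists named by --songlist-focus. The helper flatten_songlist and the case-insensitive title comparison are assumptions for this example, not the tool's documented behavior.

```python
import json
from typing import Dict, Iterable, List, Optional


def flatten_songlist(
    playlists: List[Dict], focus_titles: Optional[Iterable[str]] = None
) -> List[Dict]:
    """Flatten playlists into songs, optionally keeping only focused playlists."""
    wanted = {t.lower() for t in focus_titles} if focus_titles else None
    songs: List[Dict] = []
    for playlist in playlists:
        if wanted is not None and playlist["title"].lower() not in wanted:
            continue
        for song in playlist.get("songs", []):
            # Remember which playlist each song came from.
            songs.append({**song, "playlist": playlist["title"]})
    return songs


raw = """[
  {"title": "2025 - Apple Top 50",
   "songs": [{"artist": "Kendrick Lamar", "title": "Not Like Us", "position": 2}]},
  {"title": "2024 - Billboard Hot 100",
   "songs": [{"artist": "Taylor Swift", "title": "Cruel Summer", "position": 1}]}
]"""
focused = flatten_songlist(json.loads(raw), ["2025 - Apple Top 50"])
print([s["title"] for s in focused])
```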
🛠️ Tracking & Caching
- data/karaoke_tracking.json: Tracks all downloads, statuses, and formats
- data/songlist_tracking.json: Tracks global songlist download progress
- data/server_duplicates_tracking.json: Tracks songs found to be duplicates on the server for future skipping
- data/channel_cache.json: Caches channel video lists for performance
📂 Folder Structure
KaroakeVideoDownloader/
├── commands.txt # Complete CLI commands reference (copy/paste ready)
├── karaoke_downloader/ # All core Python code and utilities
│ ├── downloader.py # Main orchestrator and CLI interface
│ ├── cli.py # CLI entry point
│ ├── video_downloader.py # Core video download execution and orchestration
│ ├── tracking_manager.py # Download tracking and status management
│ ├── download_planner.py # Download plan building and channel scanning
│ ├── cache_manager.py # Cache operations and file I/O management
│ ├── channel_manager.py # Channel and file management operations
│ ├── songlist_manager.py # Songlist operations and tracking
│ ├── server_manager.py # Server song availability checking
│ ├── fuzzy_matcher.py # Fuzzy matching logic and similarity functions
│ ├── youtube_utils.py # Centralized YouTube operations and yt-dlp commands
│ ├── error_utils.py # Standardized error handling and formatting
│ ├── download_pipeline.py # Abstracted download → verify → tag → track pipeline
│ ├── id3_utils.py # ID3 tagging utilities
│ ├── config_manager.py # Configuration management with dataclasses
│ ├── file_utils.py # Centralized file operations and filename handling
│ ├── song_validator.py # Centralized song validation logic
│ ├── check_resolution.py # Resolution checker utility
│ ├── resolution_cli.py # Resolution config CLI
│ └── tracking_cli.py # Tracking management CLI
├── data/ # All config, tracking, cache, and songlist files
│ ├── config.json
│ ├── karaoke_tracking.json
│ ├── songlist_tracking.json
│ ├── channel_cache.json
│ ├── channels.txt
│ └── songList.json
├── downloads/ # All video output
│ └── [ChannelName]/ # Per-channel folders
├── logs/ # Download logs
├── downloader/yt-dlp.exe # yt-dlp binary
├── tests/ # Diagnostic and test scripts
│ └── test_installation.py
├── download_karaoke.py # Main entry point (thin wrapper)
├── README.md
├── PRD.md
├── requirements.txt
└── download_karaoke.bat # (optional Windows launcher)
🚦 CLI Options
📋 Complete Command Reference: See commands.txt for all available commands with examples - perfect for copy/paste!
Key Options:
- --file <data/channels.txt>: Download from a list of channels (optional, defaults to data/channels.txt for songlist modes)
- --songlist-priority: Prioritize songlist songs in download queue
- --songlist-only: Download only songs from the songlist
- --songlist-focus <PLAYLIST_TITLE1> <PLAYLIST_TITLE2>...: Focus on specific playlists by title (e.g., --songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100")
- --songlist-file <FILE_PATH>: Custom songlist file path to use with --songlist-focus (default: data/songList.json)
- --songlist-status: Show songlist download progress
- --limit <N>: Limit number of downloads (enables fast mode with early exit)
- --resolution <720p|1080p|...>: Override resolution
- --status: Show download/tracking status
- --reset-channel <CHANNEL_NAME>: Reset all tracking and files for a channel
- --reset-songlist: When used with --reset-channel, also reset songlist songs for this channel
- --clear-cache <CHANNEL_ID|all>: Clear channel video cache for a specific channel or all
- --clear-server-duplicates: Clear server duplicates tracking (allows re-checking songs against server)
- --latest-per-channel: Download the latest N videos from each channel (use with --limit)
- --fuzzy-match: Enable fuzzy matching for songlist-to-video matching (uses rapidfuzz if available)
- --fuzzy-threshold <N>: Fuzzy match threshold (0-100, default 85)
- --parallel: Enable parallel downloads for improved speed (defaults to 3 workers)
- --workers <N>: Number of parallel download workers (1-10, default: 3, only used with --parallel)
- --generate-songlist <DIR1> <DIR2>...: Generate song list from MP4 files with ID3 tags in specified directories
- --no-append-songlist: Create a new song list instead of appending when using --generate-songlist
- --force: Force download from channels, bypassing all existing file checks and re-downloading if necessary
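For readers who want to see how a handful of these flags fit together, here is a sketch using Python's standard argparse module. Whether the project itself uses argparse is an assumption; only the flag names, ranges, and defaults are taken from the list above, and build_parser is a hypothetical helper.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare a small subset of the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="download_karaoke.py")
    parser.add_argument("--songlist-only", action="store_true")
    # One or more playlist titles, e.g. "2025 - Apple Top 50".
    parser.add_argument("--songlist-focus", nargs="+", metavar="PLAYLIST_TITLE")
    parser.add_argument("--limit", type=int)
    parser.add_argument("--fuzzy-match", action="store_true")
    parser.add_argument("--fuzzy-threshold", type=int, default=85)
    parser.add_argument("--parallel", action="store_true")
    # Documented range is 1-10 with a default of 3.
    parser.add_argument("--workers", type=int, choices=range(1, 11), default=3)
    return parser


args = build_parser().parse_args(
    ["--songlist-focus", "2025 - Apple Top 50", "--parallel", "--workers", "5", "--limit", "10"]
)
print(args.songlist_focus, args.workers, args.limit)
```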
📝 Example Usage
💡 For complete examples: See commands.txt for all command variations with explanations!
# Fast mode with fuzzy matching (no need to specify --file)
python download_karaoke.py --songlist-only --limit 10 --fuzzy-match --fuzzy-threshold 85
# Parallel downloads for faster processing
python download_karaoke.py --parallel --songlist-only --limit 10
# Latest videos per channel with parallel downloads
python download_karaoke.py --parallel --latest-per-channel --limit 5
# Traditional full scan (no limit)
python download_karaoke.py --songlist-only
# Focused fuzzy matching (target specific playlists with flexible matching)
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --fuzzy-match --fuzzy-threshold 80 --limit 10
# Focus on specific playlists from a custom file
python download_karaoke.py --songlist-focus "CCKaraoke" --songlist-file "data/my_custom_songlist.json" --limit 10
# Force download with fuzzy matching (bypass all existing file checks)
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --force --fuzzy-match --fuzzy-threshold 80 --limit 10
# Channel-specific operations
python download_karaoke.py --reset-channel SingKingKaraoke
python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist
python download_karaoke.py --clear-cache all
python download_karaoke.py --clear-server-duplicates
# Song list generation from MP4 files
python download_karaoke.py --generate-songlist /path/to/mp4/directory
python download_karaoke.py --generate-songlist /path/to/dir1 /path/to/dir2 --no-append-songlist
# Generate report of songs that couldn't be found
python download_karaoke.py --generate-unmatched-report
python download_karaoke.py --generate-unmatched-report --fuzzy-match --fuzzy-threshold 85
🏷️ ID3 Tagging
- Adds artist/title/album/genre to MP4 files using mutagen (if installed)
📋 Song List Generation
- Generate song lists from existing MP4 files: Use --generate-songlist to create song lists from directories containing MP4 files with ID3 tags
- Automatic ID3 extraction: Extracts artist and title from MP4 files' ID3 tags
- Directory-based organization: Each directory becomes a playlist with the directory name as the title
- Position tracking: Songs are numbered starting from 1 based on file order
- Append or replace: Choose to append to existing song list or create a new one with --no-append-songlist
- Multiple directories: Process multiple directories in a single command
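The rules above (directory name as playlist title, positions starting at 1, append or replace) can be sketched as follows. Tag reading is left out so the example stays self-contained: the (artist, title) pairs are assumed to have been extracted already, and build_playlist and merge_songlists are hypothetical helper names.

```python
from typing import Dict, List, Tuple


def build_playlist(directory_name: str, tagged_files: List[Tuple[str, str]]) -> Dict:
    """Build one playlist from (artist, title) pairs listed in file order."""
    return {
        "title": directory_name,
        "songs": [
            {"artist": artist, "title": title, "position": pos}
            for pos, (artist, title) in enumerate(tagged_files, start=1)
        ],
    }


def merge_songlists(existing: List[Dict], new: List[Dict], append: bool = True) -> List[Dict]:
    """Append to the existing list, or replace it (as with --no-append-songlist)."""
    return existing + new if append else list(new)


playlist = build_playlist("Party Mix", [("ABBA", "Dancing Queen"), ("Queen", "Bohemian Rhapsody")])
print(playlist["songs"][1]["position"])
```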
🧹 Cleanup
- Removes .info.json and .meta files after download
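A cleanup pass of this kind might look like the sketch below. The helper name remove_sidecar_files is hypothetical; the two suffixes are the ones named above.

```python
from pathlib import Path
from typing import List, Tuple


def remove_sidecar_files(
    folder: Path, suffixes: Tuple[str, ...] = (".info.json", ".meta")
) -> List[str]:
    """Delete leftover sidecar files in folder and return their names."""
    removed: List[str] = []
    # Snapshot the listing first so deletions do not disturb iteration.
    for path in list(folder.iterdir()):
        if path.is_file() and path.name.endswith(suffixes):
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```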
🛠️ Configuration
- All options are in data/config.json (format, resolution, metadata, etc.)
- You can edit this file or use CLI flags to override
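The precedence described here (built-in defaults, then data/config.json, then CLI flags) can be sketched with a dataclass. DownloadSettings is a class name this README mentions, but its fields, its default values, and the resolve_settings helper below are assumptions for illustration only.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class DownloadSettings:
    # Field names and defaults are assumed, not taken from the project.
    resolution: str = "720p"
    format: str = "mp4"


def resolve_settings(
    file_config: Dict[str, Any], cli_resolution: Optional[str] = None
) -> DownloadSettings:
    """Merge defaults with file values, then let a CLI value win."""
    known = {f.name for f in fields(DownloadSettings)}
    # Unknown keys in the file are ignored rather than raising.
    settings = DownloadSettings(**{k: v for k, v in file_config.items() if k in known})
    if cli_resolution is not None:
        settings = replace(settings, resolution=cli_resolution)
    return settings


print(resolve_settings({"resolution": "480p"}, cli_resolution="1080p"))
```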
📋 Command Reference File
commands.txt contains a comprehensive list of all CLI commands with explanations. This file is designed for easy copy/paste usage and includes:
- All basic download commands
- Songlist operations
- Latest-per-channel downloads
- Cache and tracking management
- Reset and cleanup operations
- Advanced combinations
- Common workflows
- Troubleshooting commands
🔄 Maintenance Note: The commands.txt file should be kept up to date with any CLI changes. When adding new command-line options or modifying existing ones, update this file to reflect all available commands and their usage.
📚 Documentation Standards
Documentation Location
- All changes, refactoring, and improvements should be documented in the PRD.md and README.md files
- Do NOT create separate .md files for documenting changes, refactoring, or improvements
- Use the existing sections in PRD.md and README.md to track all project evolution
Where to Document Changes
- PRD.md: Technical details, architecture changes, bug fixes, and implementation specifics
- README.md: User-facing features, usage instructions, and high-level improvements
- CHANGELOG.md: Version-specific release notes and change summaries
Documentation Requirements
- All new features must be documented in both PRD.md and README.md
- All refactoring efforts must be documented in the appropriate sections
- All bug fixes must be documented with technical details
- Version numbers and dates should be clearly marked
- Benefits and improvements should be explicitly stated
Maintenance Responsibility
- Keep PRD.md and README.md synchronized with code changes
- Update documentation immediately when implementing new features
- Remove outdated information and consolidate related changes
- Ensure all CLI options and features are documented in both files
🔧 Refactoring Improvements (v3.3)
The codebase has been comprehensively refactored to improve maintainability and reduce code duplication. Recent improvements have enhanced reliability, performance, and code organization:
New Utility Modules (v3.3)
- file_utils.py: Centralized file operations, filename sanitization, and file validation
  - sanitize_filename(): Create safe filenames from artist/title
  - generate_possible_filenames(): Generate filename patterns for different modes
  - check_file_exists_with_patterns(): Check for existing files using multiple patterns
  - is_valid_mp4_file(): Validate MP4 files with header checking
  - cleanup_temp_files(): Remove temporary yt-dlp files
  - ensure_directory_exists(): Safe directory creation
- song_validator.py: Centralized song validation logic
  - SongValidator class: Unified logic for checking if songs should be downloaded
  - should_skip_song(): Comprehensive validation with multiple criteria
  - mark_song_failed(): Consistent failure tracking
  - handle_download_failure(): Standardized error handling
- Enhanced config_manager.py: Robust configuration management with dataclasses
  - ConfigManager class: Type-safe configuration loading and caching
  - DownloadSettings, FolderStructure, LoggingConfig dataclasses
  - Configuration validation and merging with defaults
  - Dynamic resolution updates
Benefits Achieved
- Eliminated Code Duplication: ~150 lines of duplicate code removed across modules
- Centralized File Operations: Single source of truth for filename handling and file validation
- Unified Song Validation: Consistent logic for checking if songs should be downloaded
- Enhanced Type Safety: Comprehensive type hints across all new modules
- Improved Configuration Management: Structured configuration with validation and caching
- Better Error Handling: Consistent patterns via centralized utilities
- Enhanced Maintainability: Changes to file operations or song validation only require updates in one place
- Improved Testability: Modular components can be tested independently
- Better Developer Experience: Clear function signatures and comprehensive documentation
New Parallel Download System (v3.4)
- Parallel downloader module: parallel_downloader.py provides thread-safe concurrent download management
- Configurable concurrency: Use --parallel to enable parallel downloads with 3 workers by default, or --parallel --workers N for custom worker count (1-10)
- Thread-safe operations: All tracking, caching, and progress operations are thread-safe
- Real-time progress tracking: Shows active downloads, completion status, and overall progress
- Automatic retry mechanism: Failed downloads are automatically retried with reduced concurrency
- Backward compatibility: Sequential downloads remain the default when --parallel is not used
- Performance improvements: Significantly faster downloads for large batches (3-5x speedup with 3-5 workers)
- Integrated with all modes: Works with both songlist-across-channels and latest-per-channel download modes
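The general pattern behind a worker pool with thread-safe result tracking can be sketched with the standard library. This is not the ParallelDownloader class itself: run_parallel and the download callable are hypothetical stand-ins, and the retry step described above is omitted to keep the example short.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List


def run_parallel(
    tasks: List[str], download: Callable[[str], bool], workers: int = 3
) -> Dict[str, bool]:
    """Run download(task) for every task on a pool and record success per task."""
    results: Dict[str, bool] = {}
    lock = threading.Lock()

    def worker(task: str) -> None:
        try:
            ok = download(task)
        except Exception:
            # A failing download must not take down the whole batch.
            ok = False
        with lock:  # guard the shared results dict
            results[task] = ok

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(worker, task) for task in tasks]
        for future in as_completed(futures):
            future.result()
    return results


print(run_parallel(["a", "b", "bad"], lambda task: task != "bad"))
```

Threads suit this workload because each download spends most of its time waiting on the network or on the yt-dlp subprocess rather than running Python code.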
Previous Improvements (v3.2)
- Centralized yt-dlp Command Generation: Standardized command building and execution across all download operations
- Enhanced Error Handling: Structured exception hierarchy with consistent error messages and formatting
- Abstracted Download Pipeline: Reusable download → verify → tag → track process for consistent processing
- Download plan pre-scan: Before downloading, the tool scans all channels for songlist matches, builds a download plan, and prints stats (matches, unmatched, per-channel breakdown). The plan is cached for 1 day and reused unless --force-download-plan is set.
- Latest-per-channel plan: Download the latest N videos from each channel, with a per-channel plan and robust resume. Each channel is removed from the plan as it completes. Plan cache is deleted when all channels are done.
- Fast mode with early exit: When a limit is set, the tool scans channels and songs in order, downloads immediately when a match is found, and stops as soon as the limit is reached with successful downloads. This provides much faster performance for small limits compared to the full pre-scan approach.
- Deduplication across channels: Tracks unique song keys (artist + normalized title) to ensure the same song is not downloaded from multiple channels, even if it appears in more than one channel's video list.
- Fuzzy matching: Uses string similarity algorithms to find approximate matches between songlist entries and video titles, tolerating minor differences, typos, or extra words like "Karaoke" or "Official Video".
- Default channel file: For songlist-only and latest-per-channel modes, if no --file is specified, automatically uses data/channels.txt as the default channel list, reducing the need to specify the file path repeatedly.
- Robust interruption handling: Progress is saved after each download, and files are checked for existence before downloading to prevent re-downloads if the process is interrupted.
- Optimized scanning algorithm: High-performance channel scanning with O(n×m) complexity, pre-processed song lookups using sets and dictionaries, and early termination for faster matching of large songlists and channels.
- Enhanced cache management: Improved channel cache key handling for better cache hit rates and reduced YouTube API calls.
- Robust download plan execution: Fixed index management in download plan execution to prevent errors during interrupted downloads.
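Two of the ideas above, deduplication by song key and early exit once the limit is reached, combine naturally in one loop. The sketch below is a simplified illustration: song_key, plan_with_early_exit, and the in-memory channel data are assumptions, exact key matching stands in for fuzzy matching, and picking a video stands in for a successful download.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple


def song_key(artist: str, title: str) -> str:
    """Normalized key used to recognize the same song across channels."""
    return f"{artist.strip().lower()}|{title.strip().lower()}"


def plan_with_early_exit(
    channels: Dict[str, List[Tuple[str, str]]],
    wanted: Iterable[Tuple[str, str]],
    limit: Optional[int] = None,
) -> List[Tuple[str, str, str]]:
    """Scan channels in order, skip duplicates, stop once limit songs are picked."""
    wanted_keys = {song_key(a, t) for a, t in wanted}  # set lookup per video
    seen: Set[str] = set()
    picked: List[Tuple[str, str, str]] = []
    for channel, videos in channels.items():
        for artist, title in videos:
            key = song_key(artist, title)
            if key not in wanted_keys or key in seen:
                continue
            seen.add(key)
            picked.append((channel, artist, title))
            if limit is not None and len(picked) >= limit:
                return picked  # early exit: no need to scan further
    return picked


channels = {
    "A": [("Adele", "Hello"), ("Queen", "Bohemian Rhapsody")],
    "B": [("adele", "hello "), ("ABBA", "Dancing Queen")],
}
wanted = [("Adele", "Hello"), ("ABBA", "Dancing Queen"), ("Queen", "Bohemian Rhapsody")]
print(plan_with_early_exit(channels, wanted, limit=2))
```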
🐞 Troubleshooting
- Ensure yt-dlp.exe is in the downloader/ folder
- Check logs/ for error details
- Use python -m karaoke_downloader.check_resolution to verify video quality
- If you see errors about ffmpeg/ffprobe, install ffmpeg and ensure it is in your PATH
- For best fuzzy matching, install rapidfuzz: pip install rapidfuzz (otherwise falls back to slower, less accurate difflib)
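The rapidfuzz-or-difflib fallback mentioned in the last item generally follows the optional-import pattern below. The function names similarity and is_match are hypothetical, and the project's own scoring may differ; fuzz.ratio from rapidfuzz and SequenceMatcher from difflib are the only library calls used.

```python
from difflib import SequenceMatcher

try:
    from rapidfuzz import fuzz  # optional dependency

    def similarity(a: str, b: str) -> float:
        """Score from 0 to 100 using rapidfuzz."""
        return float(fuzz.ratio(a.lower(), b.lower()))

except ImportError:

    def similarity(a: str, b: str) -> float:
        """Score from 0 to 100 using the standard library fallback."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100


def is_match(a: str, b: str, threshold: float = 85.0) -> bool:
    """True when the similarity score reaches the threshold (default 85)."""
    return similarity(a, b) >= threshold
```

The two backends do not produce identical scores for every pair of strings, so a threshold tuned with one may behave slightly differently with the other.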
Happy Karaoke! 🎤