KaraokeVideoDownloader/PRD.md

🎤 Karaoke Video Downloader PRD (v3.4.3)

Overview

A Python-based Windows CLI tool to download karaoke videos from YouTube channels/playlists using yt-dlp.exe, with advanced tracking, songlist prioritization, and flexible configuration. The codebase has been comprehensively refactored into a modular architecture with centralized utilities for improved maintainability, error handling, and code reuse.


🏗️ Architecture

The codebase has been refactored into focused modules with centralized utilities:

Core Modules:

  • downloader.py: Main orchestrator and CLI interface
  • video_downloader.py: Core video download execution and orchestration
  • tracking_manager.py: Download tracking and status management
  • download_planner.py: Download plan building and channel scanning
  • cache_manager.py: Cache operations and file I/O management
  • channel_manager.py: Channel and file management operations
  • songlist_manager.py: Songlist operations and tracking
  • server_manager.py: Server song availability checking
  • fuzzy_matcher.py: Fuzzy matching logic and similarity functions

Utility Modules (v3.2):

  • youtube_utils.py: Centralized YouTube operations and yt-dlp command generation
  • error_utils.py: Standardized error handling and formatting
  • download_pipeline.py: Abstracted download → verify → tag → track pipeline
  • id3_utils.py: ID3 tagging utilities
  • config_manager.py: Configuration management
  • resolution_cli.py: Resolution checking utilities
  • tracking_cli.py: Tracking management CLI

New Utility Modules (v3.3):

  • file_utils.py: Centralized file operations, filename sanitization, and file validation
  • song_validator.py: Centralized song validation logic for checking if songs should be downloaded

Benefits of Enhanced Modular Architecture:

  • Single Responsibility: Each module has a focused purpose
  • Centralized Utilities: Common operations (file operations, song validation, yt-dlp commands, error handling) are centralized
  • Reduced Duplication: Eliminated ~150 lines of code duplication across modules
  • Testability: Individual components can be tested separately
  • Maintainability: Easier to find and fix issues
  • Reusability: Components can be used independently
  • Robustness: Better error handling and interruption recovery
  • Consistency: Standardized error messages and processing pipelines
  • Type Safety: Comprehensive type hints across all new modules

📋 Goals

  • Download karaoke videos from YouTube channels or playlists.
  • Organize downloads by channel (or playlist) in subfolders.
  • Avoid re-downloading the same videos (robust tracking).
  • Prioritize and track a custom songlist across channels.
  • Allow flexible, user-friendly configuration.
  • Provide robust interruption handling and progress recovery.

🧑‍💻 Target Users

  • Karaoke DJs, home karaoke users, event hosts, or anyone needing offline karaoke video libraries.
  • Users comfortable with command-line tools.

⚙️ Platform & Stack

  • Platform: Windows
  • Interface: Command-line (CLI)
  • Tech Stack: Python 3.7+, yt-dlp.exe, mutagen (for ID3 tagging)

📥 Input

  • YouTube channel or playlist URLs (e.g. https://www.youtube.com/@SingKingKaraoke/videos)
  • Optional: data/channels.txt file with multiple channel URLs (one per line); used by default in songlist-only and latest-per-channel modes when --file is not specified
  • Optional: data/songList.json for prioritized song downloads

Example Usage

python download_karaoke.py https://www.youtube.com/@SingKingKaraoke/videos
python download_karaoke.py --songlist-only --limit 5
python download_karaoke.py --latest-per-channel --limit 3
python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist
python download_karaoke.py --clear-cache SingKingKaraoke

📤 Output

  • MP4 files in downloads/<ChannelName>/ subfolders
  • All videos tracked in data/karaoke_tracking.json
  • Songlist progress tracked in data/songlist_tracking.json
  • Logs in logs/

🛠️ Features

  • Channel-based downloads (with per-channel folders)
  • Robust JSON tracking (downloaded, partial, failed, etc.)
  • Batch saving and channel video caching for performance
  • Configurable download resolution and yt-dlp options (data/config.json)
  • Songlist integration: prioritize and track custom songlists
  • Songlist-only mode: download only songs from the songlist
  • Songlist focus mode: download only songs from specific playlists by title
  • Force download mode: bypass all existing file checks and re-download songs regardless of server duplicates or existing files
  • Global songlist tracking to avoid duplicates across channels
  • ID3 tagging for artist/title in MP4 files (mutagen)
  • Real-time progress and detailed logging
  • Automatic cleanup of extra yt-dlp files
  • Reset/clear channel tracking and files via CLI
  • Clear channel cache via CLI
  • Download plan pre-scan and caching: Before downloading, the tool pre-scans all channels for songlist matches, builds a download plan, and prints stats. The plan is cached for 1 day in data/download_plan_cache.json for fast resuming and reliability. Use --force-download-plan to force a refresh.
  • Latest-per-channel download: Download the latest N videos from each channel in a single batch, with a per-channel download plan, robust resume, and unique plan cache. Use --latest-per-channel and --limit N.
  • Fast mode with early exit: When a limit is set, the tool scans channels and songs in order, downloads immediately when a match is found, and stops as soon as the limit is reached with successful downloads. If a download fails, it continues scanning until the limit is satisfied or all channels are exhausted.
  • Deduplication across channels: Ensures the same song (by artist + normalized title) is not downloaded more than once, even if it appears in multiple channels. Tracks unique keys and skips duplicates.
  • Fuzzy matching: Optionally use fuzzy string matching for songlist-to-video matching with configurable threshold (0-100, default 85). Uses rapidfuzz if available, falls back to difflib.
  • Default channel file: If no --file is specified for songlist-only or latest-per-channel modes, automatically uses data/channels.txt as the default channel list.
  • Robust interruption handling: Progress is saved after each download, and files are checked for existence before downloading to prevent re-downloads if the process is interrupted.
  • Optimized scanning performance: High-performance channel scanning that avoids the naive O(n×m) pairwise comparison where possible by using pre-processed lookups and early termination, for faster matching of large songlists and channels.
  • Centralized yt-dlp command generation: Standardized command building and execution across all download operations
  • Enhanced error handling: Structured exception hierarchy with consistent error messages and formatting
  • Abstracted download pipeline: Reusable download → verify → tag → track process for consistent processing
  • Reduced code duplication: Eliminated duplicate code across modules through centralized utilities
  • Centralized file operations: Single source of truth for filename sanitization, file validation, and path operations
  • Centralized song validation: Unified logic for checking if songs should be downloaded across all modules
  • Enhanced configuration management: Structured configuration with dataclasses, type safety, and validation
  • Manual video collection: Static video collection system for managing individual karaoke videos that don't belong to regular channels. Use --manual to download from data/manual_videos.json.
  • Channel-specific parsing rules: JSON-based configuration for parsing video titles from different YouTube channels, with support for various title formats and cleanup rules.
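
The one-day download plan cache described above can be illustrated with a small sketch. This is not the project's actual code: the function names `load_cached_plan` and `save_plan`, the `created_at` field, and the payload layout are assumptions made for illustration. Only the cache path (data/download_plan_cache.json), the one-day lifetime, and the forced refresh come from this document.

```python
import json
import time
from pathlib import Path
from typing import Optional

# "The plan is cached for 1 day": expressed here in seconds.
PLAN_CACHE_MAX_AGE_SECONDS = 24 * 60 * 60


def load_cached_plan(cache_path: Path, force_refresh: bool = False,
                     now: Optional[float] = None) -> Optional[dict]:
    """Return the cached plan, or None if it is missing, stale, or unreadable.

    force_refresh mirrors the effect of --force-download-plan (hypothetical wiring).
    """
    if force_refresh or not cache_path.exists():
        return None
    try:
        payload = json.loads(cache_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None
    if not isinstance(payload, dict):
        return None
    created = payload.get("created_at")  # hypothetical field name
    if not isinstance(created, (int, float)):
        return None
    current = time.time() if now is None else now
    if current - created > PLAN_CACHE_MAX_AGE_SECONDS:
        return None  # older than one day: caller should re-scan
    return payload.get("plan")


def save_plan(cache_path: Path, plan: dict, now: Optional[float] = None) -> None:
    """Write the plan together with its creation timestamp."""
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"created_at": time.time() if now is None else now, "plan": plan}
    cache_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
```

A caller would try `load_cached_plan(Path("data/download_plan_cache.json"))` first and only re-scan the channels when it returns None.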

📂 Folder Structure

KaraokeVideoDownloader/
├── karaoke_downloader/         # All core Python code and utilities
│   ├── downloader.py           # Main orchestrator and CLI interface
│   ├── cli.py                  # CLI entry point
│   ├── video_downloader.py     # Core video download execution and orchestration
│   ├── tracking_manager.py     # Download tracking and status management
│   ├── download_planner.py     # Download plan building and channel scanning
│   ├── cache_manager.py        # Cache operations and file I/O management
│   ├── channel_manager.py      # Channel and file management operations
│   ├── songlist_manager.py     # Songlist operations and tracking
│   ├── server_manager.py       # Server song availability checking
│   ├── fuzzy_matcher.py        # Fuzzy matching logic and similarity functions
│   ├── youtube_utils.py        # Centralized YouTube operations and yt-dlp commands
│   ├── error_utils.py          # Standardized error handling and formatting
│   ├── download_pipeline.py    # Abstracted download → verify → tag → track pipeline
│   ├── id3_utils.py            # ID3 tagging utilities
│   ├── config_manager.py       # Configuration management with dataclasses
│   ├── file_utils.py           # Centralized file operations and filename handling
│   ├── song_validator.py       # Centralized song validation logic
│   ├── check_resolution.py     # Resolution checker utility
│   ├── resolution_cli.py       # Resolution config CLI
│   └── tracking_cli.py         # Tracking management CLI
├── data/                      # All config, tracking, cache, and songlist files
│   ├── config.json
│   ├── karaoke_tracking.json
│   ├── songlist_tracking.json
│   ├── channel_cache.json
│   ├── channels.json          # Channel configuration with parsing rules
│   ├── channels.txt           # Legacy channel list (backward compatibility)
│   ├── manual_videos.json     # Manual video collection
│   └── songList.json
├── downloads/                 # All video output
│   └── [ChannelName]/         # Per-channel folders
├── logs/                      # Download logs
├── downloader/yt-dlp.exe      # yt-dlp binary
├── tests/                     # Diagnostic and test scripts
│   └── test_installation.py
├── download_karaoke.py        # Main entry point (thin wrapper)
├── README.md
├── PRD.md
├── requirements.txt
└── download_karaoke.bat       # (optional Windows launcher)

🚦 CLI Options (Summary)

  • --file <data/channels.txt>: Download from a list of channels (optional, defaults to data/channels.txt for songlist modes)
  • --songlist-priority: Prioritize songlist songs in download queue
  • --songlist-only: Download only songs from the songlist
  • --songlist-focus <PLAYLIST_TITLE1> <PLAYLIST_TITLE2>...: Focus on specific playlists by title (e.g., --songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100")
  • --songlist-file <FILE_PATH>: Custom songlist file path to use with --songlist-focus (default: data/songList.json)
  • --force: Force download from channels, bypassing all existing file checks and re-downloading if necessary
  • --songlist-status: Show songlist download progress
  • --limit <N>: Limit number of downloads (enables fast mode with early exit)
  • --resolution <720p|1080p|...>: Override resolution
  • --status: Show download/tracking status
  • --reset-channel <CHANNEL_NAME>: Reset all tracking and files for a channel
  • --reset-songlist: When used with --reset-channel, also reset songlist songs for this channel
  • --clear-cache <CHANNEL_ID|all>: Clear channel video cache for a specific channel or all
  • --force-download-plan: Force refresh the download plan cache (re-scan all channels for matches)
  • --latest-per-channel: Download the latest N videos from each channel (use with --limit)
  • --fuzzy-match: Enable fuzzy matching for songlist-to-video matching (uses rapidfuzz if available)
  • --fuzzy-threshold <N>: Fuzzy match threshold (0-100, default 85)
  • --parallel: Enable parallel downloads for improved speed
  • --workers <N>: Number of parallel download workers (1-10, default: 3, only used with --parallel)
  • --manual: Download from manual videos collection (data/manual_videos.json)
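
As a rough illustration of how a few of the options above could be declared, here is a minimal `argparse` sketch. The flag names, defaults (3 workers, threshold 85), and ranges (1-10, 0-100) are taken from the list above; the function names and the way validation is done are assumptions, and the real cli.py may be organized differently.

```python
import argparse


def workers_type(value: str) -> int:
    """Accept only worker counts in the documented 1-10 range."""
    count = int(value)
    if not 1 <= count <= 10:
        raise argparse.ArgumentTypeError("workers must be between 1 and 10")
    return count


def threshold_type(value: str) -> int:
    """Accept only fuzzy thresholds in the documented 0-100 range."""
    threshold = int(value)
    if not 0 <= threshold <= 100:
        raise argparse.ArgumentTypeError("threshold must be between 0 and 100")
    return threshold


def build_parser() -> argparse.ArgumentParser:
    """Build a parser covering a subset of the documented options."""
    parser = argparse.ArgumentParser(prog="download_karaoke.py")
    parser.add_argument("url", nargs="?", help="YouTube channel or playlist URL")
    parser.add_argument("--songlist-only", action="store_true")
    parser.add_argument("--latest-per-channel", action="store_true")
    parser.add_argument("--limit", type=int, default=None)
    parser.add_argument("--fuzzy-match", action="store_true")
    parser.add_argument("--fuzzy-threshold", type=threshold_type, default=85)
    parser.add_argument("--parallel", action="store_true")
    parser.add_argument("--workers", type=workers_type, default=3)
    return parser
```

With this sketch, `build_parser().parse_args(["--songlist-only", "--limit", "5"])` yields `songlist_only=True` and `limit=5`, while an out-of-range value such as `--workers 11` makes argparse exit with a usage error.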

🧠 Logic Highlights

  • Tracking: All downloads, statuses, and formats are tracked in JSON files for reliability and deduplication.
  • Songlist: Loads and normalizes data/songList.json, matches against available videos, and prioritizes or restricts downloads accordingly.
  • Batch/Caching: Channel video lists are cached to minimize API calls; tracking is batch-saved for performance.
  • ID3 Tagging: Artist/title extracted from video title and embedded in MP4 files.
  • Cleanup: Extra files from yt-dlp (e.g., .info.json) are automatically removed after download.
  • Reset/Clear: Use --reset-channel to reset all tracking and files for a channel (optionally including songlist songs with --reset-songlist). Use --clear-cache to clear cached video lists for a channel or all channels.
  • Channel-Specific Parsing: Uses data/channels.json to define parsing rules for each YouTube channel, handling different video title formats (e.g., "Artist - Title", "Artist Title", "Title | Artist", etc.).
  • Manual Video Collection: Static video management system using data/manual_videos.json for individual karaoke videos that don't belong to regular channels. Accessible via --manual parameter.
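
The cross-channel deduplication that tracking relies on (unique keys built from artist + normalized title) could look roughly like the sketch below. The normalization rules and the names `normalize`, `song_key`, and `dedupe_across_channels` are illustrative assumptions, as is the dictionary shape of each video entry.

```python
import re
from typing import Dict, Iterable, List, Set, Tuple


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def song_key(artist: str, title: str) -> Tuple[str, str]:
    """Build the unique key used to recognize the same song on several channels."""
    return (normalize(artist), normalize(title))


def dedupe_across_channels(videos: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first video seen for each song key and skip later duplicates."""
    seen: Set[Tuple[str, str]] = set()
    unique: List[Dict[str, str]] = []
    for video in videos:
        key = song_key(video["artist"], video["title"])
        if key in seen:
            continue  # already planned from an earlier channel
        seen.add(key)
        unique.append(video)
    return unique
```

Because channels are scanned in order, keeping the first occurrence means the earliest channel in the list wins when a song appears on more than one.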

🔧 Refactoring Improvements (v3.3)

The codebase has been comprehensively refactored to improve maintainability and reduce code duplication. Recent improvements have enhanced reliability, performance, and code organization:

New Utility Modules (v3.3)

  • file_utils.py: Centralized file operations, filename sanitization, and file validation

    • sanitize_filename(): Create safe filenames from artist/title
    • generate_possible_filenames(): Generate filename patterns for different modes
    • check_file_exists_with_patterns(): Check for existing files using multiple patterns
    • is_valid_mp4_file(): Validate MP4 files with header checking
    • cleanup_temp_files(): Remove temporary yt-dlp files
    • ensure_directory_exists(): Safe directory creation
  • song_validator.py: Centralized song validation logic

    • SongValidator class: Unified logic for checking if songs should be downloaded
    • should_skip_song(): Comprehensive validation with multiple criteria
    • mark_song_failed(): Consistent failure tracking
    • handle_download_failure(): Standardized error handling
  • Enhanced config_manager.py: Robust configuration management with dataclasses

    • ConfigManager class: Type-safe configuration loading and caching
    • DownloadSettings, FolderStructure, LoggingConfig dataclasses
    • Configuration validation and merging with defaults
    • Dynamic resolution updates
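
To make the file_utils.py helpers more concrete, here is a hedged sketch of what `sanitize_filename()` and `is_valid_mp4_file()` might do. The signatures, the "Artist - Title.mp4" naming pattern, and the exact validation rule are assumptions; the header check relies on the common case where an MP4 file starts with an `ftyp` box, which is a heuristic rather than a full container validation.

```python
import re
from pathlib import Path

# Characters that Windows does not allow in filenames, plus control characters.
_INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def sanitize_filename(artist: str, title: str, max_length: int = 200) -> str:
    """Build a Windows-safe 'Artist - Title.mp4' filename (assumed pattern)."""
    name = f"{artist} - {title}" if artist else title
    name = _INVALID_CHARS.sub("", name)
    name = re.sub(r"\s+", " ", name).strip(" .")
    # Truncate, then strip again so the stem never ends in a space or dot.
    name = name[:max_length].rstrip(" .")
    return name + ".mp4"


def is_valid_mp4_file(path: Path, min_size: int = 12) -> bool:
    """Heuristic check: the file exists and bytes 4-8 spell 'ftyp'."""
    try:
        if not path.is_file() or path.stat().st_size < min_size:
            return False
        with path.open("rb") as handle:
            header = handle.read(12)
    except OSError:
        return False
    return header[4:8] == b"ftyp"
```

A partially downloaded or corrupt file often fails the header or size check, which is one way a pipeline could decide to re-download instead of skipping.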

Benefits Achieved

  • Eliminated Code Duplication: ~150 lines of duplicate code removed across modules
  • Centralized File Operations: Single source of truth for filename handling and file validation
  • Unified Song Validation: Consistent logic for checking if songs should be downloaded
  • Enhanced Type Safety: Comprehensive type hints across all new modules
  • Improved Configuration Management: Structured configuration with validation and caching
  • Better Error Handling: Consistent patterns via centralized utilities
  • Enhanced Maintainability: Changes to file operations or song validation only require updates in one place
  • Improved Testability: Modular components can be tested independently
  • Better Developer Experience: Clear function signatures and comprehensive documentation

Previous Improvements (v3.2)

  • Centralized yt-dlp Command Generation: Standardized command building and execution across all download operations
  • Enhanced Error Handling: Structured exception hierarchy with consistent error messages and formatting
  • Abstracted Download Pipeline: Reusable download → verify → tag → track process for consistent processing
  • Download plan pre-scan: Before downloading, the tool scans all channels for songlist matches, builds a download plan, and prints stats (matches, unmatched, per-channel breakdown). The plan is cached for 1 day and reused unless --force-download-plan is set.
  • Latest-per-channel plan: Download the latest N videos from each channel, with a per-channel plan and robust resume. Each channel is removed from the plan as it completes. Plan cache is deleted when all channels are done.
  • Fast mode with early exit: When a limit is set, the tool scans channels and songs in order, downloads immediately when a match is found, and stops as soon as the limit is reached with successful downloads. This provides much faster performance for small limits compared to the full pre-scan approach.
  • Deduplication across channels: Tracks unique song keys (artist + normalized title) to ensure the same song is not downloaded from multiple channels, even if it appears in more than one channel's video list.
  • Fuzzy matching: Uses string similarity algorithms to find approximate matches between songlist entries and video titles, tolerating minor differences, typos, or extra words like "Karaoke" or "Official Video".
  • Default channel file: For songlist-only and latest-per-channel modes, if no --file is specified, automatically uses data/channels.txt as the default channel list, reducing the need to specify the file path repeatedly.
  • Robust interruption handling: Progress is saved after each download, and files are checked for existence before downloading to prevent re-downloads if the process is interrupted.
  • Optimized scanning algorithm: High-performance channel scanning that avoids the naive O(n×m) pairwise comparison where possible by pre-processing song lookups into sets and dictionaries, with early termination for faster matching of large songlists and channels.
  • Enhanced cache management: Improved channel cache key handling for better cache hit rates and reduced YouTube API calls.
  • Robust download plan execution: Fixed index management in download plan execution to prevent errors during interrupted downloads.
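
The fuzzy matching behavior described above (rapidfuzz when available, difflib otherwise, with a 0-100 threshold defaulting to 85) can be sketched as follows. The helper names and the list of noise phrases are assumptions for illustration; the real fuzzy_matcher.py may normalize titles differently, and the two libraries can return slightly different scores for the same pair.

```python
import re
from difflib import SequenceMatcher

try:
    from rapidfuzz import fuzz  # optional dependency

    def similarity(a: str, b: str) -> float:
        """Similarity on a 0-100 scale using rapidfuzz."""
        return fuzz.ratio(a, b)
except ImportError:

    def similarity(a: str, b: str) -> float:
        """Fallback: difflib ratio scaled from 0-1 to 0-100."""
        return SequenceMatcher(None, a, b).ratio() * 100


# Extra words to ignore; this list is an assumption for the sketch.
_NOISE_PHRASES = ("karaoke version", "official video", "karaoke")


def normalize_title(text: str) -> str:
    """Lowercase, drop punctuation and noise phrases, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    for phrase in _NOISE_PHRASES:
        text = text.replace(phrase, " ")
    return " ".join(text.split())


def is_fuzzy_match(song: str, video_title: str, threshold: float = 85) -> bool:
    """True when the normalized strings are at least `threshold` similar."""
    return similarity(normalize_title(song), normalize_title(video_title)) >= threshold
```

Lowering the threshold tolerates more typos and extra words at the cost of more false positives, which is why it is exposed as --fuzzy-threshold.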

New Parallel Download System (v3.4)

  • Parallel downloader module: parallel_downloader.py provides thread-safe concurrent download management
  • Configurable concurrency: Use --parallel to enable parallel downloads with 3 workers by default, or --parallel --workers N for custom worker count (1-10)
  • Thread-safe operations: All tracking, caching, and progress operations are thread-safe
  • Real-time progress tracking: Shows active downloads, completion status, and overall progress
  • Automatic retry mechanism: Failed downloads are automatically retried with reduced concurrency
  • Backward compatibility: Sequential downloads remain the default when --parallel is not used
  • Performance improvements: Significantly faster downloads for large batches (3-5x speedup with 3-5 workers)
  • Integrated with all modes: Works with both songlist-across-channels and latest-per-channel download modes
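
A minimal sketch of the parallel pattern above, using the standard library's `ThreadPoolExecutor` with a lock around shared status and one retry pass at reduced concurrency. This is an assumption about the general approach, not the contents of parallel_downloader.py; `download_all` and the `download_one` callback are hypothetical names, and the callback stands in for the real yt-dlp invocation.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List


def download_all(urls: Iterable[str], download_one: Callable[[str], bool],
                 workers: int = 3) -> Dict[str, str]:
    """Download URLs concurrently, then retry failures once with fewer workers."""
    if not 1 <= workers <= 10:
        raise ValueError("workers must be between 1 and 10")

    status: Dict[str, str] = {}
    lock = threading.Lock()  # guards the shared status dictionary

    def run(url: str) -> None:
        try:
            ok = download_one(url)
        except Exception:
            ok = False  # treat any error as a failed download
        with lock:
            status[url] = "downloaded" if ok else "failed"

    def run_batch(batch: List[str], max_workers: int) -> None:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = [pool.submit(run, url) for url in batch]
            for future in as_completed(futures):
                future.result()

    run_batch(list(urls), workers)
    failed = [url for url, state in status.items() if state == "failed"]
    if failed:
        # Single retry pass with reduced concurrency.
        run_batch(failed, max(1, workers // 2))
    return status
```

Keeping all shared writes behind one lock is the simplest way to stay thread-safe; a real implementation would also need to protect the JSON tracking files and progress output.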

🚀 Future Enhancements

  • Web UI for easier management
  • More advanced song matching (multi-language)
  • Download scheduling and retry logic
  • More granular status reporting
  • Parallel downloads for improved speed (COMPLETED)
  • Enhanced fuzzy matching with improved video title parsing (COMPLETED)
  • Consolidated extract_artist_title function (COMPLETED)
  • Duplicate file prevention and filename consistency (COMPLETED)
  • Unit tests for all modules
  • Integration tests for end-to-end workflows
  • Plugin system for custom file operations
  • Advanced configuration UI
  • Real-time download progress visualization

🔧 Recent Bug Fixes & Improvements (v3.4.1)

Enhanced Fuzzy Matching (v3.4.1)

  • Improved extract_artist_title function: Enhanced to handle multiple video title formats beyond simple "Artist - Title" patterns
    • "Title Karaoke | Artist Karaoke Version" format: Correctly parses titles like "Hold On Loosely Karaoke | 38 Special Karaoke Version"
    • "Title Artist KARAOKE" format: Handles titles ending with "KARAOKE" and attempts to extract artist information
    • Fallback handling: Returns empty artist and full title for unparseable formats
  • Consolidated function usage: Removed duplicate extract_artist_title implementations across modules
    • Single source of truth: All modules now import from fuzzy_matcher.py
    • Consistent parsing: Eliminated inconsistencies between different parsing implementations
    • Better maintainability: Changes to parsing logic only need to be made in one place
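
The title formats listed above can be illustrated with a simplified parser. This sketch shares only the name `extract_artist_title` with the real function in fuzzy_matcher.py; its rules are assumptions, and it deliberately leaves out the "Title Artist KARAOKE" format, which cannot be split reliably without extra knowledge about the artist.

```python
import re
from typing import Tuple

# Trailing "Karaoke" or "Karaoke Version", optionally wrapped in parentheses.
_KARAOKE_SUFFIX = re.compile(
    r"\s*\(?\b(karaoke version|karaoke)\b\)?\s*$", re.IGNORECASE
)


def _strip_karaoke(text: str) -> str:
    """Remove a trailing karaoke marker from one part of the title."""
    return _KARAOKE_SUFFIX.sub("", text).strip()


def extract_artist_title(video_title: str) -> Tuple[str, str]:
    """Return (artist, title); artist is empty when the format is not recognized."""
    title = video_title.strip()
    if "|" in title:
        # "Title Karaoke | Artist Karaoke Version"
        left, right = (part.strip() for part in title.split("|", 1))
        return _strip_karaoke(right), _strip_karaoke(left)
    if " - " in title:
        # "Artist - Title"
        artist, song = title.split(" - ", 1)
        return artist.strip(), _strip_karaoke(song)
    # Fallback: empty artist and the full title.
    return "", title
```

For the example from the list, `extract_artist_title("Hold On Loosely Karaoke | 38 Special Karaoke Version")` returns `("38 Special", "Hold On Loosely")` in this sketch.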

Fixed Import Conflicts

  • Resolved import conflict in download_planner.py: Updated to use the enhanced extract_artist_title from fuzzy_matcher.py instead of the simpler version from id3_utils.py
  • Updated id3_utils.py: Now imports extract_artist_title from fuzzy_matcher.py for consistency

Enhanced --limit Parameter

  • Fixed limit application: The --limit parameter now correctly applies to the scanning phase, not just the download execution
  • Improved performance: When using --limit N, only the first N songs are scanned against channels, significantly reducing processing time for large songlists

Benefits of Recent Improvements

  • Better matching accuracy: Enhanced fuzzy matching can now handle a wider variety of video title formats commonly found on YouTube karaoke channels
  • Reduced false negatives: Songs that previously couldn't be matched due to title format differences now have a higher chance of being found
  • Consistent behavior: All parts of the system use the same parsing logic, eliminating edge cases where different modules would parse the same title differently
  • Improved performance: The --limit parameter now works as expected, providing faster processing for targeted downloads
  • Cleaner codebase: Eliminated duplicate code and import conflicts, making the system more maintainable

🔧 Recent Bug Fixes & Improvements (v3.4.2)

Duplicate File Prevention & Filename Consistency

  • Enhanced file existence checking: check_file_exists_with_patterns() now detects files with (2), (3), etc. suffixes that yt-dlp creates
  • Automatic duplicate prevention: Download pipeline skips downloads when files already exist (including duplicates)
  • Updated yt-dlp configuration: Set "nooverwrites": false to prevent yt-dlp from creating duplicate files with suffixes
  • Cleanup utility: data/cleanup_duplicate_files.py provides interactive cleanup of existing duplicate files
  • Filename vs ID3 tag consistency: Removed "(Karaoke Version)" suffix from ID3 tags to match filenames exactly
  • Unified parsing: Both filename generation and ID3 tagging use the same artist/title extraction logic
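
The suffix-aware existence check can be sketched as below. The function name `find_existing_download` is hypothetical (the real helper is check_file_exists_with_patterns(), whose signature is not shown here), and the assumed duplicate format is the base name followed by an optional space and a number in parentheses, such as "Song (2).mp4".

```python
import re
from pathlib import Path
from typing import Optional


def find_existing_download(folder: Path, base_name: str,
                           extension: str = ".mp4") -> Optional[Path]:
    """Return an existing 'base.mp4' or 'base (N).mp4' file, or None if absent."""
    if not folder.is_dir():
        return None
    # Match the exact base name, optionally followed by " (2)", " (3)", etc.
    pattern = re.compile(
        rf"^{re.escape(base_name)}(?: ?\(\d+\))?{re.escape(extension)}$",
        re.IGNORECASE,
    )
    for candidate in sorted(folder.iterdir()):
        if candidate.is_file() and pattern.match(candidate.name):
            return candidate
    return None
```

A pipeline step could call this before starting a download and skip the song when a match is returned, which is the behavior the "Skipping download - file already exists" log message reflects.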

Benefits of Duplicate Prevention

  • No more duplicate files: Eliminates (2), (3) suffix files that waste disk space
  • Consistent metadata: Filename and ID3 tag use identical artist/title format
  • Efficient disk usage: Prevents unnecessary downloads of existing files
  • Clear file identification: Consistent naming across all file operations

🛠️ Maintenance

Regular Cleanup

  • Run the cleanup utility periodically to remove any duplicate files
  • Monitor downloads for any new duplicate creation (should be rare with fixes)

Configuration

  • Keep "nooverwrites": false in data/config.json
  • This prevents yt-dlp from creating duplicate files

Monitoring

  • Check logs for "⏭️ Skipping download - file already exists" messages
  • These indicate the duplicate prevention is working correctly

🔧 Recent Bug Fixes & Improvements (v3.4.3)

Manual Video Collection System

  • New --manual parameter: Simple access to manual video collection via python download_karaoke.py --manual --limit 5
  • Static video management: data/manual_videos.json stores individual karaoke videos that don't belong to regular channels
  • Helper script: add_manual_video.py provides easy management of manual video entries
  • Full integration: Manual videos work with all existing features (songlist matching, fuzzy matching, parallel downloads, etc.)
  • No channel listing calls: Manual videos bypass YouTube API calls for video listing, using static data instead; the downloads themselves still go through yt-dlp

Channel-Specific Parsing Rules

  • JSON-based configuration: data/channels.json replaces data/channels.txt with structured channel configuration
  • Parsing rules per channel: Each channel can define custom parsing rules for video titles
  • Multiple format support: Handles various title formats like "Artist - Title", "Artist Title", "Title | Artist", etc.
  • Suffix cleanup: Automatic removal of common karaoke-related suffixes
  • Multi-artist support: Parsing for titles with multiple artists separated by specific delimiters
  • Backward compatibility: Still supports legacy data/channels.txt format

Benefits of New Features

  • Flexible video management: Easy addition of individual karaoke videos without creating new channels
  • Accurate parsing: Channel-specific rules ensure correct artist/title extraction for ID3 tags and filenames
  • Consistent metadata: Proper parsing prevents filename and ID3 tag inconsistencies
  • Easy maintenance: Simple JSON structure for managing both channels and manual videos
  • Full feature compatibility: Manual videos work seamlessly with existing download modes and features

📚 Documentation Standards

Documentation Location

  • All changes, refactoring, and improvements should be documented in the PRD.md and README.md files
  • Do NOT create additional .md files for documenting changes, refactoring, or improvements (CHANGELOG.md, listed below, is the exception)
  • Use the existing sections in PRD.md and README.md to track all project evolution

Where to Document Changes

  • PRD.md: Technical details, architecture changes, bug fixes, and implementation specifics
  • README.md: User-facing features, usage instructions, and high-level improvements
  • CHANGELOG.md: Version-specific release notes and change summaries

Documentation Requirements

  • All new features must be documented in both PRD.md and README.md
  • All refactoring efforts must be documented in the appropriate sections
  • All bug fixes must be documented with technical details
  • Version numbers and dates should be clearly marked
  • Benefits and improvements should be explicitly stated

Maintenance Responsibility

  • Keep PRD.md and README.md synchronized with code changes
  • Update documentation immediately when implementing new features
  • Remove outdated information and consolidate related changes
  • Ensure all CLI options and features are documented in both files