🎤 Karaoke Video Downloader PRD (v3.4.4)

Overview

A Python-based cross-platform CLI tool to download karaoke videos from YouTube channels/playlists using yt-dlp, with advanced tracking, songlist prioritization, and flexible configuration. Supports Windows and macOS with automatic platform detection. The codebase has been comprehensively refactored into a modular architecture with centralized utilities for improved maintainability, error handling, and code reuse.


🏗️ Architecture

The codebase has been refactored into focused modules with centralized utilities:

Core Modules:

  • downloader.py: Main orchestrator and CLI interface
  • video_downloader.py: Core video download execution and orchestration
  • tracking_manager.py: Download tracking and status management
  • download_planner.py: Download plan building and channel scanning
  • cache_manager.py: Cache operations and file I/O management
  • channel_manager.py: Channel and file management operations
  • songlist_manager.py: Songlist operations and tracking
  • server_manager.py: Server song availability checking
  • fuzzy_matcher.py: Fuzzy matching logic and similarity functions

Utility Modules (v3.2):

  • youtube_utils.py: Centralized YouTube operations and yt-dlp command generation
  • error_utils.py: Standardized error handling and formatting
  • download_pipeline.py: Abstracted download → verify → tag → track pipeline
  • id3_utils.py: ID3 tagging utilities
  • config_manager.py: Configuration management
  • resolution_cli.py: Resolution checking utilities
  • tracking_cli.py: Tracking management CLI

New Utility Modules (v3.3):

  • file_utils.py: Centralized file operations, filename sanitization, and file validation
  • song_validator.py: Centralized song validation logic for checking if songs should be downloaded

Benefits of Enhanced Modular Architecture:

  • Single Responsibility: Each module has a focused purpose
  • Centralized Utilities: Common operations (file operations, song validation, yt-dlp commands, error handling) are centralized
  • Reduced Duplication: Eliminated ~150 lines of code duplication across modules
  • Testability: Individual components can be tested separately
  • Maintainability: Easier to find and fix issues
  • Reusability: Components can be used independently
  • Robustness: Better error handling and interruption recovery
  • Consistency: Standardized error messages and processing pipelines
  • Type Safety: Comprehensive type hints across all new modules

📋 Goals

  • Download karaoke videos from YouTube channels or playlists.
  • Organize downloads by channel (or playlist) in subfolders.
  • Avoid re-downloading the same videos (robust tracking).
  • Prioritize and track a custom songlist across channels.
  • Allow flexible, user-friendly configuration.
  • Provide robust interruption handling and progress recovery.

🧑‍💻 Target Users

  • Karaoke DJs, home karaoke users, event hosts, or anyone needing offline karaoke video libraries.
  • Users comfortable with command-line tools.

⚙️ Platform & Stack

  • Platform: Windows, macOS
  • Interface: Command-line (CLI)
  • Tech Stack: Python 3.7+, yt-dlp (platform-specific binary), mutagen (for ID3 tagging)

📥 Input

  • YouTube channel or playlist URLs (e.g. https://www.youtube.com/@SingKingKaraoke/videos)
  • Optional: data/channels.txt file with multiple channel URLs (one per line) - now defaults to this file if not specified
  • Optional: data/songList.json for prioritized song downloads

Example Usage

python download_karaoke.py https://www.youtube.com/@SingKingKaraoke/videos
python download_karaoke.py --songlist-only --limit 5
python download_karaoke.py --latest-per-channel --limit 3
python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist
python download_karaoke.py --clear-cache SingKingKaraoke

📤 Output

  • MP4 files in downloads/<ChannelName>/ subfolders
  • All videos tracked in data/karaoke_tracking.json
  • Songlist progress tracked in data/songlist_tracking.json
  • Logs in logs/

🛠️ Features

  • Channel-based downloads (with per-channel folders)
  • Robust JSON tracking (downloaded, partial, failed, etc.)
  • Batch saving and channel video caching for performance
  • Configurable download resolution and yt-dlp options (data/config.json)
  • Songlist integration: prioritize and track custom songlists
  • Songlist-only mode: download only songs from the songlist
  • Songlist focus mode: download only songs from specific playlists by title
  • Force download mode: bypass all existing file checks and re-download songs regardless of server duplicates or existing files
  • Global songlist tracking to avoid duplicates across channels
  • ID3 tagging for artist/title in MP4 files (mutagen)
  • Real-time progress and detailed logging
  • Automatic cleanup of extra yt-dlp files
  • Reset/clear channel tracking and files via CLI
  • Clear channel cache via CLI
  • Download plan pre-scan and caching: Before downloading, the tool pre-scans all channels for songlist matches, builds a download plan, and prints stats. The plan is cached for 1 day in data/download_plan_cache.json for fast resuming and reliability. Use --force-download-plan to force a refresh.
  • Latest-per-channel download: Download the latest N videos from each channel in a single batch, with a per-channel download plan, robust resume, and unique plan cache. Use --latest-per-channel and --limit N.
  • Fast mode with early exit: When a limit is set, the tool scans channels and songs in order, downloads immediately when a match is found, and stops as soon as the limit is reached with successful downloads. If a download fails, it continues scanning until the limit is satisfied or all channels are exhausted.
  • Deduplication across channels: Ensures the same song (by artist + normalized title) is not downloaded more than once, even if it appears in multiple channels. Tracks unique keys and skips duplicates.
  • Fuzzy matching: Optionally use fuzzy string matching for songlist-to-video matching with configurable threshold (0-100, default 85). Uses rapidfuzz if available, falls back to difflib.
  • Default channel file: If no --file is specified for songlist-only or latest-per-channel modes, automatically uses data/channels.txt as the default channel list.
  • Robust interruption handling: Progress is saved after each download, and files are checked for existence before downloading to prevent re-downloads if the process is interrupted.
  • Optimized scanning performance: High-performance channel scanning with O(n×m) complexity, pre-processed lookups, and early termination for faster matching of large songlists and channels.
  • Centralized yt-dlp command generation: Standardized command building and execution across all download operations
  • Enhanced error handling: Structured exception hierarchy with consistent error messages and formatting
  • Abstracted download pipeline: Reusable download → verify → tag → track process for consistent processing
  • Reduced code duplication: Eliminated duplicate code across modules through centralized utilities
  • Centralized file operations: Single source of truth for filename sanitization, file validation, and path operations
  • Centralized song validation: Unified logic for checking if songs should be downloaded across all modules
  • Enhanced configuration management: Structured configuration with dataclasses, type safety, and validation
  • Manual video collection: Static video collection system for managing individual karaoke videos that don't belong to regular channels. Use --manual to download from data/manual_videos.json.
  • Channel-specific parsing rules: JSON-based configuration for parsing video titles from different YouTube channels, with support for various title formats and cleanup rules.

📂 Folder Structure

KaraokeVideoDownloader/
├── karaoke_downloader/         # All core Python code and utilities
│   ├── downloader.py           # Main orchestrator and CLI interface
│   ├── cli.py                  # CLI entry point
│   ├── video_downloader.py     # Core video download execution and orchestration
│   ├── tracking_manager.py     # Download tracking and status management
│   ├── download_planner.py     # Download plan building and channel scanning
│   ├── cache_manager.py        # Cache operations and file I/O management
│   ├── channel_manager.py      # Channel and file management operations
│   ├── songlist_manager.py     # Songlist operations and tracking
│   ├── server_manager.py       # Server song availability checking
│   ├── fuzzy_matcher.py        # Fuzzy matching logic and similarity functions
│   ├── youtube_utils.py        # Centralized YouTube operations and yt-dlp commands
│   ├── error_utils.py          # Standardized error handling and formatting
│   ├── download_pipeline.py    # Abstracted download → verify → tag → track pipeline
│   ├── id3_utils.py            # ID3 tagging utilities
│   ├── config_manager.py       # Configuration management with dataclasses
│   ├── file_utils.py           # Centralized file operations and filename handling
│   ├── song_validator.py       # Centralized song validation logic
│   ├── check_resolution.py     # Resolution checker utility
│   ├── resolution_cli.py       # Resolution config CLI
│   └── tracking_cli.py         # Tracking management CLI
├── config/                   # Configuration files
│   └── config.json          # Main configuration file
├── data/                     # All tracking, cache, and songlist files
│   ├── karaoke_tracking.json
│   ├── songlist_tracking.json
│   ├── channel_cache.json
│   ├── channels.json          # Channel configuration with parsing rules
│   ├── manual_videos.json     # Manual video collection
│   └── songList.json
├── utilities/                # Utility scripts and tools
│   ├── add_manual_video.py  # Manual video management
│   ├── build_cache_from_raw.py # Cache building utility
│   ├── cleanup_duplicate_files.py # File cleanup utilities
│   ├── cleanup_recent_tracking.py # Tracking cleanup utilities
│   ├── deduplicate_songlist_tracking.py # Data deduplication
│   ├── fix_artist_name_format.py # Data cleanup utilities
│   ├── fix_artist_name_format_simple.py
│   ├── fix_code_quality.py  # Development tools
│   ├── reset_and_redownload.py # Maintenance utilities
│   └── songlist_report.py   # Reporting utilities
├── downloads/                 # All video output
│   └── [ChannelName]/         # Per-channel folders
├── logs/                      # Download logs
├── downloader/yt-dlp.exe      # yt-dlp binary (Windows)
├── downloader/yt-dlp_macos    # yt-dlp binary (macOS)
├── src/tests/                 # Test scripts
│   ├── test_macos.py         # macOS setup and functionality tests
│   └── test_platform.py      # Platform detection tests
├── download_karaoke.py        # Main entry point (thin wrapper)
├── README.md
├── PRD.md
├── requirements.txt
└── download_karaoke.bat       # (optional Windows launcher)

🚦 CLI Options (Summary)

  • --file <data/channels.txt>: Download from a list of channels (optional, defaults to data/channels.txt for songlist modes)
  • --songlist-priority: Prioritize songlist songs in download queue
  • --songlist-only: Download only songs from the songlist
  • --songlist-focus <PLAYLIST_TITLE1> <PLAYLIST_TITLE2>...: Focus on specific playlists by title (e.g., --songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100")
  • --songlist-file <FILE_PATH>: Custom songlist file path to use with --songlist-focus (default: data/songList.json)
  • --force: Force download from channels, bypassing all existing file checks and re-downloading if necessary
  • --songlist-status: Show songlist download progress
  • --limit <N>: Limit number of downloads (enables fast mode with early exit)
  • --resolution <720p|1080p|...>: Override resolution
  • --status: Show download/tracking status
  • --reset-channel <CHANNEL_NAME>: Reset all tracking and files for a channel
  • --reset-songlist: When used with --reset-channel, also reset songlist songs for this channel
  • --clear-cache <CHANNEL_ID|all>: Clear channel video cache for a specific channel or all
  • --force-download-plan: Force refresh the download plan cache (re-scan all channels for matches)
  • --latest-per-channel: Download the latest N videos from each channel (use with --limit)
  • --fuzzy-match: Enable fuzzy matching for songlist-to-video matching (uses rapidfuzz if available)
  • --fuzzy-threshold <N>: Fuzzy match threshold (0-100, default 85)
  • --parallel: Enable parallel downloads for improved speed
  • --workers <N>: Number of parallel download workers (1-10, default: 3, only used with --parallel)
  • --manual: Download from manual videos collection (data/manual_videos.json)
  • --channel-focus <CHANNEL_NAME>: Download from a specific channel by name (e.g., 'SingKingKaraoke')
  • --all-videos: Download all videos from channel (not just songlist matches), skipping existing files and songs in songs.json
  • --dry-run: Build download plan and show what would be downloaded without actually downloading anything
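
A few of the options above could be declared with argparse along these lines. This is only a sketch of how the documented defaults and the 1-10 worker range might be enforced; `build_parser` and `worker_count` are hypothetical names, and the real cli.py may be organised differently.

```python
import argparse


def worker_count(value: str) -> int:
    """argparse type: accept only integers in the documented 1-10 range."""
    n = int(value)
    if not 1 <= n <= 10:
        raise argparse.ArgumentTypeError("workers must be between 1 and 10")
    return n


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="download_karaoke.py")
    parser.add_argument("url", nargs="?", help="channel or playlist URL")
    parser.add_argument("--file", default="data/channels.txt")
    parser.add_argument("--songlist-only", action="store_true")
    parser.add_argument("--limit", type=int, default=None)
    parser.add_argument("--fuzzy-match", action="store_true")
    parser.add_argument("--fuzzy-threshold", type=int, default=85)
    parser.add_argument("--parallel", action="store_true")
    parser.add_argument("--workers", type=worker_count, default=3)
    parser.add_argument("--dry-run", action="store_true")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--songlist-only", "--limit", "5", "--parallel"])
print(args.limit, args.workers, args.fuzzy_threshold)
```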

🧠 Logic Highlights

  • Tracking: All downloads, statuses, and formats are tracked in JSON files for reliability and deduplication.
  • Songlist: Loads and normalizes data/songList.json, matches against available videos, and prioritizes or restricts downloads accordingly.
  • Batch/Caching: Channel video lists are cached to minimize API calls; tracking is batch-saved for performance.
  • ID3 Tagging: Artist/title extracted from video title and embedded in MP4 files.
  • Cleanup: Extra files from yt-dlp (e.g., .info.json) are automatically removed after download.
  • Reset/Clear: Use --reset-channel to reset all tracking and files for a channel (optionally including songlist songs with --reset-songlist). Use --clear-cache to clear cached video lists for a channel or all channels.
  • Channel-Specific Parsing: Uses data/channels.json to define parsing rules for each YouTube channel, handling different video title formats (e.g., "Artist - Title", "Artist Title", "Title | Artist", etc.).
  • Manual Video Collection: Static video management system using data/manual_videos.json for individual karaoke videos that don't belong to regular channels. Accessible via --manual parameter.
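
As an illustration of the cleanup step above, the sketch below removes leftover sidecar files from a download folder. Only `.info.json` is named in this document; the other suffix in the tuple and the function name `cleanup_sidecar_files` are assumptions.

```python
from pathlib import Path

# Assumed suffix list: only ".info.json" is mentioned in this document.
SIDECAR_SUFFIXES = (".info.json", ".ytdl")


def cleanup_sidecar_files(folder: Path) -> list:
    """Delete leftover yt-dlp sidecar files in `folder`; return removed names."""
    removed = []
    for path in folder.iterdir():
        if path.is_file() and path.name.endswith(SIDECAR_SUFFIXES):
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```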

🔧 Refactoring Improvements (v3.3)

The codebase has been comprehensively refactored to improve maintainability and reduce code duplication. Recent improvements have enhanced reliability, performance, and code organization:

New Utility Modules (v3.3)

  • file_utils.py: Centralized file operations, filename sanitization, and file validation

    • sanitize_filename(): Create safe filenames from artist/title
    • generate_possible_filenames(): Generate filename patterns for different modes
    • check_file_exists_with_patterns(): Check for existing files using multiple patterns
    • is_valid_mp4_file(): Validate MP4 files with header checking
    • cleanup_temp_files(): Remove temporary yt-dlp files
    • ensure_directory_exists(): Safe directory creation
  • song_validator.py: Centralized song validation logic

    • SongValidator class: Unified logic for checking if songs should be downloaded
    • should_skip_song(): Comprehensive validation with multiple criteria
    • mark_song_failed(): Consistent failure tracking
    • handle_download_failure(): Standardized error handling
  • Enhanced config_manager.py: Robust configuration management with dataclasses

    • ConfigManager class: Type-safe configuration loading and caching
    • DownloadSettings, FolderStructure, LoggingConfig dataclasses
    • Configuration validation and merging with defaults
    • Dynamic resolution updates
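
The dataclass-based configuration described above might look something like the following. The field names, defaults, and the `download_settings` JSON key are illustrative assumptions rather than the project's real schema; the point is the pattern of merging user values over defaults and validating the result.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class DownloadSettings:
    # Field names and defaults are illustrative, not the project's real schema.
    resolution: str = "720p"
    nooverwrites: bool = False
    max_retries: int = 3


def load_download_settings(raw_json: str) -> DownloadSettings:
    """Merge user JSON over the dataclass defaults, ignoring unknown keys."""
    user = json.loads(raw_json).get("download_settings", {})
    known = {f.name for f in fields(DownloadSettings)}
    settings = DownloadSettings(**{k: v for k, v in user.items() if k in known})
    # Minimal validation: resolutions are written like "720p" or "1080p".
    if not settings.resolution.endswith("p"):
        raise ValueError(f"invalid resolution: {settings.resolution!r}")
    return settings
```

Filtering on the dataclass fields keeps an unexpected key in the config file from raising a `TypeError` at load time.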

Benefits Achieved

  • Eliminated Code Duplication: ~150 lines of duplicate code removed across modules
  • Centralized File Operations: Single source of truth for filename handling and file validation
  • Unified Song Validation: Consistent logic for checking if songs should be downloaded
  • Enhanced Type Safety: Comprehensive type hints across all new modules
  • Improved Configuration Management: Structured configuration with validation and caching
  • Better Error Handling: Consistent patterns via centralized utilities
  • Enhanced Maintainability: Changes to file operations or song validation only require updates in one place
  • Improved Testability: Modular components can be tested independently
  • Better Developer Experience: Clear function signatures and comprehensive documentation

Previous Improvements (v3.2)

  • Centralized yt-dlp Command Generation: Standardized command building and execution across all download operations
  • Enhanced Error Handling: Structured exception hierarchy with consistent error messages and formatting
  • Abstracted Download Pipeline: Reusable download → verify → tag → track process for consistent processing
  • Download plan pre-scan: Before downloading, the tool scans all channels for songlist matches, builds a download plan, and prints stats (matches, unmatched, per-channel breakdown). The plan is cached for 1 day and reused unless --force-download-plan is set.
  • Latest-per-channel plan: Download the latest N videos from each channel, with a per-channel plan and robust resume. Each channel is removed from the plan as it completes. Plan cache is deleted when all channels are done.
  • Fast mode with early exit: When a limit is set, the tool scans channels and songs in order, downloads immediately when a match is found, and stops as soon as the limit is reached with successful downloads. This provides much faster performance for small limits compared to the full pre-scan approach.
  • Deduplication across channels: Tracks unique song keys (artist + normalized title) to ensure the same song is not downloaded from multiple channels, even if it appears in more than one channel's video list.
  • Fuzzy matching: Uses string similarity algorithms to find approximate matches between songlist entries and video titles, tolerating minor differences, typos, or extra words like "Karaoke" or "Official Video".
  • Default channel file: For songlist-only and latest-per-channel modes, if no --file is specified, automatically uses data/channels.txt as the default channel list, reducing the need to specify the file path repeatedly.
  • Robust interruption handling: Progress is saved after each download, and files are checked for existence before downloading to prevent re-downloads if the process is interrupted.
  • Optimized scanning algorithm: High-performance channel scanning with O(n×m) complexity, pre-processed song lookups using sets and dictionaries, and early termination for faster matching of large songlists and channels.
  • Enhanced cache management: Improved channel cache key handling for better cache hit rates and reduced YouTube API calls.
  • Robust download plan execution: Fixed index management in download plan execution to prevent errors during interrupted downloads.
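
The combination of pre-processed lookups, cross-channel deduplication, and early termination described above can be sketched as follows. This is an assumed simplification: videos are taken to be already parsed into (artist, title, video_id) tuples, and `normalize`, `song_key`, and `build_plan` are hypothetical names.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so keys compare reliably."""
    return " ".join(text.lower().split())


def song_key(artist: str, title: str) -> str:
    """Unique key for a song: artist plus normalized title."""
    return f"{normalize(artist)}|{normalize(title)}"


def build_plan(songlist, channels, limit=None):
    """Match songs against channel videos, skipping duplicates across channels.

    `songlist` is a list of (artist, title); `channels` maps a channel name to
    a list of (artist, title, video_id) tuples already parsed from titles.
    """
    wanted = {song_key(a, t) for a, t in songlist}  # set gives O(1) membership tests
    seen, plan = set(), []
    for channel, videos in channels.items():
        for artist, title, video_id in videos:
            key = song_key(artist, title)
            if key in wanted and key not in seen:
                seen.add(key)
                plan.append({"video_id": video_id, "channel_name": channel,
                             "artist": artist, "title": title})
                if limit is not None and len(plan) >= limit:
                    return plan  # early termination once the limit is reached
    return plan
```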

New Parallel Download System (v3.4)

  • Parallel downloader module: parallel_downloader.py provides thread-safe concurrent download management
  • Configurable concurrency: Use --parallel to enable parallel downloads with 3 workers by default, or --parallel --workers N for custom worker count (1-10)
  • Thread-safe operations: All tracking, caching, and progress operations are thread-safe
  • Real-time progress tracking: Shows active downloads, completion status, and overall progress
  • Automatic retry mechanism: Failed downloads are automatically retried with reduced concurrency
  • Backward compatibility: Sequential downloads remain the default when --parallel is not used
  • Performance improvements: Significantly faster downloads for large batches (3-5x speedup with 3-5 workers)
  • Integrated with all modes: Works with both songlist-across-channels and latest-per-channel download modes
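
A thread pool with a lock around shared state is one common way to get the thread-safe behaviour listed above. The sketch below is not parallel_downloader.py; `download_all` and the caller-supplied `download_one` callable are assumptions, and the retry-with-reduced-concurrency step is left out for brevity.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(plan, download_one, workers=3):
    """Run `download_one(entry)` for each plan entry on a thread pool.

    Returns (succeeded, failed) lists of video IDs. `download_one` is a
    caller-supplied callable returning True on success.
    """
    workers = max(1, min(10, workers))  # clamp to the documented 1-10 range
    lock = threading.Lock()             # guards the shared result lists
    succeeded, failed = [], []

    def run(entry):
        try:
            ok = download_one(entry)
        except Exception:
            ok = False                  # treat any error as a failed download
        with lock:
            (succeeded if ok else failed).append(entry["video_id"])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run, entry) for entry in plan]
        for future in as_completed(futures):
            future.result()             # surface unexpected errors from run()
    return sorted(succeeded), sorted(failed)
```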

🚀 Future Enhancements

  • Web UI for easier management
  • More advanced song matching (multi-language)
  • Download scheduling and retry logic
  • More granular status reporting
  • Parallel downloads for improved speed (completed)
  • Enhanced fuzzy matching with improved video title parsing (completed)
  • Consolidated extract_artist_title function (completed)
  • Duplicate file prevention and filename consistency (completed)
  • Unit tests for all modules
  • Integration tests for end-to-end workflows
  • Plugin system for custom file operations
  • Advanced configuration UI
  • Real-time download progress visualization

🔧 Recent Bug Fixes & Improvements (v3.4.1)

Enhanced Fuzzy Matching (v3.4.1)

  • Improved extract_artist_title function: Enhanced to handle multiple video title formats beyond simple "Artist - Title" patterns
    • "Title Karaoke | Artist Karaoke Version" format: Correctly parses titles like "Hold On Loosely Karaoke | 38 Special Karaoke Version"
    • "Title Artist KARAOKE" format: Handles titles ending with "KARAOKE" and attempts to extract artist information
    • Fallback handling: Returns empty artist and full title for unparseable formats
  • Consolidated function usage: Removed duplicate extract_artist_title implementations across modules
    • Single source of truth: All modules now import from fuzzy_matcher.py
    • Consistent parsing: Eliminated inconsistencies between different parsing implementations
    • Better maintainability: Changes to parsing logic only need to be made in one place
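
To make the title formats above concrete, here is a minimal parsing sketch. It is a stand-in written for illustration, not the project's extract_artist_title; the name `parse_video_title` and the suffix pattern are assumptions, and the "Title Artist KARAOKE" format is not handled.

```python
import re

# Trailing karaoke markers stripped from either half of a title (assumed pattern).
_SUFFIX = re.compile(r"\s*\(?karaoke(\s+version)?\)?\s*$", re.IGNORECASE)


def _clean(text: str) -> str:
    """Remove a trailing 'Karaoke' / '(Karaoke Version)' marker and trim."""
    return _SUFFIX.sub("", text).strip()


def parse_video_title(video_title: str):
    """Return (artist, title); artist is '' when the format is not recognized."""
    if "|" in video_title:            # "Title Karaoke | Artist Karaoke Version"
        title, artist = video_title.split("|", 1)
        return _clean(artist), _clean(title)
    if " - " in video_title:          # "Artist - Title"
        artist, title = video_title.split(" - ", 1)
        return _clean(artist), _clean(title)
    return "", video_title.strip()    # fallback: empty artist, full title
```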

Fixed Import Conflicts

  • Resolved import conflict in download_planner.py: Updated to use the enhanced extract_artist_title from fuzzy_matcher.py instead of the simpler version from id3_utils.py
  • Updated id3_utils.py: Now imports extract_artist_title from fuzzy_matcher.py for consistency

Enhanced --limit Parameter

  • Fixed limit application: The --limit parameter now correctly applies to the scanning phase, not just the download execution
  • Improved performance: When using --limit N, only the first N songs are scanned against channels, significantly reducing processing time for large songlists

Benefits of Recent Improvements

  • Better matching accuracy: Enhanced fuzzy matching can now handle a wider variety of video title formats commonly found on YouTube karaoke channels
  • Reduced false negatives: Songs that previously couldn't be matched due to title format differences now have a higher chance of being found
  • Consistent behavior: All parts of the system use the same parsing logic, eliminating edge cases where different modules would parse the same title differently
  • Improved performance: The --limit parameter now works as expected, providing faster processing for targeted downloads
  • Cleaner codebase: Eliminated duplicate code and import conflicts, making the system more maintainable

🔧 Recent Bug Fixes & Improvements (v3.4.2)

Duplicate File Prevention & Filename Consistency

  • Enhanced file existence checking: check_file_exists_with_patterns() now detects files with (2), (3), etc. suffixes that yt-dlp creates
  • Automatic duplicate prevention: Download pipeline skips downloads when files already exist (including duplicates)
  • Updated yt-dlp configuration: Set "nooverwrites": false to prevent yt-dlp from creating duplicate files with suffixes
  • Cleanup utility: utilities/cleanup_duplicate_files.py provides interactive cleanup of existing duplicate files
  • Filename vs ID3 tag consistency: Removed "(Karaoke Version)" suffix from ID3 tags to match filenames exactly
  • Unified parsing: Both filename generation and ID3 tagging use the same artist/title extraction logic

Benefits of Duplicate Prevention

  • No more duplicate files: Eliminates (2), (3) suffix files that waste disk space
  • Consistent metadata: Filename and ID3 tag use identical artist/title format
  • Efficient disk usage: Prevents unnecessary downloads of existing files
  • Clear file identification: Consistent naming across all file operations

🛠️ Maintenance

Regular Cleanup

  • Run the cleanup utility periodically to remove any duplicate files
  • Monitor downloads for any new duplicate creation (should be rare with fixes)

Configuration

  • Keep "nooverwrites": false in data/config.json
  • This prevents yt-dlp from creating duplicate files

Monitoring

  • Check logs for "⏭️ Skipping download - file already exists" messages
  • These indicate the duplicate prevention is working correctly

🔧 Recent Bug Fixes & Improvements (v3.4.3)

Manual Video Collection System

  • New --manual parameter: Simple access to manual video collection via python download_karaoke.py --manual --limit 5
  • Static video management: data/manual_videos.json stores individual karaoke videos that don't belong to regular channels
  • Helper script: add_manual_video.py provides easy management of manual video entries
  • Full integration: Manual videos work with all existing features (songlist matching, fuzzy matching, parallel downloads, etc.)
  • No channel listing required: Manual videos bypass YouTube API calls for video listing, using static data instead

Channel-Specific Parsing Rules

  • JSON-based configuration: data/channels.json replaces data/channels.txt with structured channel configuration
  • Parsing rules per channel: Each channel can define custom parsing rules for video titles
  • Multiple format support: Handles various title formats like "Artist - Title", "Artist Title", "Title | Artist", etc.
  • Suffix cleanup: Automatic removal of common karaoke-related suffixes
  • Multi-artist support: Parsing for titles with multiple artists separated by specific delimiters
  • Backward compatibility: Still supports legacy data/channels.txt format

Benefits of New Features

  • Flexible video management: Easy addition of individual karaoke videos without creating new channels
  • Accurate parsing: Channel-specific rules ensure correct artist/title extraction for ID3 tags and filenames
  • Consistent metadata: Proper parsing prevents filename and ID3 tag inconsistencies
  • Easy maintenance: Simple JSON structure for managing both channels and manual videos
  • Full feature compatibility: Manual videos work seamlessly with existing download modes and features

📚 Documentation Standards

Documentation Location

  • All changes, refactoring, and improvements should be documented in the PRD.md and README.md files
  • Do NOT create separate .md files for documenting changes, refactoring, or improvements
  • Use the existing sections in PRD.md and README.md to track all project evolution

Where to Document Changes

  • PRD.md: Technical details, architecture changes, bug fixes, and implementation specifics
  • README.md: User-facing features, usage instructions, and high-level improvements
  • CHANGELOG.md: Version-specific release notes and change summaries

Documentation Requirements

  • All new features must be documented in both PRD.md and README.md
  • All refactoring efforts must be documented in the appropriate sections
  • All bug fixes must be documented with technical details
  • Version numbers and dates should be clearly marked
  • Benefits and improvements should be explicitly stated

Maintenance Responsibility

  • Keep PRD.md and README.md synchronized with code changes
  • Update documentation immediately when implementing new features
  • Remove outdated information and consolidate related changes
  • Ensure all CLI options and features are documented in both files

🔧 Recent Bug Fixes & Improvements (v3.4.4)

All Videos Download Mode

  • New --all-videos parameter: Download all videos from a channel, not just songlist matches
  • Smart MP3/MP4 detection: Automatically detects if you have MP3 versions in songs.json and downloads MP4 video versions
  • Existing file skipping: Skips videos that already exist on the filesystem
  • Progress tracking: Shows clear progress with "Downloading X/Y videos" format
  • Parallel processing support: Works with --parallel --workers N for faster downloads
  • Channel focus integration: Works with --channel-focus to target specific channels
  • Limit support: Works with --limit N to control download batch size

Smart Songlist Integration

  • MP4 version detection: Checks if MP4 version already exists in songs.json before downloading
  • MP3 upgrade path: Downloads MP4 video versions when only MP3 versions exist in songlist
  • Duplicate prevention: Skips downloads when MP4 versions already exist
  • Efficient filtering: Only processes videos that need to be downloaded
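
The MP3-to-MP4 upgrade rule above boils down to a small predicate. In the sketch below, the layout of songs.json (a list of entries with "artist", "title", and "path" keys) is an assumption, as is the name `needs_video_download`; the real file may be structured differently.

```python
from pathlib import PurePath


def needs_video_download(artist, title, server_songs):
    """True unless an MP4 for this song is already listed on the server.

    `server_songs` is assumed to be a list of dicts with "artist", "title"
    and "path" keys; the real songs.json layout may differ.
    """
    wanted = (artist.lower().strip(), title.lower().strip())
    for song in server_songs:
        key = (song["artist"].lower().strip(), song["title"].lower().strip())
        if key == wanted and PurePath(song["path"]).suffix.lower() == ".mp4":
            return False  # MP4 already present: skip the download
    return True           # missing, or only an MP3 exists: download the video
```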

Benefits of All Videos Mode

  • Complete channel downloads: Download entire channels without songlist restrictions
  • Automatic format upgrading: Upgrade MP3 collections to MP4 video versions
  • Efficient processing: Only downloads videos that don't already exist
  • Flexible control: Use with limits, parallel processing, and channel targeting
  • Clear progress feedback: Real-time progress tracking for large downloads

🔧 Recent Bug Fixes & Improvements (v3.4.5)

Unified Download Workflow Architecture

  • Unified execution pipeline: All download modes now use the same execution workflow, eliminating inconsistencies and broken pipelines
  • Consistent behavior: All modes (--channel-focus, --all-videos, --songlist-only, --latest-per-channel) use identical download execution, progress tracking, and error handling
  • Centralized download logic: Single execute_unified_download_workflow() method handles all download execution
  • Automatic parallel support: All download modes automatically support --parallel --workers N without additional implementation
  • Unified cache management: Consistent progress tracking and resume functionality across all modes

Architecture Pattern for New Download Modes

When adding new download modes in the future, follow this pattern to ensure consistency:

1. Download Plan Building (Mode-Specific)

Each download mode should build a download plan (list of videos to download) with this structure:

download_plan = [
    {
        "video_id": "video_id",
        "artist": "artist_name", 
        "title": "song_title",
        "filename": "sanitized_filename.mp4",
        "channel_name": "channel_name",
        "video_title": "original_video_title",
        "force_download": False
    }
]
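
Because every mode hands the same structure to the shared workflow, a small validation helper can catch malformed entries early. The following is a sketch only; `REQUIRED_FIELDS` and `validate_plan` are hypothetical names and are not part of the documented API.

```python
# Keys and types taken from the download plan structure shown above.
REQUIRED_FIELDS = {
    "video_id": str,
    "artist": str,
    "title": str,
    "filename": str,
    "channel_name": str,
    "video_title": str,
    "force_download": bool,
}


def validate_plan(plan):
    """Raise ValueError if any entry is missing a key or has a wrong type."""
    for index, entry in enumerate(plan):
        missing = REQUIRED_FIELDS.keys() - entry.keys()
        if missing:
            raise ValueError(f"entry {index} is missing {sorted(missing)}")
        for key, kind in REQUIRED_FIELDS.items():
            if not isinstance(entry[key], kind):
                raise ValueError(f"entry {index}: {key!r} should be {kind.__name__}")
    return plan
```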

2. Unified Execution (Shared)

All modes should use the unified execution workflow:

downloaded_count, success = self.execute_unified_download_workflow(
    download_plan=download_plan,
    cache_file=cache_file,  # Optional, for progress tracking
    limit=limit,            # Optional, for limiting downloads
    show_progress=True,     # Optional, for progress display
)

3. Execution Method Selection (Automatic)

The unified workflow automatically chooses execution method based on settings:

  • Sequential: Uses DownloadPipeline for single-threaded downloads
  • Parallel: Uses ParallelDownloader when --parallel is enabled
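
The selection logic can be pictured as a thin dispatcher in front of two runners. The sketch below is an assumption about the general shape, not the body of execute_unified_download_workflow(): `execute_workflow`, `run_sequential`, and `parallel_runner` are hypothetical names, and `success` is taken to mean that the requested number of downloads was reached. Only successful downloads count toward the limit, as in the fast mode described earlier.

```python
def run_sequential(plan, download_one, limit=None):
    """Download entries in order; only successes count toward `limit`."""
    downloaded = 0
    for entry in plan:
        if limit is not None and downloaded >= limit:
            break
        if download_one(entry):
            downloaded += 1
    return downloaded


def execute_workflow(plan, download_one, limit=None,
                     parallel=False, parallel_runner=None):
    """Pick the sequential or parallel runner and return (count, success)."""
    if parallel and parallel_runner is not None:
        downloaded = parallel_runner(plan, download_one, limit)
    else:
        downloaded = run_sequential(plan, download_one, limit)
    # Assumed meaning of success: the requested number of downloads was met.
    target = len(plan) if limit is None else min(limit, len(plan))
    return downloaded, downloaded >= target
```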

4. Required Implementation Pattern

def download_new_mode(self, videos_to_download, limit=None, force_download=False, **plan_kwargs):
    """New download mode implementation."""

    # 1. Build download plan (mode-specific logic)
    download_plan = []
    for video in videos_to_download:
        # artist, title, filename, and channel_name are placeholders: each mode
        # derives them for the current video before appending the entry
        download_plan.append({
            "video_id": video["id"],
            "artist": artist,
            "title": title,
            "filename": filename,
            "channel_name": channel_name,
            "video_title": video["title"],
            "force_download": force_download,
        })

    # 2. Create cache file (optional, for progress tracking)
    cache_file = get_download_plan_cache_file("new_mode", **plan_kwargs)
    save_plan_cache(cache_file, download_plan, [])

    # 3. Use unified execution workflow
    downloaded_count, success = self.execute_unified_download_workflow(
        download_plan=download_plan,
        cache_file=cache_file,
        limit=limit,
        show_progress=True,
    )

    return success

Benefits of Unified Architecture

  • Consistency: All modes behave identically for execution, progress tracking, and error handling
  • Maintainability: Changes to download execution only need to be made in one place
  • Reliability: Eliminates broken pipelines and inconsistent behavior between modes
  • Extensibility: New modes automatically get all existing features (parallel downloads, progress tracking, etc.)
  • Testing: Easier to test since all modes use the same execution logic

What Was Fixed

  • Broken Pipeline: Previously, different modes used different execution paths, leading to inconsistencies
  • Missing Method: Added the missing download_latest_per_channel() method, which the CLI referenced but which was never implemented
  • Code Duplication: Eliminated duplicate download execution logic across different modes
  • Inconsistent Behavior: All modes now have identical progress tracking, error handling, and cache management

Future Development Guidelines

  1. NEVER implement custom download execution logic in new download modes
  2. ALWAYS use execute_unified_download_workflow() for download execution
  3. Focus on download plan building - that's where mode-specific logic belongs
  4. Use the standard download plan structure for consistency
  5. Implement cache file handling for progress tracking and resume functionality
  6. Test with both sequential and parallel modes to ensure compatibility

🚀 Future Enhancements

  • Web UI for easier management
  • More advanced song matching (multi-language)
  • Download scheduling and retry logic
  • More granular status reporting
  • Parallel downloads for improved speed (completed)
  • Enhanced fuzzy matching with improved video title parsing (completed)
  • Consolidated extract_artist_title function (completed)
  • Duplicate file prevention and filename consistency (completed)
  • Unit tests for all modules
  • Integration tests for end-to-end workflows
  • Plugin system for custom file operations
  • Advanced configuration UI
  • Real-time download progress visualization

🔧 Recent Bug Fixes & Improvements (v3.4.4)

macOS Support with Automatic Platform Detection

  • Cross-platform compatibility: Added support for macOS alongside Windows
  • Automatic platform detection: Detects operating system and selects appropriate yt-dlp binary
  • Flexible yt-dlp integration: Supports both binary files (yt-dlp_macos) and pip installation (python3 -m yt_dlp)
  • Setup automation: setup_macos.py script for easy macOS setup with FFmpeg and yt-dlp installation
  • Command parsing: Intelligent parsing of yt-dlp commands (file paths vs. module commands)
  • Enhanced validation: Platform-specific error messages and validation in CLI
  • Backward compatibility: Maintains full compatibility with existing Windows installations
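Platform detection and command parsing might look roughly like the sketch below. The function names are hypothetical, the Windows binary path `downloader/yt-dlp.exe` is an assumption (only `downloader/yt-dlp_macos` and `python3 -m yt_dlp` appear in this document), and the fallback for other systems is an illustrative choice rather than documented behavior.

```python
import platform


def default_ytdlp_command(system=None):
    """Pick a yt-dlp command string for the current (or given) operating system."""
    system = system or platform.system()
    if system == "Windows":
        return "downloader/yt-dlp.exe"  # assumed path, not confirmed by the PRD
    if system == "Darwin":  # platform.system() reports macOS as "Darwin"
        return "downloader/yt-dlp_macos"
    return "python3 -m yt_dlp"  # assumed fallback: the pip-installed module


def parse_ytdlp_command(command):
    """Turn a configured command string into an argv list.

    "python3 -m yt_dlp" is treated as a module invocation and split into parts;
    anything else is treated as one file path, even if it contains spaces.
    """
    parts = command.split()
    if len(parts) >= 3 and parts[1] == "-m":
        return parts
    return [command]
```

The returned list can be extended with yt-dlp arguments and passed to `subprocess.run` without a shell, which avoids quoting problems with paths that contain spaces.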

Benefits of macOS Support

  • Native macOS experience: No need for Windows compatibility layers or virtualization
  • Automatic setup: Simple setup script handles all dependencies
  • Flexible installation: Choose between binary download or pip installation
  • Consistent functionality: All features work identically on both platforms
  • Easy maintenance: Platform detection handles configuration automatically

Setup Instructions

# Automatic setup (recommended)
python3 setup_macos.py

# Test installation
python3 src/tests/test_macos.py

# Manual setup options
# 1. Install yt-dlp via pip: pip3 install yt-dlp
# 2. Download binary: curl -L -o downloader/yt-dlp_macos https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp_macos
# 3. Install FFmpeg: brew install ffmpeg

🔧 Recent Bug Fixes & Improvements (v3.4.7)

Configurable Data Directory Path

  • Centralized Data Path Management: New data_path_manager.py module provides unified data directory path management
  • Configurable Location: Data directory path can be set in config/config.json under folder_structure.data_dir
  • Backward Compatibility: Defaults to "data" directory if not configured
  • Cross-Project Integration: Enables the karaoke downloader to be used as a component in other projects with different data directory structures
  • Updated All Modules: All modules now use the data path manager instead of hardcoded "data/" paths
  • Utility Functions: Provides get_data_path(), get_data_dir(), and get_data_path_manager() functions for easy access
  • Fixed Circular Dependency: Moved config.json from data/ to the config/ directory to resolve the chicken-and-egg problem
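As a rough illustration of how such a lookup could work, the sketch below reads folder_structure.data_dir from config/config.json and falls back to "data". The function names match those listed above, but the signatures, the `config_file` parameter, and the error handling are assumptions, not the actual data_path_manager.py implementation.

```python
import json
from pathlib import Path

CONFIG_FILE = Path("config") / "config.json"  # fixed location, per this section
DEFAULT_DATA_DIR = "data"


def get_data_dir(config_file=CONFIG_FILE):
    """Return the configured data directory, or the default if none is set."""
    try:
        config = json.loads(Path(config_file).read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        # Missing or unreadable config: keep the backward-compatible default.
        return Path(DEFAULT_DATA_DIR)
    if not isinstance(config, dict):
        return Path(DEFAULT_DATA_DIR)
    folder_structure = config.get("folder_structure") or {}
    return Path(folder_structure.get("data_dir") or DEFAULT_DATA_DIR)


def get_data_path(filename, config_file=CONFIG_FILE):
    """Build the path of a data file inside the configured data directory."""
    return get_data_dir(config_file) / filename
```

Passing a temporary config file as `config_file` is one way tests could redirect all data paths without touching production data.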

Benefits of Configurable Data Directory

  • Flexible Deployment: Can be integrated into other projects with different directory structures
  • Centralized Configuration: Single point of configuration for all data file paths
  • Maintainable Code: Eliminates hardcoded paths throughout the codebase
  • Easy Testing: Can use temporary directories for testing without affecting production data
  • Future-Proof: Makes it easier to change data directory structure in the future

Circular Dependency Solution

The original implementation had a circular dependency problem:

  • Problem: config.json was located in the data/ directory
  • Issue: To read the config file, we needed to know where the data directory is
  • Conflict: But the data directory location is specified in the config file
  • Solution: Moved config.json to the config/ directory as a fixed location
  • Result: Config file is always accessible in a dedicated config directory, and data directory can be configured within it
  • Backward Compatibility: System still works with config files in custom data directories when explicitly specified

🔧 Recent Bug Fixes & Improvements (v3.4.6)

Dry Run Mode

  • New --dry-run parameter: Build download plan and show what would be downloaded without actually downloading anything
  • Plan preview: Shows total videos in plan and preview of first 5 videos
  • Safe testing: Test download configurations without consuming bandwidth or disk space
  • All mode support: Works with all download modes (--channel-focus, --all-videos, --songlist-only, --latest-per-channel)
  • Progress simulation: Shows what the download process would look like without executing it
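The plan preview could be produced by something like the following sketch. `format_dry_run_preview` is a hypothetical helper and the message wording is invented for illustration; only the "total plus first 5 videos" behavior and the plan keys come from this document.

```python
def format_dry_run_preview(download_plan, preview_count=5):
    """Return the lines a dry run might print: the total, then the first few items."""
    lines = [f"Dry run: {len(download_plan)} videos in plan (nothing will be downloaded)"]
    for item in download_plan[:preview_count]:
        lines.append(f"  {item['artist']} - {item['title']} [{item['channel_name']}]")
    remaining = len(download_plan) - preview_count
    if remaining > 0:
        lines.append(f"  ... and {remaining} more")
    return lines
```

Returning lines instead of printing keeps the preview easy to test and leaves the choice of output (console, log file) to the caller.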

Benefits of Dry Run Mode

  • Safe testing: Test complex download configurations without downloading anything
  • Plan validation: Verify that the download plan contains the expected videos
  • Configuration debugging: Troubleshoot download settings before committing to downloads
  • Resource conservation: Save bandwidth and disk space during testing
  • User education: Help users understand what the tool will do before running it

Example Usage

# Test songlist download plan
python download_karaoke.py --songlist-only --limit 5 --dry-run

# Test channel download plan
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --limit 10 --dry-run

# Test with fuzzy matching
python download_karaoke.py --songlist-only --fuzzy-match --limit 3 --dry-run

Future Development Guidelines