🎤 Karaoke Video Downloader

A Python-based cross-platform CLI tool to download karaoke videos from YouTube channels/playlists using yt-dlp, with advanced tracking, songlist prioritization, and flexible configuration. Supports Windows and macOS with automatic platform detection.

Features

  • 🎵 Channel & Playlist Downloads: Download all videos from a YouTube channel or playlist
  • 📂 Organized Storage: Each channel gets its own folder in downloads/
  • 📝 Robust Tracking: Tracks all downloads, statuses, and formats in JSON
  • 🏆 Songlist Prioritization: Prioritize or restrict downloads to a custom songlist
  • 🔄 Batch Saving & Caching: Efficient, minimizes API calls
  • 🏷️ ID3 Tagging: Adds artist/title metadata to MP4 files
  • 🧹 Automatic Cleanup: Removes extra yt-dlp files
  • 📈 Real-Time Progress: Detailed console and log output
  • 🧹 Reset/Clear Channel: Reset all tracking and files for a channel, or clear channel cache via CLI
  • 🗂️ Latest-per-channel download: Download the latest N videos from each channel in a single batch, with server deduplication, fuzzy matching support, per-channel download plan, robust resume, and unique plan cache. Use --latest-per-channel and --limit N.
  • 🧩 Enhanced Fuzzy Matching: Advanced fuzzy string matching for songlist-to-video matching with improved video title parsing (handles multiple title formats like "Title Karaoke | Artist Karaoke Version")
  • Fast Mode with Early Exit: When a limit is set, scans channels and songs in order, downloads immediately when a match is found, and stops as soon as the limit is reached with successful downloads
  • 🔄 Deduplication Across Channels: Ensures the same song is not downloaded from multiple channels, even if it appears in more than one channel's video list
  • 📋 Default Channel File: Automatically uses data/channels.txt as the default channel list for songlist modes (no need to specify --file every time)
  • 🛡️ Robust Interruption Handling: Progress is saved after each download, preventing re-downloads if the process is interrupted
  • Optimized Scanning: High-performance channel scanning with O(n×m) complexity, pre-processed lookups, and early termination for faster matching
  • 🏷️ Server Duplicates Tracking: Automatically checks against local songs.json file and marks duplicates for future skipping, preventing re-downloads of songs already on the server
  • Parallel Downloads: Enable concurrent downloads with --parallel --workers N for significantly faster batch downloads (3-5x speedup)
  • 📊 Unmatched Songs Reports: Generate detailed reports of songs that couldn't be found in any channel with --generate-unmatched-report
  • 🛡️ Duplicate File Prevention: Automatically detects and prevents duplicate files with (2), (3) suffixes, with cleanup utility for existing duplicates
  • 🏷️ Consistent Metadata: Filename and ID3 tag use identical artist/title format for clear file identification
  • 🍎 macOS Support: Automatic platform detection and setup with native macOS binaries and FFmpeg integration

🏗️ Architecture

The codebase has been comprehensively refactored into a modular architecture with centralized utilities for improved maintainability, error handling, and code reuse:

Configurable Data Directory (v3.4.7)

  • Centralized Data Path Management: data_path_manager.py provides unified data directory path management
  • Configurable Location: Data directory path can be set in config.json under folder_structure.data_dir
  • Backward Compatibility: Defaults to "data" directory if not configured
  • Cross-Project Integration: Enables the karaoke downloader to be used as a component in other projects with different data directory structures
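
The lookup described above can be sketched roughly as follows. This is a minimal illustration, not the code in data_path_manager.py: the function names resolve_data_dir and data_file are hypothetical, and only the folder_structure.data_dir key and the "data" default come from this README.

```python
import json
from pathlib import Path

DEFAULT_DATA_DIR = "data"  # documented fallback when nothing is configured


def resolve_data_dir(config_path: str = "config.json") -> Path:
    """Return folder_structure.data_dir from the config, or the default."""
    path = Path(config_path)
    config = {}
    if path.is_file():
        with path.open(encoding="utf-8") as fh:
            config = json.load(fh)
    folder_structure = config.get("folder_structure") or {}
    return Path(folder_structure.get("data_dir") or DEFAULT_DATA_DIR)


def data_file(name: str, config_path: str = "config.json") -> Path:
    """Build the path of a tracking/cache file inside the data directory."""
    return resolve_data_dir(config_path) / name
```

With no config.json present, data_file("songList.json") resolves to data/songList.json, which matches the backward-compatible default.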

Core Modules:

  • downloader.py: Main orchestrator and CLI interface
  • video_downloader.py: Core video download execution and orchestration
  • tracking_manager.py: Download tracking and status management
  • download_planner.py: Download plan building and channel scanning
  • cache_manager.py: Cache operations and file I/O management
  • channel_manager.py: Channel and file management operations
  • songlist_manager.py: Songlist operations and tracking
  • server_manager.py: Server song availability checking
  • fuzzy_matcher.py: Fuzzy matching logic and similarity functions

Utility Modules (v3.2):

  • youtube_utils.py: Centralized YouTube operations and yt-dlp command generation
  • error_utils.py: Standardized error handling and formatting
  • download_pipeline.py: Abstracted download → verify → tag → track pipeline
  • id3_utils.py: ID3 tagging utilities
  • config_manager.py: Configuration management
  • resolution_cli.py: Resolution checking utilities
  • tracking_cli.py: Tracking management CLI

New Utility Modules (v3.3):

  • file_utils.py: Centralized file operations, filename sanitization, and file validation
  • song_validator.py: Centralized song validation logic for checking if songs should be downloaded

New Utility Modules (v3.4.7):

  • data_path_manager.py: Centralized data directory path management and file path resolution

Unified Download Workflow (v3.4.5)

  • execute_unified_download_workflow(): Centralized download execution that all modes use
  • _execute_sequential_downloads(): Sequential download execution using DownloadPipeline
  • _execute_parallel_downloads(): Parallel download execution using ParallelDownloader

Benefits of Enhanced Modular Architecture:

  • Single Responsibility: Each module has a focused purpose
  • Centralized Utilities: Common operations (file operations, song validation, yt-dlp commands, error handling) are centralized
  • Reduced Duplication: Eliminated ~150 lines of code duplication across modules
  • Testability: Individual components can be tested separately
  • Maintainability: Easier to find and fix issues
  • Reusability: Components can be used independently
  • Robustness: Better error handling and interruption recovery
  • Consistency: Standardized error messages and processing pipelines
  • Type Safety: Comprehensive type hints across all new modules
  • Unified Execution: All download modes use the same execution pipeline for consistency

🔧 Development Guidelines

Adding New Download Modes

When adding new download modes, follow the unified workflow pattern to ensure consistency:

1. Build Download Plan (Mode-Specific)

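# Illustrative signature: the parameter names are assumptions, adapt them to the mode.
def download_new_mode(self, videos_to_download, channel_name, cache_file,
                      limit=None, force_download=False):
    # Build download plan with standard structure
    download_plan = []
    for video in videos_to_download:
        # Mode-specific step: derive artist, title, and filename for this video.
        # The call shapes of extract_artist_title() (fuzzy_matcher.py) and
        # sanitize_filename() (file_utils.py) are assumed here.
        artist, title = extract_artist_title(video["title"])
        filename = sanitize_filename(artist, title)
        download_plan.append({
            "video_id": video["id"],
            "artist": artist,
            "title": title,
            "filename": filename,
            "channel_name": channel_name,
            "video_title": video["title"],
            "force_download": force_download,
        })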
    
    # Use unified execution workflow
    downloaded_count, success = self.execute_unified_download_workflow(
        download_plan=download_plan,
        cache_file=cache_file,
        limit=limit,
        show_progress=True,
    )
    
    return success

2. Key Principles

  • NEVER implement custom download execution logic - always use execute_unified_download_workflow()
  • Focus on download plan building - that's where mode-specific logic belongs
  • Use the standard download plan structure for consistency
  • Implement cache file handling for progress tracking and resume functionality
  • Test with both sequential and parallel modes to ensure compatibility

3. Benefits of Unified Architecture

  • Consistency: All modes behave identically for execution, progress tracking, and error handling
  • Automatic Features: New modes automatically get parallel downloads, progress tracking, and cache management
  • Maintainability: Changes to download execution only need to be made in one place
  • Reliability: Eliminates broken pipelines and inconsistent behavior between modes

🔧 Recent Improvements (v3.4.1)

Enhanced Fuzzy Matching

  • Improved title parsing: Enhanced extract_artist_title function to handle multiple video title formats
  • Better matching accuracy: Can now parse titles like "Hold On Loosely Karaoke | 38 Special Karaoke Version"
  • Consistent parsing: All modules now use the same parsing logic from fuzzy_matcher.py
  • Reduced false negatives: Songs that previously couldn't be matched due to title format differences now have a higher chance of being found
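
To make the title formats concrete, here is a small sketch of this kind of parsing. It is not the project's extract_artist_title: the name parse_karaoke_title, the regular expression, and the second "Artist - Title" format are assumptions chosen for illustration.

```python
import re
from typing import Optional, Tuple

# Trailing noise such as "Karaoke", "Karaoke Version", or "(Karaoke Version)"
_NOISE = re.compile(r"\s*\(?\bkaraoke(\s+version)?\)?\s*$", re.IGNORECASE)


def _strip_noise(text: str) -> str:
    """Drop a trailing karaoke marker and surrounding whitespace."""
    return _NOISE.sub("", text).strip()


def parse_karaoke_title(video_title: str) -> Optional[Tuple[str, str]]:
    """Return (artist, title), or None if no known format matches."""
    if "|" in video_title:
        # Format: "Title Karaoke | Artist Karaoke Version"
        title_part, artist_part = video_title.split("|", 1)
        return _strip_noise(artist_part), _strip_noise(title_part)
    if " - " in video_title:
        # Format (assumed): "Artist - Title (Karaoke Version)"
        artist_part, title_part = video_title.split(" - ", 1)
        return _strip_noise(artist_part), _strip_noise(title_part)
    return None
```

Calling parse_karaoke_title("Hold On Loosely Karaoke | 38 Special Karaoke Version") yields the artist "38 Special" and the title "Hold On Loosely".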

Fixed Import Conflicts

  • Resolved import conflicts: Updated modules to use the enhanced extract_artist_title from fuzzy_matcher.py
  • Consistent behavior: All parts of the system use the same parsing logic
  • Cleaner codebase: Eliminated duplicate code and import conflicts

Fixed --limit Parameter

  • Correct limit application: The --limit parameter now properly limits the scanning phase, not just downloads
  • Improved performance: When using --limit N, only the first N songs are scanned, significantly reducing processing time
  • Accurate logging: Logging messages now show the correct counts for songs that will actually be processed when using --limit

Code Quality Improvements

  • Eliminated duplicate functions: Removed duplicate extract_artist_title implementations
  • Fixed import conflicts: Resolved inconsistencies between different parsing implementations
  • Single source of truth: All title parsing logic is now centralized in fuzzy_matcher.py

🔧 Recent Improvements (v3.4.5)

Unified Download Workflow Architecture

  • Unified execution pipeline: All download modes now use the same execution workflow, eliminating inconsistencies and broken pipelines
  • Consistent behavior: All modes (--channel-focus, --all-videos, --songlist-only, --latest-per-channel) use identical download execution, progress tracking, and error handling
  • Centralized download logic: Single execute_unified_download_workflow() method handles all download execution
  • Automatic parallel support: All download modes automatically support --parallel --workers N without additional implementation
  • Unified cache management: Consistent progress tracking and resume functionality across all modes

What Was Fixed

  • Broken Pipeline: Previously, different modes used different execution paths, leading to inconsistencies
  • Missing Method: Added missing download_latest_per_channel() method that was referenced in CLI but not implemented
  • Code Duplication: Eliminated duplicate download execution logic across different modes
  • Inconsistent Behavior: All modes now have identical progress tracking, error handling, and cache management

Benefits

  • Consistency: All modes behave identically for execution, progress tracking, and error handling
  • Maintainability: Changes to download execution only need to be made in one place
  • Reliability: Eliminates broken pipelines and inconsistent behavior between modes
  • Extensibility: New modes automatically get all existing features (parallel downloads, progress tracking, etc.)
  • Testing: Easier to test since all modes use the same execution logic

🛡️ Duplicate File Prevention & Filename Consistency (v3.4.2)

Duplicate File Prevention

  • Enhanced file existence checking: Now detects files with (2), (3), etc. suffixes that yt-dlp creates
  • Automatic duplicate prevention: Skips downloads when files already exist (including duplicates)
  • Updated yt-dlp configuration: Set "nooverwrites": false to prevent yt-dlp from creating duplicate files
  • Cleanup utility: data/cleanup_duplicate_files.py helps identify and remove existing duplicate files
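
One way to detect numbered copies before downloading is sketched here. It is only an approximation of the idea, not the project's check_file_exists_with_patterns(); the function name find_existing_copies and the exact suffix pattern are assumptions.

```python
import re
from pathlib import Path
from typing import List

# Matches stems like "Artist - Title (2)" and captures the base name
_DUP_SUFFIX = re.compile(r"^(?P<base>.+?) \((?P<n>\d+)\)$")


def find_existing_copies(directory: Path, filename: str) -> List[Path]:
    """Return the file itself and any numbered copies already on disk."""
    target = Path(filename)
    matches = []
    for candidate in sorted(directory.glob(f"*{target.suffix}")):
        match = _DUP_SUFFIX.match(candidate.stem)
        base = match.group("base") if match else candidate.stem
        if base == target.stem:
            matches.append(candidate)
    return matches
```

A download would be skipped whenever the returned list is non-empty. Titles that legitimately end in a parenthesized number would be treated as copies by this sketch, so a real check would need to account for them.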

Filename vs ID3 Tag Consistency

  • Consistent metadata: Filename and ID3 tag now use identical artist/title format
  • Removed extra suffixes: No more "(Karaoke Version)" in ID3 tags that don't match filenames
  • Unified parsing: Both filename generation and ID3 tagging use the same artist/title extraction

Benefits

  • No more duplicate files with (2), (3) suffixes
  • Consistent metadata between filename and ID3 tags
  • Efficient disk usage by preventing unnecessary downloads
  • Clear file identification with consistent naming

Clean Up Existing Duplicates

# Run the cleanup utility to find and remove existing duplicates
python data/cleanup_duplicate_files.py

# Choose option 1 for dry run (recommended first)
# Choose option 2 to actually delete duplicates

📋 Requirements

  • Windows 10/11 or macOS 10.14+
  • Python 3.7+
  • yt-dlp binary (platform-specific, see setup instructions below)
  • mutagen (for ID3 tagging, optional)
  • ffmpeg/ffprobe (for video validation, optional but recommended)
  • rapidfuzz (for fuzzy matching, optional, falls back to difflib)

🍎 macOS Setup

Run the macOS setup script to automatically set up yt-dlp and FFmpeg:

python3 setup_macos.py

This script will:

  • Detect your macOS version
  • Offer installation options for yt-dlp (pip or binary download)
  • Install FFmpeg via Homebrew
  • Test the installation

Manual Setup

If you prefer to set up manually:

Option 1: Install yt-dlp via pip

pip3 install yt-dlp

Option 2: Download yt-dlp binary

mkdir -p downloader
curl -L -o downloader/yt-dlp_macos https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp_macos
chmod +x downloader/yt-dlp_macos

Install FFmpeg

brew install ffmpeg

Test Installation

python3 src/tests/test_macos.py

🚀 Quick Start

💡 Pro Tip: For a complete list of all available commands, see commands.txt - you can copy/paste any command directly into your terminal!

Download a Channel

python download_karaoke.py https://www.youtube.com/@SingKingKaraoke/videos

Download ALL Videos from a Channel (Not Just Songlist Matches)

python download_karaoke.py --channel-focus SingKingKaraoke --all-videos

Download ALL Videos with Parallel Processing

python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --parallel --workers 10

Download ALL Videos with Limit

python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --limit 100

Download Only Songlist Songs (Fast Mode)

python download_karaoke.py --songlist-only --limit 5

Download with Parallel Processing

python download_karaoke.py --parallel --songlist-only --limit 10

Focus on Specific Playlists by Title

python download_karaoke.py --songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100"

Focus on Specific Playlists from Custom File

python download_karaoke.py --songlist-focus "CCKaraoke" --songlist-file "data/my_custom_songlist.json"

Force Download from Channels (Bypass All Existing File Checks)

python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --force

Download with Fuzzy Matching

python download_karaoke.py --songlist-only --limit 10 --fuzzy-match --fuzzy-threshold 85

Test Download Plan (Dry Run)

python download_karaoke.py --songlist-only --limit 5 --dry-run

Test Channel Download Plan (Dry Run)

python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --limit 10 --dry-run

Download Latest N Videos Per Channel

python download_karaoke.py --latest-per-channel --limit 5

Download Latest N Videos Per Channel (with fuzzy matching)

python download_karaoke.py --latest-per-channel --limit 5 --fuzzy-match --fuzzy-threshold 85

Prioritize Songlist in Download Queue

python download_karaoke.py --songlist-priority

Show Songlist Download Progress

python download_karaoke.py --songlist-status

Limit Number of Downloads

python download_karaoke.py --limit 5

Override Resolution

python download_karaoke.py --resolution 1080p

Reset/Start Over for a Channel

python download_karaoke.py --reset-channel SingKingKaraoke

Reset Channel and Songlist Songs

python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist

Clear Channel Cache

python download_karaoke.py --clear-cache SingKingKaraoke
python download_karaoke.py --clear-cache all

🧠 Songlist Integration

  • Place your prioritized song list in data/songList.json (see example format below).
  • The tool will match and prioritize these songs across all available channel videos.
  • Use --songlist-only to download only these songs, or --songlist-priority to prioritize them in the queue.
  • Use --songlist-focus to download only songs from specific playlists by title (e.g., --songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100").
  • Download progress for the songlist is tracked globally in data/songlist_tracking.json.

Example data/songList.json

[
  {
    "title": "2025 - Apple Top 50",
    "songs": [
      { "artist": "Kendrick Lamar & SZA", "title": "luther", "position": 1 },
      { "artist": "Kendrick Lamar", "title": "Not Like Us", "position": 2 }
    ]
  },
  {
    "title": "2024 - Billboard Hot 100",
    "songs": [
      { "artist": "Taylor Swift", "title": "Cruel Summer", "position": 1 },
      { "artist": "Billie Eilish", "title": "Happier Than Ever", "position": 2 }
    ]
  }
]
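
Given that structure, restricting work to the playlists named with --songlist-focus could look like the sketch below. The function name select_songs and the exact-title comparison are assumptions; the real matching rules live in the project's songlist code.

```python
from typing import Dict, List


def select_songs(playlists: List[Dict], focus_titles: List[str]) -> List[Dict]:
    """Flatten playlists into songs, keeping only the focused playlists.

    An empty focus list keeps every playlist.
    """
    wanted = set(focus_titles)
    selected = []
    for playlist in playlists:
        if wanted and playlist["title"] not in wanted:
            continue
        # Keep chart order by sorting on the documented "position" field
        for song in sorted(playlist["songs"], key=lambda s: s["position"]):
            selected.append({**song, "playlist": playlist["title"]})
    return selected
```

The playlists argument is the list obtained by loading data/songList.json with json.load().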

🛠️ Tracking & Caching

  • data/karaoke_tracking.json: Tracks all downloads, statuses, and formats
  • data/songlist_tracking.json: Tracks global songlist download progress
  • data/server_duplicates_tracking.json: Tracks songs found to be duplicates on the server for future skipping
  • data/channel_cache.json: Caches channel video lists for performance

📂 Folder Structure

KaroakeVideoDownloader/
├── commands.txt               # Complete CLI commands reference (copy/paste ready)
├── karaoke_downloader/         # All core Python code and utilities
│   ├── downloader.py           # Main orchestrator and CLI interface
│   ├── cli.py                  # CLI entry point
│   ├── video_downloader.py     # Core video download execution and orchestration
│   ├── tracking_manager.py     # Download tracking and status management
│   ├── download_planner.py     # Download plan building and channel scanning
│   ├── cache_manager.py        # Cache operations and file I/O management
│   ├── channel_manager.py      # Channel and file management operations
│   ├── songlist_manager.py     # Songlist operations and tracking
│   ├── server_manager.py       # Server song availability checking
│   ├── fuzzy_matcher.py        # Fuzzy matching logic and similarity functions
│   ├── youtube_utils.py        # Centralized YouTube operations and yt-dlp commands
│   ├── error_utils.py          # Standardized error handling and formatting
│   ├── download_pipeline.py    # Abstracted download → verify → tag → track pipeline
│   ├── id3_utils.py            # ID3 tagging utilities
│   ├── config_manager.py       # Configuration management with dataclasses
│   ├── file_utils.py           # Centralized file operations and filename handling
│   ├── song_validator.py       # Centralized song validation logic
│   ├── check_resolution.py     # Resolution checker utility
│   ├── resolution_cli.py       # Resolution config CLI
│   └── tracking_cli.py         # Tracking management CLI
├── config.json               # Main configuration file
├── data/                     # All tracking, cache, and songlist files
│   ├── karaoke_tracking.json
│   ├── songlist_tracking.json
│   ├── channel_cache.json
│   ├── server_duplicates_tracking.json
│   ├── channels.txt
│   └── songList.json
├── downloads/                 # All video output
│   └── [ChannelName]/         # Per-channel folders
├── logs/                      # Download logs
├── downloader/yt-dlp.exe      # yt-dlp binary (Windows)
├── downloader/yt-dlp_macos    # yt-dlp binary (macOS)
├── src/tests/                 # Test scripts
│   ├── test_macos.py         # macOS setup and functionality tests
│   └── test_platform.py      # Platform detection tests
├── download_karaoke.py        # Main entry point (thin wrapper)
├── README.md
├── PRD.md
├── requirements.txt
└── download_karaoke.bat       # (optional Windows launcher)

🚦 CLI Options

📋 Complete Command Reference: See commands.txt for all available commands with examples - perfect for copy/paste!

Key Options:

  • --file <data/channels.txt>: Download from a list of channels (optional, defaults to data/channels.txt for songlist modes)
  • --songlist-priority: Prioritize songlist songs in download queue
  • --songlist-only: Download only songs from the songlist
  • --songlist-focus <PLAYLIST_TITLE1> <PLAYLIST_TITLE2>...: Focus on specific playlists by title (e.g., --songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100")
  • --songlist-file <FILE_PATH>: Custom songlist file path to use with --songlist-focus (default: data/songList.json)
  • --songlist-status: Show songlist download progress
  • --limit <N>: Limit number of downloads (enables fast mode with early exit)
  • --resolution <720p|1080p|...>: Override resolution
  • --status: Show download/tracking status
  • --reset-channel <CHANNEL_NAME>: Reset all tracking and files for a channel
  • --reset-songlist: When used with --reset-channel, also reset songlist songs for this channel
  • --clear-cache <CHANNEL_ID|all>: Clear channel video cache for a specific channel or all
  • --clear-server-duplicates: Clear server duplicates tracking (allows re-checking songs against server)
  • --latest-per-channel: Download the latest N videos from each channel (use with --limit)
  • --fuzzy-match: Enable fuzzy matching for songlist-to-video matching (uses rapidfuzz if available)
  • --fuzzy-threshold <N>: Fuzzy match threshold (0-100, default 85)
  • --parallel: Enable parallel downloads for improved speed (defaults to 3 workers)
  • --workers <N>: Number of parallel download workers (1-10, default: 3, only used with --parallel)
  • --generate-songlist <DIR1> <DIR2>...: Generate song list from MP4 files with ID3 tags in specified directories
  • --no-append-songlist: Create a new song list instead of appending when using --generate-songlist
  • --force: Force download from channels, bypassing all existing file checks and re-downloading if necessary
  • --channel-focus <CHANNEL_NAME>: Download from a specific channel by name (e.g., 'SingKingKaraoke')
  • --all-videos: Download all videos from channel (not just songlist matches), skipping existing files
  • --dry-run: Build download plan and show what would be downloaded without actually downloading anything
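
For readers who want to see how a few of these options fit together, the following argparse sketch mirrors the documented defaults (3 workers, range 1-10, fuzzy threshold 85). It is an illustration only and is not the parser in cli.py; build_parser and worker_count are hypothetical names, and only a subset of the flags is included.

```python
import argparse


def worker_count(value: str) -> int:
    """Accept only the documented range of 1-10 workers."""
    n = int(value)
    if not 1 <= n <= 10:
        raise argparse.ArgumentTypeError("workers must be between 1 and 10")
    return n


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="download_karaoke.py")
    parser.add_argument("--songlist-only", action="store_true")
    parser.add_argument("--songlist-focus", nargs="+", default=[])
    parser.add_argument("--limit", type=int, default=None)
    parser.add_argument("--parallel", action="store_true")
    parser.add_argument("--workers", type=worker_count, default=3)
    parser.add_argument("--fuzzy-match", action="store_true")
    parser.add_argument("--fuzzy-threshold", type=int, default=85)
    parser.add_argument("--dry-run", action="store_true")
    return parser
```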

📝 Example Usage

💡 For complete examples: See commands.txt for all command variations with explanations!

# Fast mode with fuzzy matching (no need to specify --file)
python download_karaoke.py --songlist-only --limit 10 --fuzzy-match --fuzzy-threshold 85

# Parallel downloads for faster processing
python download_karaoke.py --parallel --songlist-only --limit 10

# Latest videos per channel with parallel downloads
python download_karaoke.py --parallel --latest-per-channel --limit 5

# Traditional full scan (no limit)
python download_karaoke.py --songlist-only

# Focused fuzzy matching (target specific playlists with flexible matching)
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --fuzzy-match --fuzzy-threshold 80 --limit 10

# Focus on specific playlists from a custom file
python download_karaoke.py --songlist-focus "CCKaraoke" --songlist-file "data/my_custom_songlist.json" --limit 10

# Force download with fuzzy matching (bypass all existing file checks)
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --force --fuzzy-match --fuzzy-threshold 80 --limit 10

# Channel-specific operations
python download_karaoke.py --reset-channel SingKingKaraoke
python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist
python download_karaoke.py --clear-cache all
python download_karaoke.py --clear-server-duplicates

# Download ALL videos from a specific channel
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --parallel --workers 10
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --limit 100

# Song list generation from MP4 files
python download_karaoke.py --generate-songlist /path/to/mp4/directory
python download_karaoke.py --generate-songlist /path/to/dir1 /path/to/dir2 --no-append-songlist

# Generate report of songs that couldn't be found
python download_karaoke.py --generate-unmatched-report
python download_karaoke.py --generate-unmatched-report --fuzzy-match --fuzzy-threshold 85

🏷️ ID3 Tagging

  • Adds artist/title/album/genre to MP4 files using mutagen (if installed)
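
As a rough sketch of what tagging with mutagen involves: MP4 containers store this metadata in iTunes-style atoms rather than literal ID3 frames, and mutagen exposes them through mutagen.mp4.MP4. The helper names below are hypothetical and this is not the code in id3_utils.py.

```python
from typing import Dict, List, Optional

# Atom names mutagen uses for common MP4 metadata fields
MP4_ATOMS = {"title": "\xa9nam", "artist": "\xa9ART", "album": "\xa9alb", "genre": "\xa9gen"}


def build_mp4_tags(artist: str, title: str,
                   album: Optional[str] = None,
                   genre: Optional[str] = None) -> Dict[str, List[str]]:
    """Map the provided fields to MP4 atoms (text atoms take a list of strings)."""
    fields = {"artist": artist, "title": title, "album": album, "genre": genre}
    return {MP4_ATOMS[name]: [value] for name, value in fields.items() if value}


def tag_mp4(path: str, artist: str, title: str, **extra: Optional[str]) -> bool:
    """Write tags if mutagen is installed; return False when it is not."""
    try:
        from mutagen.mp4 import MP4
    except ImportError:
        return False  # tagging is optional, so skip quietly
    video = MP4(path)
    for atom, values in build_mp4_tags(artist, title, **extra).items():
        video[atom] = values
    video.save()
    return True
```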

📋 Song List Generation

  • Generate song lists from existing MP4 files: Use --generate-songlist to create song lists from directories containing MP4 files with ID3 tags
  • Automatic ID3 extraction: Extracts artist and title from MP4 files' ID3 tags
  • Directory-based organization: Each directory becomes a playlist with the directory name as the title
  • Position tracking: Songs are numbered starting from 1 based on file order
  • Append or replace: Choose to append to existing song list or create a new one with --no-append-songlist
  • Multiple directories: Process multiple directories in a single command
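
The resulting structure can be sketched as below, assuming the artist/title pairs have already been read from the files' tags. The function names are hypothetical, and how --generate-songlist actually orders files or merges lists is not specified here.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Tuple


def build_playlist(directory: str, tagged_files: Iterable[Tuple[str, str]]) -> Dict:
    """Turn (artist, title) pairs into one playlist named after the directory."""
    songs = [
        {"artist": artist, "title": title, "position": position}
        for position, (artist, title) in enumerate(tagged_files, start=1)
    ]
    return {"title": Path(directory).name, "songs": songs}


def merge_songlists(existing: List[Dict], new: List[Dict], append: bool = True) -> List[Dict]:
    """Append to the existing list, or replace it (as with --no-append-songlist)."""
    return existing + new if append else list(new)
```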

🧹 Cleanup

  • Removes .info.json and .meta files after download
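
A cleanup pass of this kind can be sketched in a few lines. The function name cleanup_sidecar_files is hypothetical (the project's own helper is cleanup_temp_files() in file_utils.py), and this version only looks at a single folder without recursing.

```python
from pathlib import Path
from typing import List

TEMP_SUFFIXES = (".info.json", ".meta")  # leftovers named in this README


def cleanup_sidecar_files(directory: Path) -> List[Path]:
    """Delete leftover metadata files in one folder and report what was removed."""
    removed = []
    for path in directory.iterdir():
        if path.is_file() and path.name.endswith(TEMP_SUFFIXES):
            path.unlink()
            removed.append(path)
    return removed
```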

🛠️ Configuration

  • All options are in config.json (format, resolution, metadata, etc.)
  • You can edit this file or use CLI flags to override
  • Configurable Data Directory: The data directory path can be configured in config.json under folder_structure.data_dir (default: "data")

📋 Command Reference File

commands.txt contains a comprehensive list of all CLI commands with explanations. This file is designed for easy copy/paste usage and includes:

  • All basic download commands
  • Songlist operations
  • Latest-per-channel downloads
  • Cache and tracking management
  • Reset and cleanup operations
  • Advanced combinations
  • Common workflows
  • Troubleshooting commands

🔄 Maintenance Note: The commands.txt file should be kept up to date with any CLI changes. When adding new command-line options or modifying existing ones, update this file to reflect all available commands and their usage.

📚 Documentation Standards

Documentation Location

  • All changes, refactoring, and improvements should be documented in the PRD.md and README.md files
  • Do NOT create separate .md files for documenting changes, refactoring, or improvements
  • Use the existing sections in PRD.md and README.md to track all project evolution

Where to Document Changes

  • PRD.md: Technical details, architecture changes, bug fixes, and implementation specifics
  • README.md: User-facing features, usage instructions, and high-level improvements
  • CHANGELOG.md: Version-specific release notes and change summaries

Documentation Requirements

  • All new features must be documented in both PRD.md and README.md
  • All refactoring efforts must be documented in the appropriate sections
  • All bug fixes must be documented with technical details
  • Version numbers and dates should be clearly marked
  • Benefits and improvements should be explicitly stated

Maintenance Responsibility

  • Keep PRD.md and README.md synchronized with code changes
  • Update documentation immediately when implementing new features
  • Remove outdated information and consolidate related changes
  • Ensure all CLI options and features are documented in both files

🔧 Refactoring Improvements (v3.3)

The codebase has been comprehensively refactored to improve maintainability and reduce code duplication. Recent improvements have enhanced reliability, performance, and code organization:

New Utility Modules (v3.3)

  • file_utils.py: Centralized file operations, filename sanitization, and file validation
    • sanitize_filename(): Create safe filenames from artist/title
    • generate_possible_filenames(): Generate filename patterns for different modes
    • check_file_exists_with_patterns(): Check for existing files using multiple patterns
    • is_valid_mp4_file(): Validate MP4 files with header checking
    • cleanup_temp_files(): Remove temporary yt-dlp files
    • ensure_directory_exists(): Safe directory creation
  • song_validator.py: Centralized song validation logic
    • SongValidator class: Unified logic for checking if songs should be downloaded
    • should_skip_song(): Comprehensive validation with multiple criteria
    • mark_song_failed(): Consistent failure tracking
    • handle_download_failure(): Standardized error handling
  • Enhanced config_manager.py: Robust configuration management with dataclasses
    • ConfigManager class: Type-safe configuration loading and caching
    • DownloadSettings, FolderStructure, LoggingConfig dataclasses
    • Configuration validation and merging with defaults
    • Dynamic resolution updates

Benefits Achieved

  • Eliminated Code Duplication: ~150 lines of duplicate code removed across modules
  • Centralized File Operations: Single source of truth for filename handling and file validation
  • Unified Song Validation: Consistent logic for checking if songs should be downloaded
  • Enhanced Type Safety: Comprehensive type hints across all new modules
  • Improved Configuration Management: Structured configuration with validation and caching
  • Better Error Handling: Consistent patterns via centralized utilities
  • Enhanced Maintainability: Changes to file operations or song validation only require updates in one place
  • Improved Testability: Modular components can be tested independently
  • Better Developer Experience: Clear function signatures and comprehensive documentation

New Parallel Download System (v3.4)

  • Parallel downloader module: parallel_downloader.py provides thread-safe concurrent download management
  • Configurable concurrency: Use --parallel to enable parallel downloads with 3 workers by default, or --parallel --workers N for custom worker count (1-10)
  • Thread-safe operations: All tracking, caching, and progress operations are thread-safe
  • Real-time progress tracking: Shows active downloads, completion status, and overall progress
  • Automatic retry mechanism: Failed downloads are automatically retried with reduced concurrency
  • Backward compatibility: Sequential downloads remain the default when --parallel is not used
  • Performance improvements: Significantly faster downloads for large batches (3-5x speedup with 3-5 workers)
  • Integrated with all modes: Works with both songlist-across-channels and latest-per-channel download modes
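
The general shape of such a worker pool is sketched below using the standard library. It is not the implementation in parallel_downloader.py: run_parallel and download_one are hypothetical names, the retry step is omitted, and the clamp simply mirrors the documented 1-10 worker range.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List


def run_parallel(plan: List[Dict], download_one: Callable[[Dict], bool],
                 workers: int = 3) -> Dict[str, bool]:
    """Run download_one over the plan concurrently and collect per-video results."""
    workers = max(1, min(10, workers))  # keep within the documented range
    results: Dict[str, bool] = {}
    lock = threading.Lock()  # guards the shared results dictionary

    def task(item: Dict) -> None:
        try:
            ok = download_one(item)
        except Exception:
            ok = False  # a failed download should not stop the batch
        with lock:
            results[item["video_id"]] = ok

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(task, item) for item in plan]
        for future in as_completed(futures):
            future.result()
    return results
```

Items in plan follow the standard download plan structure, so each carries a "video_id" key.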

Previous Improvements (v3.2)

  • Centralized yt-dlp Command Generation: Standardized command building and execution across all download operations
  • Enhanced Error Handling: Structured exception hierarchy with consistent error messages and formatting
  • Abstracted Download Pipeline: Reusable download → verify → tag → track process for consistent processing
  • Download plan pre-scan: Before downloading, the tool scans all channels for songlist matches, builds a download plan, and prints stats (matches, unmatched, per-channel breakdown). The plan is cached for 1 day and reused unless --force-download-plan is set.
  • Latest-per-channel plan: Download the latest N videos from each channel, with a per-channel plan and robust resume. Each channel is removed from the plan as it completes. Plan cache is deleted when all channels are done.
  • Fast mode with early exit: When a limit is set, the tool scans channels and songs in order, downloads immediately when a match is found, and stops as soon as the limit is reached with successful downloads. This provides much faster performance for small limits compared to the full pre-scan approach.
  • Deduplication across channels: Tracks unique song keys (artist + normalized title) to ensure the same song is not downloaded from multiple channels, even if it appears in more than one channel's video list.
  • Fuzzy matching: Uses string similarity algorithms to find approximate matches between songlist entries and video titles, tolerating minor differences, typos, or extra words like "Karaoke" or "Official Video".
  • Default channel file: For songlist-only and latest-per-channel modes, if no --file is specified, automatically uses data/channels.txt as the default channel list, reducing the need to specify the file path repeatedly.
  • Robust interruption handling: Progress is saved after each download, and files are checked for existence before downloading to prevent re-downloads if the process is interrupted.
  • Optimized scanning algorithm: High-performance channel scanning with O(n×m) complexity, pre-processed song lookups using sets and dictionaries, and early termination for faster matching of large songlists and channels.
  • Enhanced cache management: Improved channel cache key handling for better cache hit rates and reduced YouTube API calls.
  • Robust download plan execution: Fixed index management in download plan execution to prevent errors during interrupted downloads.
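
Two of the ideas above, cross-channel deduplication by song key and early exit once the limit is reached, can be combined in a short sketch. The names song_key and download_with_limit, the normalization rule, and the input shapes are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict, List, Set, Tuple


def _normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace (assumed normalization)."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def song_key(artist: str, title: str) -> str:
    return f"{_normalize(artist)}|{_normalize(title)}"


def download_with_limit(channels: Dict[str, List[Tuple[str, str, str]]],
                        wanted: List[Tuple[str, str]],
                        download: Callable[[str], bool],
                        limit: int) -> List[str]:
    """Scan channels in order, skip songs already fetched, stop at the limit.

    channels maps a channel name to (artist, title, video_id) tuples.
    """
    wanted_keys = {song_key(artist, title) for artist, title in wanted}
    done: Set[str] = set()
    downloaded: List[str] = []
    for videos in channels.values():
        for artist, title, video_id in videos:
            key = song_key(artist, title)
            if key not in wanted_keys or key in done:
                continue
            if download(video_id):  # only successful downloads count
                done.add(key)
                downloaded.append(video_id)
                if len(downloaded) >= limit:
                    return downloaded  # early exit
    return downloaded
```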

🐞 Troubleshooting

  • Windows: Ensure yt-dlp.exe is in the downloader/ folder
  • macOS: Run python3 setup_macos.py to set up yt-dlp and FFmpeg
  • Check logs/ for error details
  • Use python -m karaoke_downloader.check_resolution to verify video quality
  • If you see errors about ffmpeg/ffprobe, install ffmpeg and ensure it is in your PATH
  • For best fuzzy matching, install rapidfuzz: pip install rapidfuzz (otherwise falls back to slower, less accurate difflib)
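
The optional-dependency fallback mentioned in the last point can be sketched as follows. The helper names similarity and best_match are hypothetical and this is not the code in fuzzy_matcher.py; the two libraries can also produce slightly different scores for the same pair of strings.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional

try:
    from rapidfuzz import fuzz

    def similarity(a: str, b: str) -> float:
        """Score from 0 to 100 using rapidfuzz."""
        return fuzz.ratio(a.lower(), b.lower())
except ImportError:
    def similarity(a: str, b: str) -> float:
        """Score from 0 to 100 using the standard library fallback."""
        return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(query: str, candidates: Iterable[str], threshold: float = 85) -> Optional[str]:
    """Return the closest candidate at or above the threshold, else None."""
    best, best_score = None, -1.0
    for candidate in candidates:
        score = similarity(query, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None
```

The default threshold of 85 matches the documented default for --fuzzy-threshold.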

Happy Karaoke! 🎤