Compare commits

1 Commits

Author SHA1 Message Date
bed46ff2d2 Signed-off-by: Matt Bruce <mbrucedogs@gmail.com> 2025-08-05 12:59:09 -05:00
61 changed files with 180049 additions and 444110 deletions

3
.gitignore vendored

@@ -14,6 +14,9 @@ logs/
*.log *.log
# Tracking and cache files # Tracking and cache files
karaoke_tracking.json
karaoke_tracking.json.backup
songlist_tracking.json
*.cache *.cache
# yt-dlp temporary files # yt-dlp temporary files

385
PRD.md

@@ -1,8 +1,8 @@
# 🎤 Karaoke Video Downloader PRD (v3.4.4) # 🎤 Karaoke Video Downloader PRD (v3.5)
## ✅ Overview ## ✅ Overview
A Python-based cross-platform CLI tool to download karaoke videos from YouTube channels/playlists using `yt-dlp`, with advanced tracking, songlist prioritization, and flexible configuration. Supports Windows and macOS with automatic platform detection. The codebase has been comprehensively refactored into a modular architecture with centralized utilities for improved maintainability, error handling, and code reuse. A Python-based cross-platform CLI tool to download karaoke videos from YouTube channels/playlists using `yt-dlp`, with advanced tracking, songlist prioritization, and flexible configuration. Supports Windows, macOS, and Linux with automatic platform detection and optimized caching. The codebase has been comprehensively refactored into a modular architecture with centralized utilities for improved maintainability, error handling, and code reuse.
--- ---
@@ -63,7 +63,7 @@ The codebase has been refactored into focused modules with centralized utilities
--- ---
## ⚙️ Platform & Stack ## ⚙️ Platform & Stack
- **Platform:** Windows, macOS - **Platform:** Windows, macOS, Linux
- **Interface:** Command-line (CLI) - **Interface:** Command-line (CLI)
- **Tech Stack:** Python 3.7+, yt-dlp (platform-specific binary), mutagen (for ID3 tagging) - **Tech Stack:** Python 3.7+, yt-dlp (platform-specific binary), mutagen (for ID3 tagging)
@@ -101,7 +101,6 @@ python download_karaoke.py --clear-cache SingKingKaraoke
- ✅ Songlist integration: prioritize and track custom songlists - ✅ Songlist integration: prioritize and track custom songlists
- ✅ Songlist-only mode: download only songs from the songlist - ✅ Songlist-only mode: download only songs from the songlist
- ✅ Songlist focus mode: download only songs from specific playlists by title - ✅ Songlist focus mode: download only songs from specific playlists by title
- ✅ Force download mode: bypass all existing file checks and re-download songs regardless of server duplicates or existing files
- ✅ Global songlist tracking to avoid duplicates across channels - ✅ Global songlist tracking to avoid duplicates across channels
- ✅ ID3 tagging for artist/title in MP4 files (mutagen) - ✅ ID3 tagging for artist/title in MP4 files (mutagen)
- ✅ Real-time progress and detailed logging - ✅ Real-time progress and detailed logging
@@ -123,8 +122,6 @@ python download_karaoke.py --clear-cache SingKingKaraoke
- ✅ **Centralized file operations**: Single source of truth for filename sanitization, file validation, and path operations - ✅ **Centralized file operations**: Single source of truth for filename sanitization, file validation, and path operations
- ✅ **Centralized song validation**: Unified logic for checking if songs should be downloaded across all modules - ✅ **Centralized song validation**: Unified logic for checking if songs should be downloaded across all modules
- ✅ **Enhanced configuration management**: Structured configuration with dataclasses, type safety, and validation - ✅ **Enhanced configuration management**: Structured configuration with dataclasses, type safety, and validation
- ✅ **Manual video collection**: Static video collection system for managing individual karaoke videos that don't belong to regular channels. Use `--manual` to download from `data/manual_videos.json`.
- ✅ **Channel-specific parsing rules**: JSON-based configuration for parsing video titles from different YouTube channels, with support for various title formats and cleanup rules.
--- ---
@@ -152,34 +149,21 @@ KaroakeVideoDownloader/
│ ├── check_resolution.py # Resolution checker utility │ ├── check_resolution.py # Resolution checker utility
│ ├── resolution_cli.py # Resolution config CLI │ ├── resolution_cli.py # Resolution config CLI
│ └── tracking_cli.py # Tracking management CLI │ └── tracking_cli.py # Tracking management CLI
├── config/ # Configuration files ├── data/ # All config, tracking, cache, and songlist files
│ └── config.json # Main configuration file │ ├── config.json
├── data/ # All tracking, cache, and songlist files
│ ├── karaoke_tracking.json │ ├── karaoke_tracking.json
│ ├── songlist_tracking.json │ ├── songlist_tracking.json
│ ├── channel_cache.json │ ├── channel_cache.json
│ ├── channels.json # Channel configuration with parsing rules │ ├── channels.txt
│ ├── manual_videos.json # Manual video collection
│ └── songList.json │ └── songList.json
├── utilities/ # Utility scripts and tools
│ ├── add_manual_video.py # Manual video management
│ ├── build_cache_from_raw.py # Cache building utility
│ ├── cleanup_duplicate_files.py # File cleanup utilities
│ ├── cleanup_recent_tracking.py # Tracking cleanup utilities
│ ├── deduplicate_songlist_tracking.py # Data deduplication
│ ├── fix_artist_name_format.py # Data cleanup utilities
│ ├── fix_artist_name_format_simple.py
│ ├── fix_code_quality.py # Development tools
│ ├── reset_and_redownload.py # Maintenance utilities
│ └── songlist_report.py # Reporting utilities
├── downloads/ # All video output ├── downloads/ # All video output
│ └── [ChannelName]/ # Per-channel folders │ └── [ChannelName]/ # Per-channel folders
├── logs/ # Download logs ├── logs/ # Download logs
├── downloader/yt-dlp.exe # yt-dlp binary (Windows) ├── downloader/yt-dlp.exe # yt-dlp binary (Windows)
├── downloader/yt-dlp_macos # yt-dlp binary (macOS) ├── downloader/yt-dlp_macos # yt-dlp binary (macOS)
├── src/tests/ # Test scripts ├── downloader/yt-dlp # yt-dlp binary (Linux)
│ ├── test_macos.py # macOS setup and functionality tests ├── tests/ # Diagnostic and test scripts
│ └── test_platform.py # Platform detection tests │ └── test_installation.py
├── download_karaoke.py # Main entry point (thin wrapper) ├── download_karaoke.py # Main entry point (thin wrapper)
├── README.md ├── README.md
├── PRD.md ├── PRD.md
@@ -194,8 +178,6 @@ KaroakeVideoDownloader/
- `--songlist-priority`: Prioritize songlist songs in download queue - `--songlist-priority`: Prioritize songlist songs in download queue
- `--songlist-only`: Download only songs from the songlist - `--songlist-only`: Download only songs from the songlist
- `--songlist-focus <PLAYLIST_TITLE1> <PLAYLIST_TITLE2>...`: Focus on specific playlists by title (e.g., `--songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100"`) - `--songlist-focus <PLAYLIST_TITLE1> <PLAYLIST_TITLE2>...`: Focus on specific playlists by title (e.g., `--songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100"`)
- `--songlist-file <FILE_PATH>`: Custom songlist file path to use with --songlist-focus (default: data/songList.json)
- `--force`: **Force download from channels, bypassing all existing file checks and re-downloading if necessary**
- `--songlist-status`: Show songlist download progress - `--songlist-status`: Show songlist download progress
- `--limit <N>`: Limit number of downloads (enables fast mode with early exit) - `--limit <N>`: Limit number of downloads (enables fast mode with early exit)
- `--resolution <720p|1080p|...>`: Override resolution - `--resolution <720p|1080p|...>`: Override resolution
@@ -208,11 +190,7 @@ KaroakeVideoDownloader/
- `--fuzzy-match`: **Enable fuzzy matching for songlist-to-video matching (uses rapidfuzz if available)** - `--fuzzy-match`: **Enable fuzzy matching for songlist-to-video matching (uses rapidfuzz if available)**
- `--fuzzy-threshold <N>`: **Fuzzy match threshold (0-100, default 85)** - `--fuzzy-threshold <N>`: **Fuzzy match threshold (0-100, default 85)**
- `--parallel`: **Enable parallel downloads for improved speed** - `--parallel`: **Enable parallel downloads for improved speed**
- `--workers <N>`: **Number of parallel download workers (1-10, default: 3, only used with --parallel)** - `--workers <N>`: **Number of parallel download workers (1-10, default: 3)**
- `--manual`: **Download from manual videos collection (data/manual_videos.json)**
- `--channel-focus <CHANNEL_NAME>`: **Download from a specific channel by name (e.g., 'SingKingKaraoke')**
- `--all-videos`: **Download all videos from channel (not just songlist matches), skipping existing files and songs in songs.json**
- `--dry-run`: **Build download plan and show what would be downloaded without actually downloading anything**
--- ---
@@ -223,8 +201,6 @@ KaroakeVideoDownloader/
- **ID3 Tagging:** Artist/title extracted from video title and embedded in MP4 files. - **ID3 Tagging:** Artist/title extracted from video title and embedded in MP4 files.
- **Cleanup:** Extra files from yt-dlp (e.g., `.info.json`) are automatically removed after download. - **Cleanup:** Extra files from yt-dlp (e.g., `.info.json`) are automatically removed after download.
- **Reset/Clear:** Use `--reset-channel` to reset all tracking and files for a channel (optionally including songlist songs with `--reset-songlist`). Use `--clear-cache` to clear cached video lists for a channel or all channels. - **Reset/Clear:** Use `--reset-channel` to reset all tracking and files for a channel (optionally including songlist songs with `--reset-songlist`). Use `--clear-cache` to clear cached video lists for a channel or all channels.
- **Channel-Specific Parsing:** Uses `data/channels.json` to define parsing rules for each YouTube channel, handling different video title formats (e.g., "Artist - Title", "Artist Title", "Title | Artist", etc.).
- **Manual Video Collection:** Static video management system using `data/manual_videos.json` for individual karaoke videos that don't belong to regular channels. Accessible via `--manual` parameter.
## 🔧 Refactoring Improvements (v3.3) ## 🔧 Refactoring Improvements (v3.3)
The codebase has been comprehensively refactored to improve maintainability and reduce code duplication. Recent improvements have enhanced reliability, performance, and code organization: The codebase has been comprehensively refactored to improve maintainability and reduce code duplication. Recent improvements have enhanced reliability, performance, and code organization:
@@ -278,7 +254,7 @@ The codebase has been comprehensively refactored to improve maintainability and
### **New Parallel Download System (v3.4)** ### **New Parallel Download System (v3.4)**
- **Parallel downloader module:** `parallel_downloader.py` provides thread-safe concurrent download management - **Parallel downloader module:** `parallel_downloader.py` provides thread-safe concurrent download management
- **Configurable concurrency:** Use `--parallel` to enable parallel downloads with 3 workers by default, or `--parallel --workers N` for custom worker count (1-10) - **Configurable concurrency:** Use `--parallel --workers N` to enable parallel downloads with N workers (1-10)
- **Thread-safe operations:** All tracking, caching, and progress operations are thread-safe - **Thread-safe operations:** All tracking, caching, and progress operations are thread-safe
- **Real-time progress tracking:** Shows active downloads, completion status, and overall progress - **Real-time progress tracking:** Shows active downloads, completion status, and overall progress
- **Automatic retry mechanism:** Failed downloads are automatically retried with reduced concurrency - **Automatic retry mechanism:** Failed downloads are automatically retried with reduced concurrency
@@ -286,245 +262,16 @@ The codebase has been comprehensively refactored to improve maintainability and
- **Performance improvements:** Significantly faster downloads for large batches (3-5x speedup with 3-5 workers) - **Performance improvements:** Significantly faster downloads for large batches (3-5x speedup with 3-5 workers)
- **Integrated with all modes:** Works with both songlist-across-channels and latest-per-channel download modes - **Integrated with all modes:** Works with both songlist-across-channels and latest-per-channel download modes
--- ### **Cross-Platform Support (v3.5)**
- **Platform detection:** Automatic detection of Windows, macOS, and Linux systems
## 🚀 Future Enhancements - **Flexible yt-dlp integration:** Supports both binary files and pip-installed yt-dlp modules
- [ ] Web UI for easier management - **Platform-specific configuration:** Automatic selection of appropriate yt-dlp binary/command for each platform
- [ ] More advanced song matching (multi-language) - **Setup automation:** `setup_platform.py` script for easy platform-specific setup
- [ ] Download scheduling and retry logic - **Command parsing:** Intelligent parsing of yt-dlp commands (file paths vs. module commands)
- [ ] More granular status reporting - **Enhanced documentation:** Platform-specific setup instructions and troubleshooting
- [x] **Parallel downloads for improved speed** ✅ **COMPLETED** - **Backward compatibility:** Maintains full compatibility with existing Windows installations
- [x] **Enhanced fuzzy matching with improved video title parsing** ✅ **COMPLETED** - **FFmpeg integration:** Automatic FFmpeg installation and configuration for optimal video processing
- [x] **Consolidated extract_artist_title function** ✅ **COMPLETED** - **Optimized caching:** Enhanced channel video caching with format compatibility and instant video list loading
- [x] **Duplicate file prevention and filename consistency** ✅ **COMPLETED**
- [ ] Unit tests for all modules
- [ ] Integration tests for end-to-end workflows
- [ ] Plugin system for custom file operations
- [ ] Advanced configuration UI
- [ ] Real-time download progress visualization
## 🔧 Recent Bug Fixes & Improvements (v3.4.1)
### **Enhanced Fuzzy Matching (v3.4.1)**
- **Improved `extract_artist_title` function**: Enhanced to handle multiple video title formats beyond simple "Artist - Title" patterns
- **"Title Karaoke | Artist Karaoke Version" format**: Correctly parses titles like "Hold On Loosely Karaoke | 38 Special Karaoke Version"
- **"Title Artist KARAOKE" format**: Handles titles ending with "KARAOKE" and attempts to extract artist information
- **Fallback handling**: Returns empty artist and full title for unparseable formats
- **Consolidated function usage**: Removed duplicate `extract_artist_title` implementations across modules
- **Single source of truth**: All modules now import from `fuzzy_matcher.py`
- **Consistent parsing**: Eliminated inconsistencies between different parsing implementations
- **Better maintainability**: Changes to parsing logic only need to be made in one place
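A minimal sketch of the parsing order described above, assuming a simple rule chain. The name `extract_artist_title_sketch` and its regular expressions are hypothetical and are not the project's `extract_artist_title`; the "Title Artist KARAOKE" heuristic is deliberately left out, so such titles fall through to the fallback.

```python
import re
from typing import Tuple


def extract_artist_title_sketch(video_title: str) -> Tuple[str, str]:
    """Hypothetical parser: returns (artist, title) for a few known formats."""
    title = video_title.strip()

    # Format: "Title Karaoke | Artist Karaoke Version"
    if "|" in title:
        left, right = (part.strip() for part in title.split("|", 1))
        song = re.sub(r"\s*karaoke\s*$", "", left, flags=re.IGNORECASE)
        artist = re.sub(r"\s*karaoke version\s*$", "", right, flags=re.IGNORECASE)
        return artist.strip(), song.strip()

    # Format: "Artist - Title"
    if " - " in title:
        artist, song = title.split(" - ", 1)
        return artist.strip(), song.strip()

    # Fallback: empty artist and the full title, as the list above describes
    return "", title
```

Keeping one function like this in a single module is what lets every caller parse the same title the same way.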
### **Fixed Import Conflicts**
- **Resolved import conflict in `download_planner.py`**: Updated to use the enhanced `extract_artist_title` from `fuzzy_matcher.py` instead of the simpler version from `id3_utils.py`
- **Updated `id3_utils.py`**: Now imports `extract_artist_title` from `fuzzy_matcher.py` for consistency
### **Enhanced --limit Parameter**
- **Fixed limit application**: The `--limit` parameter now correctly applies to the scanning phase, not just the download execution
- **Improved performance**: When using `--limit N`, only the first N songs are scanned against channels, significantly reducing processing time for large songlists
### **Benefits of Recent Improvements**
- **Better matching accuracy**: Enhanced fuzzy matching can now handle a wider variety of video title formats commonly found on YouTube karaoke channels
- **Reduced false negatives**: Songs that previously couldn't be matched due to title format differences now have a higher chance of being found
- **Consistent behavior**: All parts of the system use the same parsing logic, eliminating edge cases where different modules would parse the same title differently
- **Improved performance**: The `--limit` parameter now works as expected, providing faster processing for targeted downloads
- **Cleaner codebase**: Eliminated duplicate code and import conflicts, making the system more maintainable
## 🔧 Recent Bug Fixes & Improvements (v3.4.2)
### **Duplicate File Prevention & Filename Consistency**
- **Enhanced file existence checking**: `check_file_exists_with_patterns()` now detects files with `(2)`, `(3)`, etc. suffixes that yt-dlp creates
- **Automatic duplicate prevention**: Download pipeline skips downloads when files already exist (including duplicates)
- **Updated yt-dlp configuration**: Set `"nooverwrites": false` to prevent yt-dlp from creating duplicate files with suffixes
- **Cleanup utility**: `data/cleanup_duplicate_files.py` provides interactive cleanup of existing duplicate files
- **Filename vs ID3 tag consistency**: Removed "(Karaoke Version)" suffix from ID3 tags to match filenames exactly
- **Unified parsing**: Both filename generation and ID3 tagging use the same artist/title extraction logic
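The suffix-aware existence check could look roughly like the sketch below. It assumes duplicates are named `<stem> (N)<ext>` as described above; the helper names `is_same_or_numbered_copy` and `find_existing_copies` are illustrative and are not the signature of `check_file_exists_with_patterns()`.

```python
import re
from pathlib import Path
from typing import List


def is_same_or_numbered_copy(candidate_name: str, filename: str) -> bool:
    """True if candidate_name is filename itself or a ' (N)' numbered copy of it."""
    target = Path(filename)
    pattern = re.compile(
        re.escape(target.stem) + r"(?: \(\d+\))?" + re.escape(target.suffix)
    )
    return pattern.fullmatch(candidate_name) is not None


def find_existing_copies(directory: Path, filename: str) -> List[Path]:
    """List files in directory that would make a new download redundant."""
    if not directory.is_dir():
        return []
    return sorted(
        path
        for path in directory.iterdir()
        if path.is_file() and is_same_or_numbered_copy(path.name, filename)
    )
```

A download step would skip the video whenever `find_existing_copies(...)` returns a non-empty list.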
### **Benefits of Duplicate Prevention**
- **No more duplicate files**: Eliminates `(2)`, `(3)` suffix files that waste disk space
- **Consistent metadata**: Filename and ID3 tag use identical artist/title format
- **Efficient disk usage**: Prevents unnecessary downloads of existing files
- **Clear file identification**: Consistent naming across all file operations
## 🛠️ Maintenance
### **Regular Cleanup**
- Run the cleanup utility periodically to remove any duplicate files
- Monitor downloads for any new duplicate creation (should be rare with fixes)
### **Configuration**
- Keep `"nooverwrites": false` in `data/config.json`
- This prevents yt-dlp from creating duplicate files
### **Monitoring**
- Check logs for "⏭️ Skipping download - file already exists" messages
- These indicate the duplicate prevention is working correctly
## 🔧 Recent Bug Fixes & Improvements (v3.4.3)
### **Manual Video Collection System**
- **New `--manual` parameter**: Simple access to manual video collection via `python download_karaoke.py --manual --limit 5`
- **Static video management**: `data/manual_videos.json` stores individual karaoke videos that don't belong to regular channels
- **Helper script**: `add_manual_video.py` provides easy management of manual video entries
- **Full integration**: Manual videos work with all existing features (songlist matching, fuzzy matching, parallel downloads, etc.)
- **No yt-dlp dependency**: Manual videos bypass YouTube API calls for video listing, using static data instead
### **Channel-Specific Parsing Rules**
- **JSON-based configuration**: `data/channels.json` replaces `data/channels.txt` with structured channel configuration
- **Parsing rules per channel**: Each channel can define custom parsing rules for video titles
- **Multiple format support**: Handles various title formats like "Artist - Title", "Artist Title", "Title | Artist", etc.
- **Suffix cleanup**: Automatic removal of common karaoke-related suffixes
- **Multi-artist support**: Parsing for titles with multiple artists separated by specific delimiters
- **Backward compatibility**: Still supports legacy `data/channels.txt` format
### **Benefits of New Features**
- **Flexible video management**: Easy addition of individual karaoke videos without creating new channels
- **Accurate parsing**: Channel-specific rules ensure correct artist/title extraction for ID3 tags and filenames
- **Consistent metadata**: Proper parsing prevents filename and ID3 tag inconsistencies
- **Easy maintenance**: Simple JSON structure for managing both channels and manual videos
- **Full feature compatibility**: Manual videos work seamlessly with existing download modes and features
## 📚 Documentation Standards
### **Documentation Location**
- **All changes, refactoring, and improvements should be documented in the PRD.md and README.md files**
- **Do NOT create separate .md files for documenting changes, refactoring, or improvements**
- **Use the existing sections in PRD.md and README.md to track all project evolution**
### **Where to Document Changes**
- **PRD.md**: Technical details, architecture changes, bug fixes, and implementation specifics
- **README.md**: User-facing features, usage instructions, and high-level improvements
- **CHANGELOG.md**: Version-specific release notes and change summaries
### **Documentation Requirements**
- **All new features must be documented in both PRD.md and README.md**
- **All refactoring efforts must be documented in the appropriate sections**
- **All bug fixes must be documented with technical details**
- **Version numbers and dates should be clearly marked**
- **Benefits and improvements should be explicitly stated**
### **Maintenance Responsibility**
- **Keep PRD.md and README.md synchronized with code changes**
- **Update documentation immediately when implementing new features**
- **Remove outdated information and consolidate related changes**
- **Ensure all CLI options and features are documented in both files**
## 🔧 Recent Bug Fixes & Improvements (v3.4.4)
### **All Videos Download Mode**
- **New `--all-videos` parameter**: Download all videos from a channel, not just songlist matches
- **Smart MP3/MP4 detection**: Automatically detects if you have MP3 versions in songs.json and downloads MP4 video versions
- **Existing file skipping**: Skips videos that already exist on the filesystem
- **Progress tracking**: Shows clear progress with "Downloading X/Y videos" format
- **Parallel processing support**: Works with `--parallel --workers N` for faster downloads
- **Channel focus integration**: Works with `--channel-focus` to target specific channels
- **Limit support**: Works with `--limit N` to control download batch size
### **Smart Songlist Integration**
- **MP4 version detection**: Checks if MP4 version already exists in songs.json before downloading
- **MP3 upgrade path**: Downloads MP4 video versions when only MP3 versions exist in songlist
- **Duplicate prevention**: Skips downloads when MP4 versions already exist
- **Efficient filtering**: Only processes videos that need to be downloaded
### **Benefits of All Videos Mode**
- **Complete channel downloads**: Download entire channels without songlist restrictions
- **Automatic format upgrading**: Upgrade MP3 collections to MP4 video versions
- **Efficient processing**: Only downloads videos that don't already exist
- **Flexible control**: Use with limits, parallel processing, and channel targeting
- **Clear progress feedback**: Real-time progress tracking for large downloads
## 🔧 Recent Bug Fixes & Improvements (v3.4.5)
### **Unified Download Workflow Architecture**
- **Unified execution pipeline**: All download modes now use the same execution workflow, eliminating inconsistencies and broken pipelines
- **Consistent behavior**: All modes (--channel-focus, --all-videos, --songlist-only, --latest-per-channel) use identical download execution, progress tracking, and error handling
- **Centralized download logic**: Single `execute_unified_download_workflow()` method handles all download execution
- **Automatic parallel support**: All download modes automatically support `--parallel --workers N` without additional implementation
- **Unified cache management**: Consistent progress tracking and resume functionality across all modes
### **Architecture Pattern for New Download Modes**
When adding new download modes in the future, follow this pattern to ensure consistency:
#### **1. Download Plan Building (Mode-Specific)**
Each download mode should build a download plan (list of videos to download) with this structure:
```python
download_plan = [
    {
        "video_id": "video_id",
        "artist": "artist_name",
        "title": "song_title",
        "filename": "sanitized_filename.mp4",
        "channel_name": "channel_name",
        "video_title": "original_video_title",
        "force_download": False
    }
]
```
#### **2. Unified Execution (Shared)**
All modes should use the unified execution workflow:
```python
downloaded_count, success = self.execute_unified_download_workflow(
    download_plan=download_plan,
    cache_file=cache_file,  # Optional, for progress tracking
    limit=limit,  # Optional, for limiting downloads
    show_progress=True,  # Optional, for progress display
)
```
#### **3. Execution Method Selection (Automatic)**
The unified workflow automatically chooses execution method based on settings:
- **Sequential**: Uses `DownloadPipeline` for single-threaded downloads
- **Parallel**: Uses `ParallelDownloader` when `--parallel` is enabled
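As a rough illustration of that selection, the sketch below dispatches a plan either sequentially or through a thread pool. It is not the project's `DownloadPipeline` or `ParallelDownloader`: `execute_plan_sketch` and the `download_one` callback are hypothetical names, and the callback is assumed to be thread-safe and to return `True` on success.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Tuple


def execute_plan_sketch(
    download_plan: List[Dict],
    download_one: Callable[[Dict], bool],
    parallel: bool = False,
    workers: int = 3,
    limit: Optional[int] = None,
) -> Tuple[int, bool]:
    """Run every plan item and return (downloaded_count, success)."""
    items = download_plan[:limit] if limit is not None else list(download_plan)

    if parallel:
        # Clamp to the documented 1-10 worker range.
        with ThreadPoolExecutor(max_workers=max(1, min(workers, 10))) as pool:
            results = list(pool.map(download_one, items))
    else:
        results = [download_one(item) for item in items]

    downloaded = sum(1 for ok in results if ok)
    return downloaded, downloaded == len(items)
```

Because both branches consume the same plan structure, a new mode only has to build the plan.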
#### **4. Required Implementation Pattern**
```python
def download_new_mode(self, ...):
    """New download mode implementation."""
    # 1. Build download plan (mode-specific logic)
    download_plan = []
    for video in videos_to_download:
        download_plan.append({
            "video_id": video["id"],
            "artist": artist,
            "title": title,
            "filename": filename,
            "channel_name": channel_name,
            "video_title": video["title"],
            "force_download": force_download
        })
    # 2. Create cache file (optional, for progress tracking)
    cache_file = get_download_plan_cache_file("new_mode", **plan_kwargs)
    save_plan_cache(cache_file, download_plan, [])
    # 3. Use unified execution workflow
    downloaded_count, success = self.execute_unified_download_workflow(
        download_plan=download_plan,
        cache_file=cache_file,
        limit=limit,
        show_progress=True,
    )
    return success
```
### **Benefits of Unified Architecture**
- **Consistency**: All modes behave identically for execution, progress tracking, and error handling
- **Maintainability**: Changes to download execution only need to be made in one place
- **Reliability**: Eliminates broken pipelines and inconsistent behavior between modes
- **Extensibility**: New modes automatically get all existing features (parallel downloads, progress tracking, etc.)
- **Testing**: Easier to test since all modes use the same execution logic
### **What Was Fixed**
- **Broken Pipeline**: Previously, different modes used different execution paths, leading to inconsistencies
- **Missing Method**: Added missing `download_latest_per_channel()` method that was referenced in CLI but not implemented
- **Code Duplication**: Eliminated duplicate download execution logic across different modes
- **Inconsistent Behavior**: All modes now have identical progress tracking, error handling, and cache management
### **Future Development Guidelines**
1. **NEVER implement custom download execution logic** in new download modes
2. **ALWAYS use `execute_unified_download_workflow()`** for download execution
3. **Focus on download plan building** - that's where mode-specific logic belongs
4. **Use the standard download plan structure** for consistency
5. **Implement cache file handling** for progress tracking and resume functionality
6. **Test with both sequential and parallel modes** to ensure compatibility
--- ---
@@ -534,97 +281,9 @@ def download_new_mode(self, ...):
- [ ] Download scheduling and retry logic - [ ] Download scheduling and retry logic
- [ ] More granular status reporting - [ ] More granular status reporting
- [x] **Parallel downloads for improved speed** ✅ **COMPLETED** - [x] **Parallel downloads for improved speed** ✅ **COMPLETED**
- [x] **Enhanced fuzzy matching with improved video title parsing** ✅ **COMPLETED** - [x] **Cross-platform support (Windows, macOS, Linux)** ✅ **COMPLETED**
- [x] **Consolidated extract_artist_title function** ✅ **COMPLETED**
- [x] **Duplicate file prevention and filename consistency** ✅ **COMPLETED**
- [ ] Unit tests for all modules - [ ] Unit tests for all modules
- [ ] Integration tests for end-to-end workflows - [ ] Integration tests for end-to-end workflows
- [ ] Plugin system for custom file operations - [ ] Plugin system for custom file operations
- [ ] Advanced configuration UI - [ ] Advanced configuration UI
- [ ] Real-time download progress visualization - [ ] Real-time download progress visualization
## 🔧 Recent Bug Fixes & Improvements (v3.4.4)
### **macOS Support with Automatic Platform Detection**
- **Cross-platform compatibility**: Added support for macOS alongside Windows
- **Automatic platform detection**: Detects operating system and selects appropriate yt-dlp binary
- **Flexible yt-dlp integration**: Supports both binary files (`yt-dlp_macos`) and pip installation (`python3 -m yt_dlp`)
- **Setup automation**: `setup_macos.py` script for easy macOS setup with FFmpeg and yt-dlp installation
- **Command parsing**: Intelligent parsing of yt-dlp commands (file paths vs. module commands)
- **Enhanced validation**: Platform-specific error messages and validation in CLI
- **Backward compatibility**: Maintains full compatibility with existing Windows installations
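A minimal sketch of platform-based selection, assuming the binary locations shown in the project tree (`downloader/yt-dlp.exe`, `downloader/yt-dlp_macos`, `downloader/yt-dlp`) and the `python3 -m yt_dlp` module fallback mentioned above. The function name `select_ytdlp_command` and the fallback order are assumptions, not the project's actual logic.

```python
import platform
import shlex
from pathlib import Path
from typing import List, Optional

# Binary paths taken from the project tree; the mapping itself is illustrative.
BINARY_BY_SYSTEM = {
    "Windows": "downloader/yt-dlp.exe",
    "Darwin": "downloader/yt-dlp_macos",
    "Linux": "downloader/yt-dlp",
}
MODULE_COMMAND = "python3 -m yt_dlp"


def select_ytdlp_command(
    system: Optional[str] = None, base_dir: Path = Path(".")
) -> List[str]:
    """Return an argv list: the bundled binary if present, else the pip module."""
    system = system or platform.system()
    relative = BINARY_BY_SYSTEM.get(system)
    if relative is not None:
        binary = base_dir / relative
        if binary.is_file():
            # A file path is used as a single argv element.
            return [str(binary)]
    # A module command is split into separate argv elements.
    return shlex.split(MODULE_COMMAND)
```

Returning a list keeps the result directly usable with `subprocess.run`, whichever form was chosen.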
### **Benefits of macOS Support**
- **Native macOS experience**: No need for Windows compatibility layers or virtualization
- **Automatic setup**: Simple setup script handles all dependencies
- **Flexible installation**: Choose between binary download or pip installation
- **Consistent functionality**: All features work identically on both platforms
- **Easy maintenance**: Platform detection handles configuration automatically
### **Setup Instructions**
```bash
# Automatic setup (recommended)
python3 setup_macos.py
# Test installation
python3 src/tests/test_macos.py
# Manual setup options
# 1. Install yt-dlp via pip: pip3 install yt-dlp
# 2. Download binary: curl -L -o downloader/yt-dlp_macos https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp_macos
# 3. Install FFmpeg: brew install ffmpeg
```
## 🔧 Recent Bug Fixes & Improvements (v3.4.7)
### **Configurable Data Directory Path**
- **Centralized Data Path Management**: New `data_path_manager.py` module provides unified data directory path management
- **Configurable Location**: Data directory path can be set in `config/config.json` under `folder_structure.data_dir`
- **Backward Compatibility**: Defaults to "data" directory if not configured
- **Cross-Project Integration**: Enables the karaoke downloader to be used as a component in other projects with different data directory structures
- **Updated All Modules**: All modules now use the data path manager instead of hardcoded "data/" paths
- **Utility Functions**: Provides `get_data_path()`, `get_data_dir()`, and `get_data_path_manager()` functions for easy access
- **Fixed Circular Dependency**: Moved `config.json` from `data/` to root directory to resolve chicken-and-egg problem
### **Benefits of Configurable Data Directory**
- **Flexible Deployment**: Can be integrated into other projects with different directory structures
- **Centralized Configuration**: Single point of configuration for all data file paths
- **Maintainable Code**: Eliminates hardcoded paths throughout the codebase
- **Easy Testing**: Can use temporary directories for testing without affecting production data
- **Future-Proof**: Makes it easier to change data directory structure in the future
### **Circular Dependency Solution**
The original implementation had a circular dependency problem:
- **Problem**: `config.json` was located in the `data/` directory
- **Issue**: To read the config file, we needed to know where the data directory is
- **Conflict**: But the data directory location is specified in the config file
- **Solution**: Moved `config.json` to the `config/` directory as a fixed location
- **Result**: Config file is always accessible in a dedicated config directory, and data directory can be configured within it
- **Backward Compatibility**: System still works with config files in custom data directories when explicitly specified
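The resolution order can be sketched as follows, assuming the config lives at the fixed path `config/config.json` and the data directory is read from `folder_structure.data_dir` with a default of `"data"`. The function names mirror those listed above, but their signatures and error handling here are assumptions.

```python
import json
from pathlib import Path

# Fixed location, so the config can be found before the data directory is known.
CONFIG_PATH = Path("config/config.json")


def get_data_dir(config_path: Path = CONFIG_PATH) -> Path:
    """Read folder_structure.data_dir from the config, defaulting to 'data'."""
    try:
        with open(config_path, encoding="utf-8") as handle:
            config = json.load(handle)
    except (OSError, json.JSONDecodeError):
        config = {}
    if not isinstance(config, dict):
        config = {}
    folder_structure = config.get("folder_structure")
    if not isinstance(folder_structure, dict):
        folder_structure = {}
    return Path(folder_structure.get("data_dir", "data"))


def get_data_path(filename: str, config_path: Path = CONFIG_PATH) -> Path:
    """Build the path of a data file inside the configured data directory."""
    return get_data_dir(config_path) / filename
```

With this shape, a test can point `config_path` at a temporary file instead of touching production data.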
## 🔧 Recent Bug Fixes & Improvements (v3.4.6)
### **Dry Run Mode**
- **New `--dry-run` parameter**: Build download plan and show what would be downloaded without actually downloading anything
- **Plan preview**: Shows total videos in plan and preview of first 5 videos
- **Safe testing**: Test download configurations without consuming bandwidth or disk space
- **All mode support**: Works with all download modes (--channel-focus, --all-videos, --songlist-only, --latest-per-channel)
- **Progress simulation**: Shows what the download process would look like without executing it
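A plan preview of this kind might look like the sketch below. The function name `preview_download_plan` and the exact output wording are assumptions made for illustration; only the "total plus first 5 videos" behavior comes from the notes above, and the `artist` and `title` keys are assumed to be present on each plan entry.

```python
from typing import Dict, List


def preview_download_plan(plan: List[Dict[str, str]], preview_count: int = 5) -> List[str]:
    """Hypothetical helper: describe a plan without downloading anything."""
    lines = [f"Dry run: {len(plan)} videos in plan"]
    # Show only the first few entries so large plans stay readable.
    for index, item in enumerate(plan[:preview_count], start=1):
        lines.append(f"  {index}. {item['artist']} - {item['title']}")
    remaining = len(plan) - preview_count
    if remaining > 0:
        lines.append(f"  ... and {remaining} more")
    return lines
```

Returning the lines instead of printing them keeps the preview easy to test and lets the caller decide whether to log or print.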
### **Benefits of Dry Run Mode**
- **Safe testing**: Test complex download configurations without downloading anything
- **Plan validation**: Verify that the download plan contains the expected videos
- **Configuration debugging**: Troubleshoot download settings before committing to downloads
- **Resource conservation**: Save bandwidth and disk space during testing
- **User education**: Help users understand what the tool will do before running it
### **Example Usage**
```bash
# Test songlist download plan
python download_karaoke.py --songlist-only --limit 5 --dry-run
# Test channel download plan
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --limit 10 --dry-run
# Test with fuzzy matching
python download_karaoke.py --songlist-only --fuzzy-match --limit 3 --dry-run
```
### **Future Development Guidelines**

README.md

@ -1,6 +1,6 @@
# 🎤 Karaoke Video Downloader # 🎤 Karaoke Video Downloader
A Python-based cross-platform CLI tool to download karaoke videos from YouTube channels/playlists using `yt-dlp`, with advanced tracking, songlist prioritization, and flexible configuration. Supports Windows and macOS with automatic platform detection. A Python-based cross-platform CLI tool to download karaoke videos from YouTube channels/playlists using `yt-dlp`, with advanced tracking, songlist prioritization, and flexible configuration. Supports Windows, macOS, and Linux with automatic platform detection, optimized caching, and FFmpeg integration.
## ✨ Features ## ✨ Features
- 🎵 **Channel & Playlist Downloads**: Download all videos from a YouTube channel or playlist - 🎵 **Channel & Playlist Downloads**: Download all videos from a YouTube channel or playlist
@ -13,7 +13,7 @@ A Python-based cross-platform CLI tool to download karaoke videos from YouTube c
- 📈 **Real-Time Progress**: Detailed console and log output - 📈 **Real-Time Progress**: Detailed console and log output
- 🧹 **Reset/Clear Channel**: Reset all tracking and files for a channel, or clear channel cache via CLI - 🧹 **Reset/Clear Channel**: Reset all tracking and files for a channel, or clear channel cache via CLI
- 🗂️ **Latest-per-channel download**: Download the latest N videos from each channel in a single batch, with server deduplication, fuzzy matching support, per-channel download plan, robust resume, and unique plan cache. Use --latest-per-channel and --limit N. - 🗂️ **Latest-per-channel download**: Download the latest N videos from each channel in a single batch, with server deduplication, fuzzy matching support, per-channel download plan, robust resume, and unique plan cache. Use --latest-per-channel and --limit N.
- 🧩 **Enhanced Fuzzy Matching**: Advanced fuzzy string matching for songlist-to-video matching with improved video title parsing (handles multiple title formats like "Title Karaoke | Artist Karaoke Version") - 🧩 **Fuzzy Matching**: Optionally use fuzzy string matching for songlist-to-video matching (with --fuzzy-match, requires rapidfuzz for best results)
- ⚡ **Fast Mode with Early Exit**: When a limit is set, scans channels and songs in order, downloads immediately when a match is found, and stops as soon as the limit is reached with successful downloads - ⚡ **Fast Mode with Early Exit**: When a limit is set, scans channels and songs in order, downloads immediately when a match is found, and stops as soon as the limit is reached with successful downloads
- 🔄 **Deduplication Across Channels**: Ensures the same song is not downloaded from multiple channels, even if it appears in more than one channel's video list - 🔄 **Deduplication Across Channels**: Ensures the same song is not downloaded from multiple channels, even if it appears in more than one channel's video list
- 📋 **Default Channel File**: Automatically uses data/channels.txt as the default channel list for songlist modes (no need to specify --file every time) - 📋 **Default Channel File**: Automatically uses data/channels.txt as the default channel list for songlist modes (no need to specify --file every time)
@ -21,20 +21,13 @@ A Python-based cross-platform CLI tool to download karaoke videos from YouTube c
- ⚡ **Optimized Scanning**: High-performance channel scanning with O(n×m) complexity, pre-processed lookups, and early termination for faster matching - ⚡ **Optimized Scanning**: High-performance channel scanning with O(n×m) complexity, pre-processed lookups, and early termination for faster matching
- 🏷️ **Server Duplicates Tracking**: Automatically checks against local songs.json file and marks duplicates for future skipping, preventing re-downloads of songs already on the server - 🏷️ **Server Duplicates Tracking**: Automatically checks against local songs.json file and marks duplicates for future skipping, preventing re-downloads of songs already on the server
- ⚡ **Parallel Downloads**: Enable concurrent downloads with `--parallel --workers N` for significantly faster batch downloads (3-5x speedup) - ⚡ **Parallel Downloads**: Enable concurrent downloads with `--parallel --workers N` for significantly faster batch downloads (3-5x speedup)
- 📊 **Unmatched Songs Reports**: Generate detailed reports of songs that couldn't be found in any channel with `--generate-unmatched-report` - 🌐 **Cross-Platform Support**: Automatic platform detection and yt-dlp integration for Windows, macOS, and Linux
- 🛡️ **Duplicate File Prevention**: Automatically detects and prevents duplicate files with `(2)`, `(3)` suffixes, with cleanup utility for existing duplicates - 🚀 **Optimized Caching**: Enhanced channel video caching with instant video list loading
- 🏷️ **Consistent Metadata**: Filename and ID3 tag use identical artist/title format for clear file identification - 🎬 **FFmpeg Integration**: Automatic FFmpeg installation and configuration for optimal video processing
- 🍎 **macOS Support**: Automatic platform detection and setup with native macOS binaries and FFmpeg integration
## 🏗️ Architecture ## 🏗️ Architecture
The codebase has been comprehensively refactored into a modular architecture with centralized utilities for improved maintainability, error handling, and code reuse: The codebase has been comprehensively refactored into a modular architecture with centralized utilities for improved maintainability, error handling, and code reuse:
### **Configurable Data Directory (v3.4.7)**
- **Centralized Data Path Management**: `data_path_manager.py` provides unified data directory path management
- **Configurable Location**: Data directory path can be set in `config/config.json` under `folder_structure.data_dir`
- **Backward Compatibility**: Defaults to "data" directory if not configured
- **Cross-Project Integration**: Enables the karaoke downloader to be used as a component in other projects with different data directory structures
### Core Modules: ### Core Modules:
- **`downloader.py`**: Main orchestrator and CLI interface - **`downloader.py`**: Main orchestrator and CLI interface
- **`video_downloader.py`**: Core video download execution and orchestration - **`video_downloader.py`**: Core video download execution and orchestration
@ -56,191 +49,90 @@ The codebase has been comprehensively refactored into a modular architecture wit
- **`tracking_cli.py`**: Tracking management CLI - **`tracking_cli.py`**: Tracking management CLI
### New Utility Modules (v3.3): ### New Utility Modules (v3.3):
- **`parallel_downloader.py`**: Parallel download management with thread-safe operations
- `ParallelDownloader` class: Manages concurrent downloads with configurable workers
- `DownloadTask` and `DownloadResult` dataclasses: Structured task and result management
- Thread-safe progress tracking and error handling
- Automatic retry mechanism for failed downloads
- **`file_utils.py`**: Centralized file operations, filename sanitization, and file validation - **`file_utils.py`**: Centralized file operations, filename sanitization, and file validation
- **`song_validator.py`**: Centralized song validation logic for checking if songs should be downloaded - `sanitize_filename()`: Create safe filenames from artist/title
- `generate_possible_filenames()`: Generate filename patterns for different modes
- `check_file_exists_with_patterns()`: Check for existing files using multiple patterns
- `is_valid_mp4_file()`: Validate MP4 files with header checking
- `cleanup_temp_files()`: Remove temporary yt-dlp files
- `ensure_directory_exists()`: Safe directory creation
### New Utility Modules (v3.4.7): - **`song_validator.py`**: Centralized song validation logic
- **`data_path_manager.py`**: Centralized data directory path management and file path resolution - `SongValidator` class: Unified logic for checking if songs should be downloaded
- `should_skip_song()`: Comprehensive validation with multiple criteria
- `mark_song_failed()`: Consistent failure tracking
- `handle_download_failure()`: Standardized error handling
### **Unified Download Workflow (v3.4.5)** - **Enhanced `config_manager.py`**: Robust configuration management with dataclasses
- **`execute_unified_download_workflow()`**: Centralized download execution that all modes use - `ConfigManager` class: Type-safe configuration loading and caching
- **`_execute_sequential_downloads()`**: Sequential download execution using DownloadPipeline - `DownloadSettings`, `FolderStructure`, `LoggingConfig` dataclasses
- **`_execute_parallel_downloads()`**: Parallel download execution using ParallelDownloader - Configuration validation and merging with defaults
- Dynamic resolution updates
### **Benefits of Enhanced Modular Architecture:** ### Benefits:
- **Single Responsibility**: Each module has a focused purpose
- **Centralized Utilities**: Common operations (file operations, song validation, yt-dlp commands, error handling) are centralized - **Centralized Utilities**: Common operations (file operations, song validation, yt-dlp commands, error handling) are centralized
- **Reduced Duplication**: Eliminated ~150 lines of code duplication across modules - **Reduced Duplication**: Eliminated ~150 lines of code duplication across modules
- **Testability**: Individual components can be tested separately
- **Maintainability**: Easier to find and fix issues
- **Reusability**: Components can be used independently
- **Robustness**: Better error handling and interruption recovery
- **Consistency**: Standardized error messages and processing pipelines - **Consistency**: Standardized error messages and processing pipelines
- **Maintainability**: Changes isolated to specific modules
- **Testability**: Modular components can be tested independently
- **Type Safety**: Comprehensive type hints across all new modules - **Type Safety**: Comprehensive type hints across all new modules
- **Unified Execution**: All download modes use the same execution pipeline for consistency
## 🔧 Development Guidelines
### **Adding New Download Modes**
When adding new download modes, follow the unified workflow pattern to ensure consistency:
#### **1. Build Download Plan (Mode-Specific)**
```python
def download_new_mode(self, ...):
    # Build download plan with standard structure
    download_plan = []
    for video in videos_to_download:
        download_plan.append({
            "video_id": video["id"],
            "artist": artist,
            "title": title,
            "filename": filename,
            "channel_name": channel_name,
            "video_title": video["title"],
            "force_download": force_download
        })
    # Use unified execution workflow
    downloaded_count, success = self.execute_unified_download_workflow(
        download_plan=download_plan,
        cache_file=cache_file,
        limit=limit,
        show_progress=True,
    )
    return success
```
#### **2. Key Principles**
- **NEVER implement custom download execution logic** - always use `execute_unified_download_workflow()`
- **Focus on download plan building** - that's where mode-specific logic belongs
- **Use the standard download plan structure** for consistency
- **Implement cache file handling** for progress tracking and resume functionality
- **Test with both sequential and parallel modes** to ensure compatibility
#### **3. Benefits of Unified Architecture**
- **Consistency**: All modes behave identically for execution, progress tracking, and error handling
- **Automatic Features**: New modes automatically get parallel downloads, progress tracking, and cache management
- **Maintainability**: Changes to download execution only need to be made in one place
- **Reliability**: Eliminates broken pipelines and inconsistent behavior between modes
## 🔧 Recent Improvements (v3.4.1)
### **Enhanced Fuzzy Matching**
- **Improved title parsing**: Enhanced `extract_artist_title` function to handle multiple video title formats
- **Better matching accuracy**: Can now parse titles like "Hold On Loosely Karaoke | 38 Special Karaoke Version"
- **Consistent parsing**: All modules now use the same parsing logic from `fuzzy_matcher.py`
- **Reduced false negatives**: Songs that previously couldn't be matched due to title format differences now have a higher chance of being found
### **Fixed Import Conflicts**
- **Resolved import conflicts**: Updated modules to use the enhanced `extract_artist_title` from `fuzzy_matcher.py`
- **Consistent behavior**: All parts of the system use the same parsing logic
- **Cleaner codebase**: Eliminated duplicate code and import conflicts
### **Fixed --limit Parameter**
- **Correct limit application**: The `--limit` parameter now properly limits the scanning phase, not just downloads
- **Improved performance**: When using `--limit N`, only the first N songs are scanned, significantly reducing processing time
- **Accurate logging**: Logging messages now show the correct counts for songs that will actually be processed when using `--limit`
### **Code Quality Improvements**
- **Eliminated duplicate functions**: Removed duplicate `extract_artist_title` implementations
- **Fixed import conflicts**: Resolved inconsistencies between different parsing implementations
- **Single source of truth**: All title parsing logic is now centralized in `fuzzy_matcher.py`
## 🔧 Recent Improvements (v3.4.5)
### **Unified Download Workflow Architecture**
- **Unified execution pipeline**: All download modes now use the same execution workflow, eliminating inconsistencies and broken pipelines
- **Consistent behavior**: All modes (--channel-focus, --all-videos, --songlist-only, --latest-per-channel) use identical download execution, progress tracking, and error handling
- **Centralized download logic**: Single `execute_unified_download_workflow()` method handles all download execution
- **Automatic parallel support**: All download modes automatically support `--parallel --workers N` without additional implementation
- **Unified cache management**: Consistent progress tracking and resume functionality across all modes
### **What Was Fixed**
- **Broken Pipeline**: Previously, different modes used different execution paths, leading to inconsistencies
- **Missing Method**: Added missing `download_latest_per_channel()` method that was referenced in CLI but not implemented
- **Code Duplication**: Eliminated duplicate download execution logic across different modes
- **Inconsistent Behavior**: All modes now have identical progress tracking, error handling, and cache management
### **Benefits**
- ✅ **Consistency**: All modes behave identically for execution, progress tracking, and error handling
- ✅ **Maintainability**: Changes to download execution only need to be made in one place
- ✅ **Reliability**: Eliminates broken pipelines and inconsistent behavior between modes
- ✅ **Extensibility**: New modes automatically get all existing features (parallel downloads, progress tracking, etc.)
- ✅ **Testing**: Easier to test since all modes use the same execution logic
## 🛡️ Duplicate File Prevention & Filename Consistency (v3.4.2)
### **Duplicate File Prevention**
- **Enhanced file existence checking**: Now detects files with `(2)`, `(3)`, etc. suffixes that yt-dlp creates
- **Automatic duplicate prevention**: Skips downloads when files already exist (including duplicates)
- **Updated yt-dlp configuration**: Set `"nooverwrites": false` to prevent yt-dlp from creating duplicate files
- **Cleanup utility**: `data/cleanup_duplicate_files.py` helps identify and remove existing duplicate files
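Detecting the `(2)`, `(3)` copies could be sketched as below. This is not the code in `cleanup_duplicate_files.py`; the regular expression, the function name `find_duplicate_files`, and the rule "a copy counts as a duplicate only if the un-suffixed original exists" are all assumptions chosen for illustration.

```python
import re
from pathlib import Path
from typing import List

# Matches stems such as "Artist - Title (2)"; the pattern itself is an assumption.
DUPLICATE_SUFFIX = re.compile(r"^(?P<base>.+) \((?P<n>\d+)\)$")


def find_duplicate_files(directory: Path) -> List[Path]:
    """Hypothetical helper: list "(N)" copies whose un-suffixed original exists."""
    duplicates = []
    for path in sorted(directory.glob("*.mp4")):
        match = DUPLICATE_SUFFIX.match(path.stem)
        if match and (directory / f"{match.group('base')}{path.suffix}").exists():
            duplicates.append(path)
    return duplicates
```

Requiring the original to exist avoids flagging legitimate titles that happen to end in a parenthesized number.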
### **Filename vs ID3 Tag Consistency**
- **Consistent metadata**: Filename and ID3 tag now use identical artist/title format
- **Removed extra suffixes**: No more "(Karaoke Version)" in ID3 tags that don't match filenames
- **Unified parsing**: Both filename generation and ID3 tagging use the same artist/title extraction
### **Benefits**
- ✅ **No more duplicate files** with `(2)`, `(3)` suffixes
- ✅ **Consistent metadata** between filename and ID3 tags
- ✅ **Efficient disk usage** by preventing unnecessary downloads
- ✅ **Clear file identification** with consistent naming
### **Clean Up Existing Duplicates**
```bash
# Run the cleanup utility to find and remove existing duplicates
python data/cleanup_duplicate_files.py
# Choose option 1 for dry run (recommended first)
# Choose option 2 to actually delete duplicates
```
## 📋 Requirements ## 📋 Requirements
- **Windows 10/11 or macOS 10.14+** - **Windows 10/11, macOS 10.14+, or Linux**
- **Python 3.7+** - **Python 3.7+**
- **yt-dlp binary** (platform-specific, see setup instructions below) - **yt-dlp binary** (platform-specific, see setup instructions below)
- **mutagen** (for ID3 tagging, optional) - **mutagen** (for ID3 tagging, optional)
- **ffmpeg/ffprobe** (for video validation, optional but recommended) - **ffmpeg/ffprobe** (for video validation, optional but recommended)
- **rapidfuzz** (for fuzzy matching, optional, falls back to difflib) - **rapidfuzz** (for fuzzy matching, optional, falls back to difflib)
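The optional-dependency fallback mentioned for `rapidfuzz` can be sketched like this. The helper names `similarity` and `is_match` are hypothetical and the project's `fuzzy_matcher.py` may differ; `rapidfuzz.fuzz.ratio` and `difflib.SequenceMatcher` are real APIs, and both are scaled here to a 0-100 score so that a threshold such as 85 applies to either. The two scorers use different algorithms, so borderline scores may not agree exactly.

```python
from difflib import SequenceMatcher

try:
    from rapidfuzz import fuzz  # optional dependency

    def similarity(a: str, b: str) -> float:
        """Score 0-100 using rapidfuzz when it is installed."""
        return fuzz.ratio(a.lower(), b.lower())

except ImportError:

    def similarity(a: str, b: str) -> float:
        """Score 0-100 using the standard library as a fallback."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100


def is_match(song: str, video_title: str, threshold: float = 85) -> bool:
    """Hypothetical helper: True when the score meets the threshold."""
    return similarity(song, video_title) >= threshold
```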
## 🍎 macOS Setup ## 🖥️ Platform Setup
### Automatic Setup (Recommended) ### Automatic Setup (Recommended)
Run the macOS setup script to automatically set up yt-dlp and FFmpeg: Run the platform setup script to automatically set up yt-dlp for your system:
```bash ```bash
python3 setup_macos.py python setup_platform.py
``` ```
This script will: This script will:
- Detect your macOS version - Detect your platform (Windows, macOS, or Linux)
- Offer installation options for yt-dlp (pip or binary download) - Offer two installation options:
- Install FFmpeg via Homebrew 1. **Download binary file** (recommended for most users)
2. **Install via pip** (alternative method)
- Make binaries executable (on Unix-like systems)
- Install FFmpeg (for optimal video processing)
- Test the installation - Test the installation
### Manual Setup ### Manual Setup
If you prefer to set up manually: If you prefer to set up manually:
#### Option 1: Install yt-dlp via pip #### Option 1: Download Binary Files
1. **Windows**: Download `yt-dlp.exe` from [yt-dlp releases](https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp.exe)
2. **macOS**: Download `yt-dlp_macos` from [yt-dlp releases](https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp_macos)
3. **Linux**: Download `yt-dlp` from [yt-dlp releases](https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp)
Place the downloaded file in the `downloader/` directory and make it executable on Unix-like systems:
```bash ```bash
pip3 install yt-dlp chmod +x downloader/yt-dlp_macos # macOS
chmod +x downloader/yt-dlp # Linux
``` ```
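Automatic platform detection could select the binary along these lines. This is a sketch under assumptions: the function name `ytdlp_binary_path` and the mapping are hypothetical, while the binary file names and the `downloader/` directory are the ones named in this README. `platform.system()` returns `"Windows"`, `"Darwin"` (macOS), or `"Linux"` on those systems.

```python
import platform
from pathlib import Path

# Binary names as listed in this README; the mapping itself is illustrative.
BINARY_BY_SYSTEM = {
    "Windows": "yt-dlp.exe",
    "Darwin": "yt-dlp_macos",
    "Linux": "yt-dlp",
}


def ytdlp_binary_path(system: str = "", base_dir: Path = Path("downloader")) -> Path:
    """Hypothetical helper: pick the yt-dlp binary for the given or current platform."""
    system = system or platform.system()
    try:
        return base_dir / BINARY_BY_SYSTEM[system]
    except KeyError:
        raise RuntimeError(f"Unsupported platform: {system}") from None
```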
#### Option 2: Download yt-dlp binary #### Option 2: Install via pip
```bash ```bash
mkdir -p downloader pip install yt-dlp
curl -L -o downloader/yt-dlp_macos https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp_macos
chmod +x downloader/yt-dlp_macos
``` ```
#### Install FFmpeg The tool will automatically detect and use the pip-installed version on macOS.
```bash
brew install ffmpeg
```
### Test Installation **Note**: FFmpeg is also required for optimal video processing. The setup script will attempt to install it automatically, or you can install it manually:
```bash - **macOS**: `brew install ffmpeg`
python3 src/tests/test_macos.py - **Linux**: `sudo apt install ffmpeg` (Ubuntu/Debian) or `sudo yum install ffmpeg` (CentOS/RHEL)
``` - **Windows**: Download from [ffmpeg.org](https://ffmpeg.org/download.html)
## 🚀 Quick Start ## 🚀 Quick Start
@ -251,21 +143,6 @@ python3 src/tests/test_macos.py
python download_karaoke.py https://www.youtube.com/@SingKingKaraoke/videos python download_karaoke.py https://www.youtube.com/@SingKingKaraoke/videos
``` ```
### Download ALL Videos from a Channel (Not Just Songlist Matches)
```bash
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos
```
### Download ALL Videos with Parallel Processing
```bash
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --parallel --workers 10
```
### Download ALL Videos with Limit
```bash
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --limit 100
```
### Download Only Songlist Songs (Fast Mode) ### Download Only Songlist Songs (Fast Mode)
```bash ```bash
python download_karaoke.py --songlist-only --limit 5 python download_karaoke.py --songlist-only --limit 5
@ -273,7 +150,7 @@ python download_karaoke.py --songlist-only --limit 5
### Download with Parallel Processing ### Download with Parallel Processing
```bash ```bash
python download_karaoke.py --parallel --songlist-only --limit 10 python download_karaoke.py --parallel --workers 5 --songlist-only --limit 10
``` ```
### Focus on Specific Playlists by Title ### Focus on Specific Playlists by Title
@ -281,31 +158,11 @@ python download_karaoke.py --parallel --songlist-only --limit 10
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100" python download_karaoke.py --songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100"
``` ```
### Focus on Specific Playlists from Custom File
```bash
python download_karaoke.py --songlist-focus "CCKaraoke" --songlist-file "data/my_custom_songlist.json"
```
### Force Download from Channels (Bypass All Existing File Checks)
```bash
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --force
```
### Download with Fuzzy Matching ### Download with Fuzzy Matching
```bash ```bash
python download_karaoke.py --songlist-only --limit 10 --fuzzy-match --fuzzy-threshold 85 python download_karaoke.py --songlist-only --limit 10 --fuzzy-match --fuzzy-threshold 85
``` ```
### Test Download Plan (Dry Run)
```bash
python download_karaoke.py --songlist-only --limit 5 --dry-run
```
### Test Channel Download Plan (Dry Run)
```bash
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --limit 10 --dry-run
```
### Download Latest N Videos Per Channel ### Download Latest N Videos Per Channel
```bash ```bash
python download_karaoke.py --latest-per-channel --limit 5 python download_karaoke.py --latest-per-channel --limit 5
@ -410,33 +267,23 @@ KaroakeVideoDownloader/
│ ├── check_resolution.py # Resolution checker utility │ ├── check_resolution.py # Resolution checker utility
│ ├── resolution_cli.py # Resolution config CLI │ ├── resolution_cli.py # Resolution config CLI
│ └── tracking_cli.py # Tracking management CLI │ └── tracking_cli.py # Tracking management CLI
├── config/ # Configuration files ├── data/ # All config, tracking, cache, and songlist files
│ └── config.json # Main configuration file │ ├── config.json
├── data/ # All tracking, cache, and songlist files
│ ├── karaoke_tracking.json │ ├── karaoke_tracking.json
│ ├── songlist_tracking.json │ ├── songlist_tracking.json
│ ├── channel_cache.json │ ├── channel_cache.json
│ ├── channels.json # Channel configuration with parsing rules │ ├── channels.txt
│ └── songList.json │ └── songList.json
├── utilities/ # Utility scripts and tools
│ ├── add_manual_video.py # Manual video management
│ ├── build_cache_from_raw.py # Cache building utility
│ ├── cleanup_duplicate_files.py # File cleanup utilities
│ ├── cleanup_recent_tracking.py # Tracking cleanup utilities
│ ├── deduplicate_songlist_tracking.py # Data deduplication
│ ├── fix_artist_name_format.py # Data cleanup utilities
│ ├── fix_artist_name_format_simple.py
│ ├── fix_code_quality.py # Development tools
│ ├── reset_and_redownload.py # Maintenance utilities
│ └── songlist_report.py # Reporting utilities
├── downloads/ # All video output ├── downloads/ # All video output
│ └── [ChannelName]/ # Per-channel folders │ └── [ChannelName]/ # Per-channel folders
├── logs/ # Download logs ├── logs/ # Download logs
├── downloader/yt-dlp.exe # yt-dlp binary (Windows) ├── downloader/yt-dlp.exe # yt-dlp binary (Windows)
├── downloader/yt-dlp_macos # yt-dlp binary (macOS) ├── downloader/yt-dlp_macos # yt-dlp binary (macOS)
├── src/tests/ # Test scripts ├── downloader/yt-dlp # yt-dlp binary (Linux)
│ ├── test_macos.py # macOS setup and functionality tests ├── setup_platform.py # Platform setup script
│ └── test_platform.py # Platform detection tests ├── test_platform.py # Platform test script
├── tests/ # Diagnostic and test scripts
│ └── test_installation.py
├── download_karaoke.py # Main entry point (thin wrapper) ├── download_karaoke.py # Main entry point (thin wrapper)
├── README.md ├── README.md
├── PRD.md ├── PRD.md
@ -453,7 +300,6 @@ KaroakeVideoDownloader/
- `--songlist-priority`: Prioritize songlist songs in download queue - `--songlist-priority`: Prioritize songlist songs in download queue
- `--songlist-only`: Download only songs from the songlist - `--songlist-only`: Download only songs from the songlist
- `--songlist-focus <PLAYLIST_TITLE1> <PLAYLIST_TITLE2>...`: Focus on specific playlists by title (e.g., `--songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100"`) - `--songlist-focus <PLAYLIST_TITLE1> <PLAYLIST_TITLE2>...`: Focus on specific playlists by title (e.g., `--songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100"`)
- `--songlist-file <FILE_PATH>`: Custom songlist file path to use with --songlist-focus (default: data/songList.json)
- `--songlist-status`: Show songlist download progress - `--songlist-status`: Show songlist download progress
- `--limit <N>`: Limit number of downloads (enables fast mode with early exit) - `--limit <N>`: Limit number of downloads (enables fast mode with early exit)
- `--resolution <720p|1080p|...>`: Override resolution - `--resolution <720p|1080p|...>`: Override resolution
@ -465,14 +311,8 @@ KaroakeVideoDownloader/
- `--latest-per-channel`: **Download the latest N videos from each channel (use with --limit)** - `--latest-per-channel`: **Download the latest N videos from each channel (use with --limit)**
- `--fuzzy-match`: Enable fuzzy matching for songlist-to-video matching (uses rapidfuzz if available) - `--fuzzy-match`: Enable fuzzy matching for songlist-to-video matching (uses rapidfuzz if available)
- `--fuzzy-threshold <N>`: Fuzzy match threshold (0-100, default 85) - `--fuzzy-threshold <N>`: Fuzzy match threshold (0-100, default 85)
- `--parallel`: Enable parallel downloads for improved speed (defaults to 3 workers) - `--parallel`: Enable parallel downloads for improved speed
- `--workers <N>`: Number of parallel download workers (1-10, default: 3, only used with --parallel) - `--workers <N>`: Number of parallel download workers (1-10, default: 3)
- `--generate-songlist <DIR1> <DIR2>...`: **Generate song list from MP4 files with ID3 tags in specified directories**
- `--no-append-songlist`: **Create a new song list instead of appending when using --generate-songlist**
- `--force`: **Force download from channels, bypassing all existing file checks and re-downloading if necessary**
- `--channel-focus <CHANNEL_NAME>`: **Download from a specific channel by name (e.g., 'SingKingKaraoke')**
- `--all-videos`: **Download all videos from channel (not just songlist matches), skipping existing files**
- `--dry-run`: **Build download plan and show what would be downloaded without actually downloading anything**
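The documented `--parallel` and `--workers` constraints (1-10 workers, default 3) could be expressed with `argparse` roughly as follows. This is only a sketch of how such flags might be declared, not the project's actual CLI definition; the `build_parser` name is hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the two parallel-download flags."""
    parser = argparse.ArgumentParser(prog="download_karaoke.py")
    parser.add_argument("--parallel", action="store_true",
                        help="Enable parallel downloads")
    # choices enforces the documented 1-10 range; default matches the docs.
    parser.add_argument("--workers", type=int, choices=range(1, 11), default=3,
                        metavar="N", help="Number of parallel workers (1-10)")
    return parser
```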
## 📝 Example Usage ## 📝 Example Usage
@ -483,61 +323,30 @@ KaroakeVideoDownloader/
python download_karaoke.py --songlist-only --limit 10 --fuzzy-match --fuzzy-threshold 85 python download_karaoke.py --songlist-only --limit 10 --fuzzy-match --fuzzy-threshold 85
# Parallel downloads for faster processing # Parallel downloads for faster processing
python download_karaoke.py --parallel --songlist-only --limit 10 python download_karaoke.py --parallel --workers 5 --songlist-only --limit 10
# Latest videos per channel with parallel downloads # Latest videos per channel with parallel downloads
python download_karaoke.py --parallel --latest-per-channel --limit 5 python download_karaoke.py --parallel --workers 3 --latest-per-channel --limit 5
# Traditional full scan (no limit) # Traditional full scan (no limit)
python download_karaoke.py --songlist-only python download_karaoke.py --songlist-only
# Focused fuzzy matching (target specific playlists with flexible matching)
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --fuzzy-match --fuzzy-threshold 80 --limit 10
# Focus on specific playlists from a custom file
python download_karaoke.py --songlist-focus "CCKaraoke" --songlist-file "data/my_custom_songlist.json" --limit 10
# Force download with fuzzy matching (bypass all existing file checks)
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --force --fuzzy-match --fuzzy-threshold 80 --limit 10
# Channel-specific operations # Channel-specific operations
python download_karaoke.py --reset-channel SingKingKaraoke python download_karaoke.py --reset-channel SingKingKaraoke
python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist
python download_karaoke.py --clear-cache all python download_karaoke.py --clear-cache all
python download_karaoke.py --clear-server-duplicates python download_karaoke.py --clear-server-duplicates
# Download ALL videos from a specific channel
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --parallel --workers 10
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --limit 100
# Song list generation from MP4 files
python download_karaoke.py --generate-songlist /path/to/mp4/directory
python download_karaoke.py --generate-songlist /path/to/dir1 /path/to/dir2 --no-append-songlist
# Generate report of songs that couldn't be found
python download_karaoke.py --generate-unmatched-report
python download_karaoke.py --generate-unmatched-report --fuzzy-match --fuzzy-threshold 85
``` ```
## 🏷️ ID3 Tagging ## 🏷️ ID3 Tagging
- Adds artist/title/album/genre to MP4 files using mutagen (if installed) - Adds artist/title/album/genre to MP4 files using mutagen (if installed)
## 📋 Song List Generation
- **Generate song lists from existing MP4 files**: Use `--generate-songlist` to create song lists from directories containing MP4 files with ID3 tags
- **Automatic ID3 extraction**: Extracts artist and title from MP4 files' ID3 tags
- **Directory-based organization**: Each directory becomes a playlist with the directory name as the title
- **Position tracking**: Songs are numbered starting from 1 based on file order
- **Append or replace**: Choose to append to existing song list or create a new one with `--no-append-songlist`
- **Multiple directories**: Process multiple directories in a single command
## 🧹 Cleanup ## 🧹 Cleanup
- Removes `.info.json` and `.meta` files after download - Removes `.info.json` and `.meta` files after download
## 🛠️ Configuration ## 🛠️ Configuration
- All options are in `config/config.json` (format, resolution, metadata, etc.) - All options are in `data/config.json` (format, resolution, metadata, etc.)
- You can edit this file or use CLI flags to override - You can edit this file or use CLI flags to override
- **Configurable Data Directory**: The data directory path can be configured in `config/config.json` under `folder_structure.data_dir` (default: "data")
## 📋 Command Reference File ## 📋 Command Reference File
@ -553,32 +362,7 @@ python download_karaoke.py --generate-unmatched-report --fuzzy-match --fuzzy-thr
> **🔄 Maintenance Note**: The `commands.txt` file should be kept up to date with any CLI changes. When adding new command-line options or modifying existing ones, update this file to reflect all available commands and their usage. > **🔄 Maintenance Note**: The `commands.txt` file should be kept up to date with any CLI changes. When adding new command-line options or modifying existing ones, update this file to reflect all available commands and their usage.
## 📚 Documentation Standards ## 🔧 Refactoring Improvements (v3.5)
### **Documentation Location**
- **All changes, refactoring, and improvements should be documented in the PRD.md and README.md files**
- **Do NOT create separate .md files for documenting changes, refactoring, or improvements**
- **Use the existing sections in PRD.md and README.md to track all project evolution**
### **Where to Document Changes**
- **PRD.md**: Technical details, architecture changes, bug fixes, and implementation specifics
- **README.md**: User-facing features, usage instructions, and high-level improvements
- **CHANGELOG.md**: Version-specific release notes and change summaries
### **Documentation Requirements**
- **All new features must be documented in both PRD.md and README.md**
- **All refactoring efforts must be documented in the appropriate sections**
- **All bug fixes must be documented with technical details**
- **Version numbers and dates should be clearly marked**
- **Benefits and improvements should be explicitly stated**
### **Maintenance Responsibility**
- **Keep PRD.md and README.md synchronized with code changes**
- **Update documentation immediately when implementing new features**
- **Remove outdated information and consolidate related changes**
- **Ensure all CLI options and features are documented in both files**
## 🔧 Refactoring Improvements (v3.3)
The codebase has been comprehensively refactored to improve maintainability and reduce code duplication. Recent improvements have enhanced reliability, performance, and code organization: The codebase has been comprehensively refactored to improve maintainability and reduce code duplication. Recent improvements have enhanced reliability, performance, and code organization:
### **New Utility Modules (v3.3)** ### **New Utility Modules (v3.3)**
@ -613,9 +397,20 @@ The codebase has been comprehensively refactored to improve maintainability and
- **Improved Testability**: Modular components can be tested independently - **Improved Testability**: Modular components can be tested independently
- **Better Developer Experience**: Clear function signatures and comprehensive documentation - **Better Developer Experience**: Clear function signatures and comprehensive documentation
### **Cross-Platform Support (v3.5)**
- **Platform detection:** Automatic detection of Windows, macOS, and Linux systems
- **Flexible yt-dlp integration:** Supports both binary files and pip-installed yt-dlp modules
- **Platform-specific configuration:** Automatic selection of appropriate yt-dlp binary/command for each platform
- **Setup automation:** `setup_platform.py` script for easy platform-specific setup
- **Command parsing:** Intelligent parsing of yt-dlp commands (file paths vs. module commands)
- **Enhanced documentation:** Platform-specific setup instructions and troubleshooting
- **Backward compatibility:** Maintains full compatibility with existing Windows installations
- **FFmpeg integration:** Automatic FFmpeg installation and configuration for optimal video processing
- **Optimized caching:** Enhanced channel video caching with format compatibility and instant video list loading
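
The platform detection and command parsing listed above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: `detect_platform` and `resolve_yt_dlp_command` are hypothetical names, and the mapping simply mirrors the `yt_dlp_paths` entries shown in `data/config.json`.

```python
import platform
import shlex

# Mirrors the yt_dlp_paths block in data/config.json (values copied for illustration).
YT_DLP_PATHS = {
    "windows": "downloader/yt-dlp.exe",
    "macos": "python3 -m yt_dlp",
    "linux": "downloader/yt-dlp",
}


def detect_platform(system_name=None):
    """Map platform.system() output to a config key (hypothetical helper)."""
    name = (system_name or platform.system()).lower()
    if name == "darwin":
        return "macos"
    if name in ("windows", "linux"):
        return name
    raise ValueError(f"Unsupported platform: {name}")


def resolve_yt_dlp_command(configured):
    """Return an argv list for either a binary path or a module command."""
    parts = shlex.split(configured)
    # "python3 -m yt_dlp" style entries are module commands and must be split.
    if len(parts) >= 3 and parts[1] == "-m":
        return parts
    # Anything else is treated as a single path, so paths with spaces stay whole.
    return [configured]
```

Returning an argv list rather than a string means the result can be passed straight to `subprocess.run` without `shell=True`.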
### **New Parallel Download System (v3.4)** ### **New Parallel Download System (v3.4)**
- **Parallel downloader module:** `parallel_downloader.py` provides thread-safe concurrent download management - **Parallel downloader module:** `parallel_downloader.py` provides thread-safe concurrent download management
- **Configurable concurrency:** Use `--parallel` to enable parallel downloads with 3 workers by default, or `--parallel --workers N` for custom worker count (1-10) - **Configurable concurrency:** Use `--parallel --workers N` to enable parallel downloads with N workers (1-10)
- **Thread-safe operations:** All tracking, caching, and progress operations are thread-safe - **Thread-safe operations:** All tracking, caching, and progress operations are thread-safe
- **Real-time progress tracking:** Shows active downloads, completion status, and overall progress - **Real-time progress tracking:** Shows active downloads, completion status, and overall progress
- **Automatic retry mechanism:** Failed downloads are automatically retried with reduced concurrency - **Automatic retry mechanism:** Failed downloads are automatically retried with reduced concurrency
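
As a rough sketch of the behaviour described in these bullets (bounded worker count, thread-safe tracking, and a retry pass with reduced concurrency), something like the following would work. The names `clamp_workers`, `download_all`, and `download_with_retry` are hypothetical and are not taken from `parallel_downloader.py`; `download_one` stands in for whatever callable performs a single download and returns a truthy value on success.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def clamp_workers(requested, low=1, high=10):
    """Keep the worker count inside the documented 1-10 range."""
    return max(low, min(high, requested))


def download_all(video_ids, download_one, workers=3):
    """Run download_one(video_id) concurrently and collect outcomes."""
    results = {"completed": [], "failed": []}
    lock = threading.Lock()

    def task(video_id):
        try:
            ok = bool(download_one(video_id))
        except Exception:
            ok = False
        # Tracking is updated from worker threads, so guard it with a lock.
        with lock:
            results["completed" if ok else "failed"].append(video_id)

    with ThreadPoolExecutor(max_workers=clamp_workers(workers)) as pool:
        list(pool.map(task, video_ids))  # consume the iterator to wait for all tasks
    return results


def download_with_retry(video_ids, download_one, workers=3):
    """Retry failed downloads once, using half as many workers."""
    first = download_all(video_ids, download_one, workers)
    if not first["failed"]:
        return first
    retry = download_all(first["failed"], download_one, max(1, workers // 2))
    return {
        "completed": first["completed"] + retry["completed"],
        "failed": retry["failed"],
    }
```

Halving the worker count on the retry pass is one plausible reading of "reduced concurrency"; the real module may choose a different policy.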
@ -639,8 +434,11 @@ The codebase has been comprehensively refactored to improve maintainability and
- **Robust download plan execution:** Fixed index management in download plan execution to prevent errors during interrupted downloads. - **Robust download plan execution:** Fixed index management in download plan execution to prevent errors during interrupted downloads.
## 🐞 Troubleshooting ## 🐞 Troubleshooting
- **Platform-specific yt-dlp setup**:
- **Windows**: Ensure `yt-dlp.exe` is in the `downloader/` folder - **Windows**: Ensure `yt-dlp.exe` is in the `downloader/` folder
- **macOS**: Run `python3 setup_macos.py` to set up yt-dlp and FFmpeg - **macOS**: Either ensure `yt-dlp_macos` is in the `downloader/` folder (make executable with `chmod +x`) OR install via pip (`pip install yt-dlp`)
- **Linux**: Ensure `yt-dlp` is in the `downloader/` folder (make executable with `chmod +x`)
- Run `python setup_platform.py` to automatically set up yt-dlp for your platform
- Check `logs/` for error details - Check `logs/` for error details
- Use `python -m karaoke_downloader.check_resolution` to verify video quality - Use `python -m karaoke_downloader.check_resolution` to verify video quality
- If you see errors about ffmpeg/ffprobe, install [ffmpeg](https://ffmpeg.org/download.html) and ensure it is in your PATH - If you see errors about ffmpeg/ffprobe, install [ffmpeg](https://ffmpeg.org/download.html) and ensure it is in your PATH


@ -1,6 +1,6 @@
# 🎤 Karaoke Video Downloader - CLI Commands Reference # 🎤 Karaoke Video Downloader - CLI Commands Reference
# Copy and paste these commands into your terminal # Copy and paste these commands into your terminal
# Updated: v3.4.4 (includes macOS support, all videos download mode, manual video collection, channel parsing rules, and all previous improvements) # Updated: v3.5 (includes cross-platform support, optimized caching, and all refactoring improvements)
## 📥 BASIC DOWNLOADS ## 📥 BASIC DOWNLOADS
@ -8,7 +8,7 @@
python download_karaoke.py https://www.youtube.com/@SingKingKaraoke/videos python download_karaoke.py https://www.youtube.com/@SingKingKaraoke/videos
# Download from a file containing multiple channel URLs # Download from a file containing multiple channel URLs
python download_karaoke.py --file data/channels.json python download_karaoke.py --file data/channels.txt
# Download with custom resolution (480p, 720p, 1080p, 1440p, 2160p) # Download with custom resolution (480p, 720p, 1080p, 1440p, 2160p)
python download_karaoke.py --resolution 1080p https://www.youtube.com/@SingKingKaraoke/videos python download_karaoke.py --resolution 1080p https://www.youtube.com/@SingKingKaraoke/videos
@ -19,69 +19,9 @@ python download_karaoke.py --limit 10 https://www.youtube.com/@SingKingKaraoke/v
# Enable parallel downloads for faster processing (3-5x speedup) # Enable parallel downloads for faster processing (3-5x speedup)
python download_karaoke.py --parallel --workers 5 --limit 10 https://www.youtube.com/@SingKingKaraoke/videos python download_karaoke.py --parallel --workers 5 --limit 10 https://www.youtube.com/@SingKingKaraoke/videos
## 🎤 MANUAL VIDEO COLLECTION (v3.4.3)
# Download from manual videos collection (data/manual_videos.json)
python download_karaoke.py --manual --limit 5
# Download manual videos with fuzzy matching
python download_karaoke.py --manual --fuzzy-match --fuzzy-threshold 85 --limit 10
# Download manual videos with parallel processing
python download_karaoke.py --parallel --workers 3 --manual --limit 5
# Download manual videos with songlist matching
python download_karaoke.py --manual --songlist-only --limit 10
# Force download from manual videos (bypass existing file checks)
python download_karaoke.py --manual --force --limit 5
# Add a video to manual collection (interactive)
python utilities/add_manual_video.py add "Artist - Song Title (Karaoke Version)" "https://www.youtube.com/watch?v=VIDEO_ID"
# List all manual videos
python utilities/add_manual_video.py list
# Remove a video from manual collection
python utilities/add_manual_video.py remove "Artist - Song Title (Karaoke Version)"
## 🎬 ALL VIDEOS DOWNLOAD MODE (v3.4.4)
# Download ALL videos from a specific channel (not just songlist matches)
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos
# Download ALL videos with parallel processing for speed
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --parallel --workers 10
# Download ALL videos with limit (download first N videos)
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --limit 100
# Download ALL videos with parallel processing and limit
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --parallel --workers 5 --limit 50
# Download ALL videos from ZoomKaraokeOfficial channel
python download_karaoke.py --channel-focus ZoomKaraokeOfficial --all-videos
# Download ALL videos with custom resolution
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --resolution 1080p
## 📋 SONG LIST GENERATION
# Generate song list from MP4 files in a directory (append to existing song list)
python download_karaoke.py --generate-songlist /path/to/mp4/directory
# Generate song list from multiple directories
python download_karaoke.py --generate-songlist /path/to/dir1 /path/to/dir2 /path/to/dir3
# Generate song list and create a new song list file (don't append)
python download_karaoke.py --generate-songlist /path/to/mp4/directory --no-append-songlist
# Generate song list from multiple directories and create new file
python download_karaoke.py --generate-songlist /path/to/dir1 /path/to/dir2 --no-append-songlist
## 🎵 SONGLIST OPERATIONS ## 🎵 SONGLIST OPERATIONS
# Download only songs from your songlist (uses data/channels.json by default) # Download only songs from your songlist (uses data/channels.txt by default)
python download_karaoke.py --songlist-only python download_karaoke.py --songlist-only
# Download only songlist songs with limit # Download only songlist songs with limit
@ -111,18 +51,6 @@ python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --limit 5
# Focus on specific playlists with parallel processing # Focus on specific playlists with parallel processing
python download_karaoke.py --parallel --workers 3 --songlist-focus "2025 - Apple Top 50" --limit 5 python download_karaoke.py --parallel --workers 3 --songlist-focus "2025 - Apple Top 50" --limit 5
# Focus on specific playlists from a custom songlist file
python download_karaoke.py --songlist-focus "CCKaraoke" --songlist-file "data/my_custom_songlist.json"
# Focus on specific playlists from a custom file with force mode
python download_karaoke.py --songlist-focus "CCKaraoke" --songlist-file "data/my_custom_songlist.json" --force
# Force download from channels regardless of existing files or server duplicates
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --force
# Force download with parallel processing
python download_karaoke.py --parallel --workers 5 --songlist-focus "2025 - Apple Top 50" --force --limit 10
# Prioritize songlist songs in download queue (default behavior) # Prioritize songlist songs in download queue (default behavior)
python download_karaoke.py --songlist-priority https://www.youtube.com/@SingKingKaraoke/videos python download_karaoke.py --songlist-priority https://www.youtube.com/@SingKingKaraoke/videos
@ -132,35 +60,6 @@ python download_karaoke.py --no-songlist-priority https://www.youtube.com/@SingK
# Show songlist download status and statistics # Show songlist download status and statistics
python download_karaoke.py --songlist-status python download_karaoke.py --songlist-status
## 📊 UNMATCHED SONGS REPORTS
# Generate report of songs that couldn't be found in any channel (standalone)
python download_karaoke.py --generate-unmatched-report
# Generate report with fuzzy matching enabled (standalone)
python download_karaoke.py --generate-unmatched-report --fuzzy-match --fuzzy-threshold 85
# Generate report using a specific channel file (standalone)
python download_karaoke.py --generate-unmatched-report --file data/my_channels.txt
# Generate report from a custom songlist file (standalone)
python download_karaoke.py --generate-unmatched-report --songlist-file "data/my_custom_songlist.json"
# Generate report with focus on specific playlists from a custom file (standalone)
python download_karaoke.py --songlist-focus "CCKaraoke" --songlist-file "data/my_custom_songlist.json" --generate-unmatched-report
# Download songs AND generate unmatched report (additive feature)
python download_karaoke.py --songlist-only --limit 10 --generate-unmatched-report
# Download with fuzzy matching AND generate unmatched report
python download_karaoke.py --songlist-only --fuzzy-match --fuzzy-threshold 85 --limit 10 --generate-unmatched-report
# Download from specific playlists AND generate unmatched report
python download_karaoke.py --songlist-focus "CCKaraoke" --limit 10 --generate-unmatched-report
# Generate report with custom fuzzy threshold
python download_karaoke.py --generate-unmatched-report --fuzzy-match --fuzzy-threshold 80
## ⚡ PARALLEL DOWNLOADS (v3.4) ## ⚡ PARALLEL DOWNLOADS (v3.4)
# Basic parallel downloads (3-5x faster than sequential) # Basic parallel downloads (3-5x faster than sequential)
@ -195,7 +94,7 @@ python download_karaoke.py --parallel --workers 3 --latest-per-channel --limit 5
python download_karaoke.py --parallel --workers 3 --latest-per-channel --limit 5 --fuzzy-match --fuzzy-threshold 85 python download_karaoke.py --parallel --workers 3 --latest-per-channel --limit 5 --fuzzy-match --fuzzy-threshold 85
# Download latest videos from specific channels file # Download latest videos from specific channels file
python download_karaoke.py --latest-per-channel --limit 5 --file data/channels.json python download_karaoke.py --latest-per-channel --limit 5 --file data/channels.txt
## 🔄 CACHE & TRACKING MANAGEMENT ## 🔄 CACHE & TRACKING MANAGEMENT
@ -254,7 +153,7 @@ python download_karaoke.py --version
python download_karaoke.py --songlist-only --limit 20 --fuzzy-match --fuzzy-threshold 85 --resolution 1080p python download_karaoke.py --songlist-only --limit 20 --fuzzy-match --fuzzy-threshold 85 --resolution 1080p
# Latest videos per channel with fuzzy matching # Latest videos per channel with fuzzy matching
python download_karaoke.py --latest-per-channel --limit 3 --fuzzy-match --fuzzy-threshold 90 --file data/channels.json python download_karaoke.py --latest-per-channel --limit 3 --fuzzy-match --fuzzy-threshold 90 --file data/channels.txt
# Force refresh everything and download songlist # Force refresh everything and download songlist
python download_karaoke.py --songlist-only --force-download-plan --refresh --limit 10 python download_karaoke.py --songlist-only --force-download-plan --refresh --limit 10
@ -273,9 +172,6 @@ python download_karaoke.py --parallel --workers 5 --songlist-only --limit 10
# 1b. Focus on specific playlists (fast targeted download) # 1b. Focus on specific playlists (fast targeted download)
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --limit 5 python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --limit 5
# 1c. Force download from specific playlists (bypass all existing file checks)
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --force --limit 5
# 2. Latest videos from all channels # 2. Latest videos from all channels
python download_karaoke.py --latest-per-channel --limit 5 python download_karaoke.py --latest-per-channel --limit 5
@ -294,9 +190,6 @@ python download_karaoke.py --parallel --workers 5 --songlist-only --fuzzy-match
# 4b. Focused fuzzy matching (target specific playlists with flexible matching) # 4b. Focused fuzzy matching (target specific playlists with flexible matching)
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --fuzzy-match --fuzzy-threshold 80 --limit 10 python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --fuzzy-match --fuzzy-threshold 80 --limit 10
# 4c. Force download with fuzzy matching (bypass all existing file checks)
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --force --fuzzy-match --fuzzy-threshold 80 --limit 10
# 5. Reset and start fresh # 5. Reset and start fresh
python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist
@ -304,38 +197,27 @@ python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist
python download_karaoke.py --status python download_karaoke.py --status
python download_karaoke.py --clear-cache all python download_karaoke.py --clear-cache all
# 7. Download from manual video collection ## 🌐 PLATFORM SETUP COMMANDS (v3.5)
python download_karaoke.py --manual --limit 5
# 7b. Fast parallel manual video download # Automatic platform setup (detects OS and installs yt-dlp + FFmpeg)
python download_karaoke.py --parallel --workers 3 --manual --limit 5 python setup_platform.py
# 7c. Manual videos with fuzzy matching # Test platform detection and yt-dlp integration
python download_karaoke.py --manual --fuzzy-match --fuzzy-threshold 85 --limit 10 python test_platform.py
## 🍎 macOS SETUP COMMANDS # Manual platform-specific setup
# Windows: Download yt-dlp.exe to downloader/ folder
# Automatic macOS setup (detects OS and installs yt-dlp + FFmpeg) # macOS: brew install ffmpeg && pip install yt-dlp
python3 setup_macos.py # Linux: sudo apt install ffmpeg && download yt-dlp to downloader/ folder
# Test macOS setup and functionality
python3 src/tests/test_macos.py
# Manual macOS setup options
# Install yt-dlp via pip
pip3 install yt-dlp
# Download yt-dlp binary for macOS
mkdir -p downloader && curl -L -o downloader/yt-dlp_macos https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp_macos && chmod +x downloader/yt-dlp_macos
# Install FFmpeg via Homebrew
brew install ffmpeg
## 🔧 TROUBLESHOOTING COMMANDS ## 🔧 TROUBLESHOOTING COMMANDS
# Check if everything is working # Check if everything is working
python download_karaoke.py --version python download_karaoke.py --version
# Test platform setup
python test_platform.py
# Force refresh everything # Force refresh everything
python download_karaoke.py --force-download-plan --refresh --clear-cache all python download_karaoke.py --force-download-plan --refresh --clear-cache all
@ -346,9 +228,7 @@ python download_karaoke.py --clear-server-duplicates
## 📝 NOTES ## 📝 NOTES
# Default files used: # Default files used:
# - data/channels.json (channel configuration with parsing rules, preferred) # - data/channels.txt (default channel list for songlist modes)
# - data/channels.json (channel configuration with parsing rules)
# - data/manual_videos.json (manual video collection)
# - data/songList.json (your prioritized song list) # - data/songList.json (your prioritized song list)
# - data/config.json (download settings) # - data/config.json (download settings)
@ -357,12 +237,11 @@ python download_karaoke.py --clear-server-duplicates
# Fuzzy threshold: 0-100 (higher = more strict matching, default 90) # Fuzzy threshold: 0-100 (higher = more strict matching, default 90)
# The system automatically: # The system automatically:
# - Uses data/channels.json for channel configuration and parsing rules # - Uses data/channels.txt if no --file specified in songlist modes
# - Caches channel data for 24 hours (configurable) # - Caches channel data for 24 hours (configurable)
# - Tracks all downloads in JSON files # - Tracks all downloads in JSON files
# - Avoids re-downloading existing files # - Avoids re-downloading existing files
# - Checks for server duplicates # - Checks for server duplicates
# - Supports manual video collection via --manual parameter
# For best performance: # For best performance:
# - Use --parallel --workers 5 for 3-5x faster downloads # - Use --parallel --workers 5 for 3-5x faster downloads
@ -370,7 +249,8 @@ python download_karaoke.py --clear-server-duplicates
# - Use --fuzzy-match for better song discovery # - Use --fuzzy-match for better song discovery
# - Use --refresh sparingly (forces re-scan) # - Use --refresh sparingly (forces re-scan)
# - Clear cache if you encounter issues # - Clear cache if you encounter issues
# - macOS users: Run `python3 setup_macos.py` for automatic setup # - Channel caching provides instant video list loading (no YouTube API calls)
# - FFmpeg integration ensures optimal video processing and merging
# Parallel download tips: # Parallel download tips:
# - Start with --workers 3 for conservative approach # - Start with --workers 3 for conservative approach

4022
data/bak_songList.json Normal file

File diff suppressed because it is too large Load Diff

164578
data/channel_cache.json Normal file

File diff suppressed because it is too large Load Diff


@ -1,19 +0,0 @@
{
"channel_id": "@LetsSingKaraoke",
"videos": [
{
"title": "Sub Urban - Cradles | Karaoke (instrumental)",
"id": "8uj7IzhdiO4"
},
{
"title": "Sia - Snowman | Karaoke (instrumental)",
"id": "ZbWHuncTgsM"
},
{
"title": "Trevor Daniel - Falling | Karaoke (Instrumental)",
"id": "nU7n2aq7f98"
}
],
"last_updated": "2025-08-05T15:59:09.280488",
"video_count": 3
}


@ -1,10 +0,0 @@
# Raw yt-dlp output for @LetsSingKaraoke
# Channel URL: https://www.youtube.com/@LetsSingKaraoke/videos
# Command: downloader/yt-dlp_macos --flat-playlist --print %(title)s|%(id)s|%(url)s --verbose https://www.youtube.com/@LetsSingKaraoke/videos
# Timestamp: 2025-08-05T15:59:09.280155
# Total lines: 3
################################################################################
1: Sub Urban - Cradles | Karaoke (instrumental)|8uj7IzhdiO4|https://www.youtube.com/watch?v=8uj7IzhdiO4
2: Sia - Snowman | Karaoke (instrumental)|ZbWHuncTgsM|https://www.youtube.com/watch?v=ZbWHuncTgsM
3: Trevor Daniel - Falling | Karaoke (Instrumental)|nU7n2aq7f98|https://www.youtube.com/watch?v=nU7n2aq7f98
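
The raw dump above uses the `%(title)s|%(id)s|%(url)s` print template, and the titles themselves can contain `|`. A parser for these lines therefore has to split from the right. The sketch below is only an illustration; `parse_raw_line` is a hypothetical helper, not a function from the repository.

```python
def parse_raw_line(line):
    """Parse one 'N: title|id|url' line from a raw yt-dlp dump; None if not a video line."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # header comments and the '#####' divider
    # Drop the "N: " line-number prefix written by the dump, if present.
    prefix, sep, rest = line.partition(": ")
    if sep and prefix.isdigit():
        line = rest
    # Split from the right: the id and url never contain '|', but titles may.
    parts = line.rsplit("|", 2)
    if len(parts) != 3:
        return None
    title, video_id, url = parts
    return {"title": title, "id": video_id, "url": url}
```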


@ -1,191 +0,0 @@
{
"channels": [
{
"name": "@SingKingKaraoke",
"url": "https://www.youtube.com/@SingKingKaraoke/videos",
"parsing_rules": {
"format": "artist_title_separator",
"separator": " - ",
"artist_first": true,
"title_cleanup": {
"remove_suffix": {
"suffixes": ["(Karaoke)", "(Karaoke Version)", "Karaoke Version"]
}
},
"examples": [
"Artist - Title (Karaoke)",
"Artist - Title (Karaoke Version)"
]
},
"description": "Standard artist - title format with karaoke suffix"
},
{
"name": "@KaraokeOnVEVO",
"url": "https://www.youtube.com/@KaraokeOnVEVO/videos",
"parsing_rules": {
"format": "artist_title_separator",
"separator": " - ",
"artist_first": true,
"title_cleanup": {
"remove_suffix": {
"suffixes": ["(Karaoke)"]
}
},
"examples": [
"George Jones - A Picture Of Me (Without You) (Karaoke)",
"Iggy Pop, Kate Pierson - Candy (Karaoke)"
]
},
"description": "Standard artist - title format with (Karaoke) suffix"
},
{
"name": "@StingrayKaraoke",
"url": "https://www.youtube.com/@StingrayKaraoke/videos",
"parsing_rules": {
"format": "artist_title_separator",
"separator": " - ",
"artist_first": true,
"title_cleanup": {
"remove_suffix": {
"suffixes": ["(Karaoke Version)"]
}
},
"playlist_indicators": [
"TOP SONGS OF",
"THE BEST",
"BEST",
"NON-STOP",
"MASHUP",
"FEAT.",
"WITH LYRICS"
],
"examples": [
"Gracie Abrams - That's So True (Karaoke Version)",
"TOP SONGS OF 2024 KARAOKE WITH LYRICS BY BILLIE EILISH, GRACIE ABRAMS & MORE"
]
},
"description": "Standard artist - title format with (Karaoke Version) suffix, also has playlist titles"
},
{
"name": "@sing2karaoke",
"url": "https://www.youtube.com/@sing2karaoke/videos",
"parsing_rules": {
"format": "artist_title_spaces",
"separator": " ",
"artist_first": true,
"title_cleanup": {
"remove_suffix": {
"suffixes": ["(Karaoke Version) Lyrics", "(Karaoke Version)", "Karaoke Version Lyrics"]
}
},
"multi_artist_separator": ", ",
"examples": [
"Lauren Spencer Smith Fingers Crossed",
"Calvin Harris, Clementine Douglas Blessings (Karaoke Version) Lyrics"
]
},
"description": "Artist and title separated by multiple spaces, supports multiple artists"
},
{
"name": "@ZoomKaraokeOfficial",
"url": "https://www.youtube.com/@ZoomKaraokeOfficial/videos",
"parsing_rules": {
"format": "artist_title_separator",
"separator": " - ",
"artist_first": true,
"title_cleanup": {
"remove_suffix": {
"suffixes": [
"(Karaoke)",
"(Karaoke Version)",
"Karaoke Version",
"- Karaoke Version from Zoom Karaoke",
"- Karaoke Version from Zoom",
"- Karaoke Version from Zoom Karaoke (Radiohead Cover)",
"- Karaoke Version from Zoom (Radiohead Cover)"
]
}
},
"examples": [
"The Mavericks - Here Comes My Baby - Karaoke Version from Zoom Karaoke"
]
},
"description": "Standard artist - title format with '- Karaoke Version from Zoom Karaoke' suffix"
},
{
"name": "@VocalStarKaraoke",
"url": "https://www.youtube.com/@VocalStarKaraoke/videos",
"parsing_rules": {
"format": "artist_title_separator",
"separator": " - ",
"artist_first": false,
"title_cleanup": {
"remove_suffix": {
"suffixes": ["KARAOKE Without Backing Vocals", "KARAOKE With Vocal Guide", "KARAOKE"]
}
},
"examples": [
"Don't Say You Love Me - Jin KARAOKE Without Backing Vocals",
"Don't Say You Love Me - Jin KARAOKE With Vocal Guide"
]
},
"description": "Title first, then dash separator, then artist with KARAOKE suffix"
},
{
"name": "@ManualVideos",
"url": "manual://static",
"manual_videos_file": "data/manual_videos.json",
"parsing_rules": {
"format": "artist_title_separator",
"separator": " - ",
"artist_first": true,
"title_cleanup": {
"remove_suffix": {
"suffixes": ["(Karaoke)", "(Karaoke Version)", "(Karaoke Version) Lyrics"]
}
}
},
"description": "Manual collection of individual karaoke videos (static, never expires)"
},
{
"name": "Let's Sing Karaoke",
"url": "https://www.youtube.com/@LetsSingKaraoke/videos",
"parsing_rules": {
"format": "artist_title_separator",
"separator": " - ",
"artist_first": true,
"title_cleanup": {
"remove_suffix": {
"suffixes": ["(Karaoke)", "(Karaoke Version)", "Karaoke Version", "(In the style of)"]
}
},
"examples": [
"Artist - Title (Karaoke)",
"Artist - Title (In the style of Other Artist)"
]
},
"artist_name_processing": true,
"description": "Let's Sing Karaoke with enhanced artist name processing"
}
],
"global_parsing_settings": {
"fallback_format": "artist_title_separator",
"fallback_separator": " - ",
"common_suffixes": [
"(Karaoke)",
"(Karaoke Version)",
"Karaoke Version",
"(Karaoke Version) Lyrics",
"Karaoke Version Lyrics"
],
"playlist_indicators": [
"TOP",
"BEST",
"MASHUP",
"FEAT.",
"WITH LYRICS",
"NON-STOP",
"PLAYLIST"
]
}
}
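
To make the `parsing_rules` structure above concrete, here is a hedged sketch of how a rule set might be applied to a video title. `parse_title` is a hypothetical function written for illustration; it only covers the separator-based format and does not attempt the `artist_title_spaces` format or the `playlist_indicators` checks.

```python
def parse_title(raw_title, rules):
    """Return (artist, song) for a separator-based rule set; artist is None if no separator."""
    title = raw_title.strip()
    suffixes = (
        rules.get("title_cleanup", {}).get("remove_suffix", {}).get("suffixes", [])
    )
    # Try longer suffixes first so the most specific one wins.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if title.lower().endswith(suffix.lower()):
            title = title[: len(title) - len(suffix)].strip()
            break
    first, sep, second = title.partition(rules.get("separator", " - "))
    if not sep:
        return None, title
    if rules.get("artist_first", True):
        artist, song = first, second
    else:
        artist, song = second, first
    return artist.strip(), song.strip()
```

With the `@VocalStarKaraoke` rules (`artist_first` set to `false`), the title `Don't Say You Love Me - Jin KARAOKE With Vocal Guide` would yield artist `Jin` and song `Don't Say You Love Me`.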

7
data/channels.txt Normal file

@ -0,0 +1,7 @@
https://www.youtube.com/@SingKingKaraoke/videos
https://www.youtube.com/@karafun/videos
https://www.youtube.com/@KaraokeOnVEVO/videos
https://www.youtube.com/@StingrayKaraoke/videos
https://www.youtube.com/@CCKaraoke/videos
https://www.youtube.com/@AtomicKaraoke/videos
https://www.youtube.com/@sing2karaoke/videos


@ -2,11 +2,7 @@ import json
from pathlib import Path from pathlib import Path
from datetime import datetime, time from datetime import datetime, time
from karaoke_downloader.data_path_manager import get_data_path_manager def cleanup_recent_tracking(tracking_path="data/songlist_tracking.json", cutoff_time_str="11:00"):
def cleanup_recent_tracking(tracking_path=None, cutoff_time_str="11:00"):
if tracking_path is None:
tracking_path = str(get_data_path_manager().get_songlist_tracking_path())
"""Remove entries from songlist_tracking.json that were added after the specified time today.""" """Remove entries from songlist_tracking.json that were added after the specified time today."""
tracking_file = Path(tracking_path) tracking_file = Path(tracking_path)
if not tracking_file.exists(): if not tracking_file.exists():


@ -19,14 +19,13 @@
"writethumbnail": false, "writethumbnail": false,
"embed_metadata": false, "embed_metadata": false,
"continuedl": true, "continuedl": true,
"nooverwrites": false, "nooverwrites": true,
"ignoreerrors": true, "ignoreerrors": true,
"no_warnings": false "no_warnings": false
}, },
"folder_structure": { "folder_structure": {
"downloads_dir": "downloads", "downloads_dir": "downloads",
"logs_dir": "logs", "logs_dir": "logs",
"data_dir": "data",
"tracking_file": "downloaded_videos.json" "tracking_file": "downloaded_videos.json"
}, },
"logging": { "logging": {
@ -39,7 +38,8 @@
"auto_detect_platform": true, "auto_detect_platform": true,
"yt_dlp_paths": { "yt_dlp_paths": {
"windows": "downloader/yt-dlp.exe", "windows": "downloader/yt-dlp.exe",
"macos": "downloader/yt-dlp_macos" "macos": "python3 -m yt_dlp",
"linux": "downloader/yt-dlp"
} }
}, },
"yt_dlp_path": "downloader/yt-dlp.exe" "yt_dlp_path": "downloader/yt-dlp.exe"



@ -1,85 +0,0 @@
{
"channel_name": "@ManualVideos",
"channel_url": "manual://static",
"description": "Manual collection of individual karaoke videos",
"videos": [
{
"title": "Nickelback - Photograph",
"url": "https://www.youtube.com/watch?v=qZXwpceqt9s",
"id": "qZXwpceqt9s",
"upload_date": "2024-01-01",
"duration": 180,
"view_count": 1000
},
{
"title": "Ed Sheeran & Beyoncé - Perfect Duet",
"url": "https://www.youtube.com/watch?v=qegLWI99Wg0",
"id": "qegLWI99Wg0",
"upload_date": "2024-01-01",
"duration": 180,
"view_count": 1000
},
{
"title": "10,000 Maniacs - More Than This",
"url": "https://www.youtube.com/watch?v=wxnuF-APJ5M",
"id": "wxnuF-APJ5M",
"upload_date": "2024-01-01",
"duration": 180,
"view_count": 1000
},
{
"title": "AC/DC - Big Balls",
"url": "https://www.youtube.com/watch?v=kiSDpVmu4Bk",
"id": "kiSDpVmu4Bk",
"upload_date": "2024-01-01",
"duration": 180,
"view_count": 1000
},
{
"title": "Jon Bon Jovi - Blaze of Glory",
"url": "https://www.youtube.com/watch?v=SzRAoDMlQY",
"id": "SzRAoDMlQY",
"upload_date": "2024-01-01",
"duration": 180,
"view_count": 1000
},
{
"title": "ZZ Top - Sharp Dressed Man",
"url": "https://www.youtube.com/watch?v=prRalwto9iY",
"id": "prRalwto9iY",
"upload_date": "2024-01-01",
"duration": 180,
"view_count": 1000
},
{
"title": "Nickelback - Photograph",
"url": "https://www.youtube.com/watch?v=qTphCTAUhUg",
"id": "qTphCTAUhUg",
"upload_date": "2024-01-01",
"duration": 180,
"view_count": 1000
},
{
"title": "Billy Joel - Shes Got A Way",
"url": "https://www.youtube.com/watch?v=DeeTFIgKuC8",
"id": "DeeTFIgKuC8",
"upload_date": "2024-01-01",
"duration": 180,
"view_count": 1000
}
],
"parsing_rules": {
"format": "artist_title_separator",
"separator": " - ",
"artist_first": true,
"title_cleanup": {
"remove_suffix": {
"suffixes": [
"(Karaoke)",
"(Karaoke Version)",
"(Karaoke Version) Lyrics"
]
}
}
}
}

File diff suppressed because it is too large


@ -1,15 +1,11 @
 import json
 from pathlib import Path
-from karaoke_downloader.data_path_manager import get_data_path_manager
 def normalize_title(title):
 normalized = title.replace("(Karaoke Version)", "").replace("(Karaoke)", "").strip()
 return " ".join(normalized.split()).lower()
-def load_songlist(songlist_path=None):
-if songlist_path is None:
-songlist_path = str(get_data_path_manager().get_songlist_path())
+def load_songlist(songlist_path="data/songList.json"):
 songlist_file = Path(songlist_path)
 if not songlist_file.exists():
 print(f"⚠️ Songlist file not found: {songlist_path}")
@ -28,18 +24,14 @ def load_songlist(songlist_path=None):
 })
 return all_songs
-def load_songlist_tracking(tracking_path=None):
-if tracking_path is None:
-tracking_path = str(get_data_path_manager().get_songlist_tracking_path())
+def load_songlist_tracking(tracking_path="data/songlist_tracking.json"):
 tracking_file = Path(tracking_path)
 if not tracking_file.exists():
 return {}
 with open(tracking_file, 'r', encoding='utf-8') as f:
 return json.load(f)
-def load_server_songs(songs_path=None):
-if songs_path is None:
-songs_path = str(get_data_path_manager().get_songs_path())
+def load_server_songs(songs_path="data/songs.json"):
 """Load the list of songs already available on the server."""
 songs_file = Path(songs_path)
 if not songs_file.exists():

File diff suppressed because it is too large

BIN
downloader/yt-dlp Normal file

Binary file not shown.


@ -9,8 +9,6 @ import json
 from datetime import datetime, timedelta
 from pathlib import Path
-from karaoke_downloader.data_path_manager import get_data_path_manager
 # Constants
 DEFAULT_CACHE_EXPIRATION_DAYS = 1
 DEFAULT_CACHE_FILENAME_LENGTH_LIMIT = 200 # Increased from 60
@ -39,7 +37,7 @@ def get_download_plan_cache_file(mode, **kwargs):
+ hashlib.md5(base.encode()).hexdigest()[:8] + hashlib.md5(base.encode()).hexdigest()[:8]
) )
return get_data_path_manager().get_path(f"{base}.json") return Path(f"data/{base}.json")
def load_cached_plan(cache_file, max_age_days=DEFAULT_CACHE_EXPIRATION_DAYS): def load_cached_plan(cache_file, max_age_days=DEFAULT_CACHE_EXPIRATION_DAYS):


@ -1,260 +0,0 @@
"""
Channel-specific parsing utilities for extracting artist and title from video titles.
This module handles the different title formats used by various karaoke channels,
providing channel-specific parsing rules to extract artist and title information
correctly for ID3 tagging and filename generation.
"""
import json
import re
from typing import Dict, List, Optional, Tuple, Any
from pathlib import Path
from karaoke_downloader.data_path_manager import get_data_path_manager
class ChannelParser:
"""Handles channel-specific parsing of video titles to extract artist and title."""
def __init__(self, channels_file: str = None):
if channels_file is None:
channels_file = str(get_data_path_manager().get_channels_json_path())
"""Initialize the parser with channel configuration."""
self.channels_file = Path(channels_file)
self.channels_config = self._load_channels_config()
def _load_channels_config(self) -> Dict[str, Any]:
"""Load the channels configuration from JSON file."""
if not self.channels_file.exists():
raise FileNotFoundError(f"Channels configuration file not found: {self.channels_file}")
with open(self.channels_file, 'r', encoding='utf-8') as f:
return json.load(f)
def get_channel_config(self, channel_name: str) -> Optional[Dict[str, Any]]:
"""Get the configuration for a specific channel."""
for channel in self.channels_config.get("channels", []):
if channel["name"] == channel_name:
return channel
return None
def extract_artist_title(self, video_title: str, channel_name: str) -> Tuple[str, str]:
"""
Extract artist and title from a video title using channel-specific parsing rules.
Args:
video_title: The full video title from YouTube
channel_name: The name of the channel (must match config)
Returns:
Tuple of (artist, title) - both may be empty strings if parsing fails
"""
channel_config = self.get_channel_config(channel_name)
if not channel_config:
# Fallback to global settings
return self._fallback_parse(video_title)
parsing_rules = channel_config.get("parsing_rules", {})
format_type = parsing_rules.get("format", "artist_title_separator")
if format_type == "artist_title_separator":
return self._parse_artist_title_separator(video_title, parsing_rules)
elif format_type == "artist_title_spaces":
return self._parse_artist_title_spaces(video_title, parsing_rules)
elif format_type == "title_artist_pipe":
return self._parse_title_artist_pipe(video_title, parsing_rules)
else:
return self._fallback_parse(video_title)
def _parse_artist_title_separator(self, video_title: str, rules: Dict[str, Any]) -> Tuple[str, str]:
"""Parse format: 'Artist - Title' or 'Title - Artist'."""
separator = rules.get("separator", " - ")
artist_first = rules.get("artist_first", True)
if separator not in video_title:
return "", video_title.strip()
parts = video_title.split(separator, 1)
if len(parts) != 2:
return "", video_title.strip()
part1, part2 = parts[0].strip(), parts[1].strip()
# Apply cleanup to both parts
part1_clean = self._cleanup_title(part1, rules.get("title_cleanup", {}))
part2_clean = self._cleanup_title(part2, rules.get("title_cleanup", {}))
if artist_first:
return part1_clean, part2_clean
else:
return part2_clean, part1_clean
def _parse_artist_title_spaces(self, video_title: str, rules: Dict[str, Any]) -> Tuple[str, str]:
"""Parse format: 'Artist Title' (multiple spaces)."""
separator = rules.get("separator", " ")
multi_artist_sep = rules.get("multi_artist_separator", ", ")
# Try multiple space patterns to handle inconsistent spacing
# Look for the LAST occurrence of multiple spaces to handle cases with commas
space_patterns = ["   ", "  ", "    "] # 3, 2, 4 spaces
for pattern in space_patterns:
if pattern in video_title:
# Split on the LAST occurrence of the pattern
last_index = video_title.rfind(pattern)
if last_index != -1:
artist_part = video_title[:last_index].strip()
title_part = video_title[last_index + len(pattern):].strip()
# Handle multiple artists (e.g., "Artist1, Artist2")
if multi_artist_sep in artist_part:
# Keep the full artist string as is
artist = artist_part
else:
artist = artist_part
title = self._cleanup_title(title_part, rules.get("title_cleanup", {}))
return artist, title
# Try dash patterns as fallback for inconsistent formatting
dash_patterns = [" - ", " – ", " -"] # Regular dash, en dash, dash without trailing space
for pattern in dash_patterns:
if pattern in video_title:
# Split on the LAST occurrence of the pattern
last_index = video_title.rfind(pattern)
if last_index != -1:
artist_part = video_title[:last_index].strip()
title_part = video_title[last_index + len(pattern):].strip()
# Handle multiple artists (e.g., "Artist1, Artist2")
if multi_artist_sep in artist_part:
# Keep the full artist string as is
artist = artist_part
else:
artist = artist_part
title = self._cleanup_title(title_part, rules.get("title_cleanup", {}))
return artist, title
# If no pattern matches, return empty artist and full title
return "", video_title.strip()
def _parse_title_artist_pipe(self, video_title: str, rules: Dict[str, Any]) -> Tuple[str, str]:
"""Parse format: 'Title | Artist'."""
separator = rules.get("separator", " | ")
if separator not in video_title:
return "", video_title.strip()
parts = video_title.split(separator, 1)
if len(parts) != 2:
return "", video_title.strip()
title_part, artist_part = parts[0].strip(), parts[1].strip()
title = self._cleanup_title(title_part, rules.get("title_cleanup", {}))
artist = self._cleanup_title(artist_part, rules.get("artist_cleanup", {}))
return artist, title
def _cleanup_title(self, text: str, cleanup_rules: Dict[str, Any]) -> str:
"""Apply cleanup rules to remove suffixes and normalize text."""
if not cleanup_rules:
return text.strip()
cleaned = text.strip()
# Handle remove_suffix rule
if "remove_suffix" in cleanup_rules:
suffixes = cleanup_rules["remove_suffix"].get("suffixes", [])
for suffix in suffixes:
if cleaned.endswith(suffix):
cleaned = cleaned[:-len(suffix)].strip()
break
return cleaned
def _fallback_parse(self, video_title: str) -> Tuple[str, str]:
"""Fallback parsing using global settings."""
global_settings = self.channels_config.get("global_parsing_settings", {})
fallback_format = global_settings.get("fallback_format", "artist_title_separator")
fallback_separator = global_settings.get("fallback_separator", " - ")
if fallback_format == "artist_title_separator":
if fallback_separator in video_title:
parts = video_title.split(fallback_separator, 1)
if len(parts) == 2:
artist = parts[0].strip()
title = parts[1].strip()
# Apply global suffix cleanup
for suffix in global_settings.get("common_suffixes", []):
if title.endswith(suffix):
title = title[:-len(suffix)].strip()
break
return artist, title
# If all else fails, return empty artist and full title
return "", video_title.strip()
def is_playlist_title(self, video_title: str, channel_name: str) -> bool:
"""Check if a video title appears to be a playlist rather than a single song."""
channel_config = self.get_channel_config(channel_name)
if not channel_config:
return self._is_playlist_by_global_rules(video_title)
parsing_rules = channel_config.get("parsing_rules", {})
playlist_indicators = parsing_rules.get("playlist_indicators", [])
if not playlist_indicators:
return self._is_playlist_by_global_rules(video_title)
title_upper = video_title.upper()
for indicator in playlist_indicators:
if indicator.upper() in title_upper:
return True
return False
def _is_playlist_by_global_rules(self, video_title: str) -> bool:
"""Check if title is a playlist using global rules."""
global_settings = self.channels_config.get("global_parsing_settings", {})
playlist_indicators = global_settings.get("playlist_indicators", [])
title_upper = video_title.upper()
for indicator in playlist_indicators:
if indicator.upper() in title_upper:
return True
return False
def get_all_channel_names(self) -> List[str]:
"""Get a list of all configured channel names."""
return [channel["name"] for channel in self.channels_config.get("channels", [])]
def get_channel_url(self, channel_name: str) -> Optional[str]:
"""Get the URL for a specific channel."""
channel_config = self.get_channel_config(channel_name)
return channel_config.get("url") if channel_config else None
# Convenience function for backward compatibility
def extract_artist_title(video_title: str, channel_name: str, channels_file: str = None) -> Tuple[str, str]:
if channels_file is None:
channels_file = str(get_data_path_manager().get_channels_json_path())
"""
Convenience function to extract artist and title from a video title.
Args:
video_title: The full video title from YouTube
channel_name: The name of the channel
channels_file: Path to the channels configuration file
Returns:
Tuple of (artist, title)
"""
parser = ChannelParser(channels_file)
return parser.extract_artist_title(video_title, channel_name)


@ -1,117 +1,27 @@
#!/usr/bin/env python3
"""
Karaoke Video Downloader CLI
Command-line interface for the karaoke video downloader.
"""
import argparse import argparse
import os import os
import sys import sys
from pathlib import Path
from typing import List
from karaoke_downloader.channel_parser import ChannelParser from pathlib import Path
from karaoke_downloader.config_manager import AppConfig
from karaoke_downloader.data_path_manager import get_data_path_manager
from karaoke_downloader.downloader import KaraokeDownloader from karaoke_downloader.downloader import KaraokeDownloader
# Constants # Constants
DEFAULT_LATEST_PER_CHANNEL_LIMIT = 10
DEFAULT_FUZZY_THRESHOLD = 85 DEFAULT_FUZZY_THRESHOLD = 85
DEFAULT_LATEST_PER_CHANNEL_LIMIT = 5
DEFAULT_DISPLAY_LIMIT = 10
def load_channels_from_json(channels_file: str = None) -> List[str]: DEFAULT_CACHE_DURATION_HOURS = 24
"""
Load channel URLs from the new JSON format.
Args:
channels_file: Path to the channels.json file (if None, uses default from config)
Returns:
List of channel URLs
"""
if channels_file is None:
channels_file = str(get_data_path_manager().get_channels_json_path())
try:
parser = ChannelParser(channels_file)
channels = parser.channels_config.get("channels", [])
return [channel["url"] for channel in channels]
except Exception as e:
print(f"❌ Error loading channels from {channels_file}: {e}")
return []
def load_channels_from_text(channels_file: str = None) -> List[str]:
"""
Load channel URLs from the old text format (for backward compatibility).
Args:
channels_file: Path to the channels.txt file (if None, uses default from config)
Returns:
List of channel URLs
"""
if channels_file is None:
channels_file = str(get_data_path_manager().get_channels_txt_path())
try:
with open(channels_file, "r", encoding="utf-8") as f:
return [
line.strip()
for line in f
if line.strip() and not line.strip().startswith("#")
]
except Exception as e:
print(f"❌ Error loading channels from {channels_file}: {e}")
return []
def load_channels(channel_file: str = None) -> List[str]:
"""Load channel URLs from file."""
if channel_file is None:
# Use JSON configuration
data_path_manager = get_data_path_manager()
if data_path_manager.file_exists("channels.json"):
return load_channels_from_json()
else:
return []
else:
if channel_file.endswith(".json"):
return load_channels_from_json(channel_file)
else:
return load_channels_from_text(channel_file)
def get_channel_url_by_name(channel_name: str) -> str:
"""Look up a channel URL by its name from the channels configuration."""
channel_urls = load_channels()
# Normalize the channel name for comparison
normalized_name = channel_name.lower().replace("@", "").replace("karaoke", "").strip()
for url in channel_urls:
# Extract channel name from URL
if "/@" in url:
url_channel_name = url.split("/@")[1].split("/")[0].lower()
if url_channel_name == normalized_name or url_channel_name.replace("karaoke", "").strip() == normalized_name:
return url
return None
def main(): def main():
parser = argparse.ArgumentParser( parser = argparse.ArgumentParser(
description="Karaoke Video Downloader - Download YouTube playlists and channel videos for karaoke (default: downloads latest videos from all channels)", description="Karaoke Video Downloader - Download YouTube playlists and channel videos for karaoke",
formatter_class=argparse.RawDescriptionHelpFormatter, formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=""" epilog="""
Examples: Examples:
python download_karaoke.py --limit 10 # Download latest 10 videos from all channels python download_karaoke.py https://www.youtube.com/playlist?list=XYZ
python download_karaoke.py --songlist-only --limit 10 # Download only songlist songs across channels python download_karaoke.py https://www.youtube.com/@SingKingKaraoke/videos
python download_karaoke.py --channel-focus SingKingKaraoke --limit 5 # Download from specific channel python download_karaoke.py --file data/channels.txt
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos # Download ALL videos from channel
python download_karaoke.py https://www.youtube.com/@SingKingKaraoke/videos # Download from specific channel URL
python download_karaoke.py --file data/channels.txt # Download from custom channel list
python download_karaoke.py --reset-channel SingKingKaraoke --delete-files python download_karaoke.py --reset-channel SingKingKaraoke --delete-files
""", """,
) )
@ -182,34 +92,13 @@ Examples:
parser.add_argument( parser.add_argument(
"--songlist-priority", "--songlist-priority",
action="store_true", action="store_true",
help="Prioritize downloads based on songList.json in the data directory (default: enabled)", help="Prioritize downloads based on data/songList.json (default: enabled)",
) )
parser.add_argument( parser.add_argument(
"--no-songlist-priority", "--no-songlist-priority",
action="store_true", action="store_true",
help="Disable songlist prioritization", help="Disable songlist prioritization",
) )
parser.add_argument(
"--generate-unmatched-report",
action="store_true",
help="Generate a report of songs that couldn't be found in any channel (runs after downloads)",
)
parser.add_argument(
"--show-pagination",
action="store_true",
help="Show page-by-page progress when downloading channel video lists (slower but more detailed)",
)
parser.add_argument(
"--parallel-channels",
action="store_true",
help="Enable parallel channel scanning for faster channel processing (scans multiple channels simultaneously)",
)
parser.add_argument(
"--channel-workers",
type=int,
default=3,
help="Number of parallel channel scanning workers (default: 3, max: 10)",
)
parser.add_argument( parser.add_argument(
"--songlist-only", "--songlist-only",
action="store_true", action="store_true",
@ -221,16 +110,6 @@ Examples:
metavar="PLAYLIST_TITLE", metavar="PLAYLIST_TITLE",
help='Focus on specific playlists by title (e.g., --songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100")', help='Focus on specific playlists by title (e.g., --songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100")',
) )
parser.add_argument(
"--songlist-file",
metavar="FILE_PATH",
help="Custom songlist file path to use with --songlist-focus (default: songList.json in the data directory)",
)
parser.add_argument(
"--force",
action="store_true",
help="Force download from channels regardless of whether songs are already downloaded, on server, or marked as duplicates",
)
parser.add_argument( parser.add_argument(
"--songlist-status", "--songlist-status",
action="store_true", action="store_true",
@ -267,7 +146,7 @@ Examples:
parser.add_argument( parser.add_argument(
"--latest-per-channel", "--latest-per-channel",
action="store_true", action="store_true",
help="Download the latest N videos from each channel (use with --limit) [DEPRECATED: This is now the default behavior]", help="Download the latest N videos from each channel (use with --limit)",
) )
parser.add_argument( parser.add_argument(
"--fuzzy-match", "--fuzzy-match",
@ -277,50 +156,19 @@ Examples:
parser.add_argument( parser.add_argument(
"--fuzzy-threshold", "--fuzzy-threshold",
type=int, type=int,
default=DEFAULT_FUZZY_THRESHOLD, default=90,
help=f"Fuzzy match threshold (0-100, default {DEFAULT_FUZZY_THRESHOLD})", help="Fuzzy match threshold (0-100, default 90)",
) )
parser.add_argument( parser.add_argument(
"--parallel", "--parallel",
action="store_true", action="store_true",
help="Enable parallel downloads for improved speed (3-5x faster for large batches, defaults to 3 workers)", help="Enable parallel downloads for improved speed",
) )
parser.add_argument( parser.add_argument(
"--workers", "--workers",
type=int, type=int,
default=3, default=3,
help="Number of parallel download workers (default: 3, max: 10, only used with --parallel)", help="Number of parallel download workers (default: 3, max: 10)",
)
parser.add_argument(
"--generate-songlist",
nargs="+",
metavar="DIRECTORY",
help="Generate song list from MP4 files with ID3 tags in specified directories",
)
parser.add_argument(
"--no-append-songlist",
action="store_true",
help="Create a new song list instead of appending when using --generate-songlist",
)
parser.add_argument(
"--manual",
action="store_true",
help="Download from manual videos collection (manual_videos.json in the data directory)",
)
parser.add_argument(
"--channel-focus",
type=str,
help="Download from a specific channel by name (e.g., 'SingKingKaraoke')",
)
parser.add_argument(
"--all-videos",
action="store_true",
help="Download all videos from channel (not just songlist matches), skipping existing files",
)
parser.add_argument(
"--dry-run",
action="store_true",
help="Build download plan and show what would be downloaded without actually downloading anything",
) )
args = parser.parse_args() args = parser.parse_args()
@ -329,11 +177,6 @@ Examples:
print("❌ Error: --workers must be between 1 and 10") print("❌ Error: --workers must be between 1 and 10")
sys.exit(1) sys.exit(1)
# Validate channel workers argument
if args.channel_workers < 1 or args.channel_workers > 10:
print("❌ Error: --channel-workers must be between 1 and 10")
sys.exit(1)
# Load configuration to get platform-aware yt-dlp path # Load configuration to get platform-aware yt-dlp path
from karaoke_downloader.config_manager import load_config from karaoke_downloader.config_manager import load_config
config = load_config() config = load_config()
@ -344,12 +187,13 @ Examples:
 # It's a command string, test if it works
 try:
 import subprocess
-cmd = yt_dlp_path.split() + ["--version"]
+from karaoke_downloader.youtube_utils import _parse_yt_dlp_command
+cmd = _parse_yt_dlp_command(yt_dlp_path) + ["--version"]
 result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
 if result.returncode != 0:
 raise Exception(f"Command failed: {result.stderr}")
 except Exception as e:
-platform_name = "macOS" if sys.platform == "darwin" else "Windows"
+platform_name = "macOS" if sys.platform == "darwin" else "Windows" if sys.platform == "win32" else "Linux"
 print(f"❌ Error: yt-dlp command failed: {yt_dlp_path}")
 print(f"Please ensure yt-dlp is properly installed for {platform_name}")
 print(f"Error: {e}")
@ -358,7 +202,7 @ Examples:
 # It's a file path, check if it exists
 yt_dlp_file = Path(yt_dlp_path)
 if not yt_dlp_file.exists():
-platform_name = "macOS" if sys.platform == "darwin" else "Windows"
+platform_name = "macOS" if sys.platform == "darwin" else "Windows" if sys.platform == "win32" else "Linux"
 binary_name = yt_dlp_file.name
 print(f"❌ Error: {binary_name} not found in downloader/ directory")
 print(f"Please ensure {binary_name} is present in the downloader/ folder for {platform_name}")
@ -392,19 +236,9 @@ Examples:
if args.songlist_focus: if args.songlist_focus:
downloader.songlist_focus_titles = args.songlist_focus downloader.songlist_focus_titles = args.songlist_focus
downloader.songlist_only = True # Enable songlist-only mode when focusing downloader.songlist_only = True # Enable songlist-only mode when focusing
args.songlist_only = True # Also set the args flag to ensure CLI logic works
print( print(
f"🎯 Songlist focus mode enabled for playlists: {', '.join(args.songlist_focus)}" f"🎯 Songlist focus mode enabled for playlists: {', '.join(args.songlist_focus)}"
) )
if args.songlist_file:
downloader.songlist_file_path = args.songlist_file
print(f"📁 Using custom songlist file: {args.songlist_file}")
if args.force:
downloader.force_download = True
print("💪 Force mode enabled - will download regardless of existing files or server duplicates")
if args.dry_run:
downloader.dry_run = True
print("🔍 Dry run mode enabled - will show download plan without downloading")
if args.resolution != "720p": if args.resolution != "720p":
downloader.config_manager.update_resolution(args.resolution) downloader.config_manager.update_resolution(args.resolution)
@ -418,16 +252,17 @@ Examples:
sys.exit(0) sys.exit(0)
# --- END NEW --- # --- END NEW ---
# --- NEW: If no URL or file is provided, but --songlist-only is set, use all channels --- # --- NEW: If no URL or file is provided, but --songlist-only is set, use all channels in data/channels.txt ---
if (args.songlist_only or args.songlist_focus) and not args.url and not args.file: if args.songlist_only and not args.url and not args.file:
channel_urls = load_channels() channels_file = Path("data/channels.txt")
if channel_urls: if channels_file.exists():
args.file = str(channels_file)
print( print(
"📋 No URL or --file provided, defaulting to all configured channels for songlist mode." "📋 No URL or --file provided, defaulting to all channels in data/channels.txt for songlist-only mode."
) )
else: else:
print( print(
"❌ No URL, --file, or channel configuration found. Please provide a channel URL or create channels.json in the data directory." "❌ No URL, --file, or data/channels.txt found. Please provide a channel URL or a file with channel URLs."
) )
sys.exit(1) sys.exit(1)
# --- END NEW --- # --- END NEW ---
@ -447,22 +282,6 @@ Examples:
print(" Songs will be re-checked against the server on next run.") print(" Songs will be re-checked against the server on next run.")
sys.exit(0) sys.exit(0)
if args.generate_songlist:
from karaoke_downloader.songlist_generator import SongListGenerator
print("🎵 Generating song list from MP4 files with ID3 tags...")
generator = SongListGenerator()
try:
generator.generate_songlist_from_multiple_directories(
args.generate_songlist,
append=not args.no_append_songlist
)
print("✅ Song list generation completed successfully!")
except Exception as e:
print(f"❌ Error generating song list: {e}")
sys.exit(1)
sys.exit(0)
if args.status: if args.status:
stats = downloader.tracker.get_statistics() stats = downloader.tracker.get_statistics()
print("🎤 Karaoke Downloader Status") print("🎤 Karaoke Downloader Status")
@ -480,10 +299,9 @@ Examples:
print("💾 Channel Cache Information") print("💾 Channel Cache Information")
print("=" * 40) print("=" * 40)
print(f"Total Channels: {cache_info['total_channels']}") print(f"Total Channels: {cache_info['total_channels']}")
print(f"Total Cached Videos: {cache_info['total_videos']}") print(f"Total Cached Videos: {cache_info['total_cached_videos']}")
print("\n📋 Channel Details:") print(f"Cache Duration: {cache_info['cache_duration_hours']} hours")
for channel in cache_info['channels']: print(f"Last Updated: {cache_info['last_updated']}")
print(f"{channel['channel']}: {channel['videos']} videos (updated: {channel['last_updated']})")
sys.exit(0) sys.exit(0)
elif args.clear_cache: elif args.clear_cache:
if args.clear_cache == "all": if args.clear_cache == "all":
@ -523,243 +341,71 @@ Examples:
if len(tracking) > 10: if len(tracking) > 10:
print(f" ... and {len(tracking) - 10} more") print(f" ... and {len(tracking) - 10} more")
sys.exit(0) sys.exit(0)
elif args.manual:
# Download from manual videos collection
print("🎤 Downloading from manual videos collection...")
success = downloader.download_channel_videos(
"manual://static",
force_refresh=args.refresh,
fuzzy_match=args.fuzzy_match,
fuzzy_threshold=args.fuzzy_threshold,
force_download=args.force,
)
elif args.channel_focus:
# Download from a specific channel by name
print(f"🎤 Looking up channel: {args.channel_focus}")
channel_url = get_channel_url_by_name(args.channel_focus)
if not channel_url:
print(f"❌ Channel '{args.channel_focus}' not found in configuration")
print("Available channels:")
channel_urls = load_channels()
for url in channel_urls:
if "/@" in url:
channel_name = url.split("/@")[1].split("/")[0]
print(f"{channel_name}")
sys.exit(1)
if args.all_videos:
# Download ALL videos from the channel (not just songlist matches)
print(f"🎤 Downloading ALL videos from channel: {args.channel_focus} ({channel_url})")
success = downloader.download_all_channel_videos(
channel_url,
force_refresh=args.refresh,
force_download=args.force,
limit=args.limit,
dry_run=args.dry_run,
)
else:
# Download only songlist matches from the channel
print(f"🎤 Downloading from channel: {args.channel_focus} ({channel_url})")
success = downloader.download_channel_videos(
channel_url,
force_refresh=args.refresh,
fuzzy_match=args.fuzzy_match,
fuzzy_threshold=args.fuzzy_threshold,
force_download=args.force,
dry_run=args.dry_run,
)
elif args.songlist_only or args.songlist_focus: elif args.songlist_only or args.songlist_focus:
# Use provided file or default to channels configuration # Use provided file or default to data/channels.txt
channel_urls = load_channels(args.file) channel_file = args.file if args.file else "data/channels.txt"
if not channel_urls: if not os.path.exists(channel_file):
print(f"No channels found in configuration") print(f"❌ Channel file not found: {channel_file}")
sys.exit(1) sys.exit(1)
limit = args.limit if args.limit else None with open(channel_file, "r", encoding="utf-8") as f:
success = downloader.download_songlist_across_channels(
channel_urls,
limit=args.limit,
force_refresh_download_plan=args.force_download_plan if hasattr(args, "force_download_plan") else False,
fuzzy_match=args.fuzzy_match,
fuzzy_threshold=args.fuzzy_threshold,
force_download=args.force,
show_pagination=args.show_pagination,
parallel_channels=args.parallel_channels,
max_channel_workers=args.channel_workers,
dry_run=args.dry_run,
)
elif args.latest_per_channel:
# Use provided file or default to channels configuration
channel_urls = load_channels(args.file)
if not channel_urls:
print(f"❌ No channels found in configuration")
sys.exit(1)
limit = args.limit if args.limit else DEFAULT_LATEST_PER_CHANNEL_LIMIT
force_refresh_download_plan = (
args.force_download_plan if hasattr(args, "force_download_plan") else False
)
fuzzy_match = args.fuzzy_match if hasattr(args, "fuzzy_match") else False
fuzzy_threshold = (
args.fuzzy_threshold
if hasattr(args, "fuzzy_threshold")
else DEFAULT_FUZZY_THRESHOLD
)
success = downloader.download_latest_per_channel(
channel_urls,
limit=limit,
force_refresh_download_plan=force_refresh_download_plan,
fuzzy_match=fuzzy_match,
fuzzy_threshold=fuzzy_threshold,
force_download=args.force,
dry_run=args.dry_run,
)
elif args.url:
success = downloader.download_channel_videos(
args.url, force_refresh=args.refresh, dry_run=args.dry_run
)
else:
# Default behavior: download from channels (equivalent to --latest-per-channel)
print("🎯 No specific mode specified, defaulting to download from channels")
channel_urls = load_channels(args.file)
if not channel_urls:
print(f"❌ No channels found in configuration")
print("Please provide a channel URL or create channels.json in the data directory")
sys.exit(1)
limit = args.limit if args.limit else DEFAULT_LATEST_PER_CHANNEL_LIMIT
force_refresh_download_plan = (
args.force_download_plan if hasattr(args, "force_download_plan") else False
)
fuzzy_match = args.fuzzy_match if hasattr(args, "fuzzy_match") else False
fuzzy_threshold = (
args.fuzzy_threshold
if hasattr(args, "fuzzy_threshold")
else DEFAULT_FUZZY_THRESHOLD
)
success = downloader.download_latest_per_channel(
channel_urls,
limit=limit,
force_refresh_download_plan=force_refresh_download_plan,
fuzzy_match=fuzzy_match,
fuzzy_threshold=fuzzy_threshold,
force_download=args.force,
dry_run=args.dry_run,
)
# Generate unmatched report if requested (additive feature)
if args.generate_unmatched_report:
from karaoke_downloader.download_planner import generate_unmatched_report, build_download_plan
from karaoke_downloader.songlist_manager import load_songlist
print("\n🔍 Generating unmatched songs report...")
# Load songlist based on focus mode
if args.songlist_focus:
# Load focused playlists
songlist_file_path = args.songlist_file if args.songlist_file else str(get_data_path_manager().get_songlist_path())
songlist_file = Path(songlist_file_path)
if not songlist_file.exists():
print(f"⚠️ Songlist file not found: {songlist_file_path}")
else:
try:
with open(songlist_file, "r", encoding="utf-8") as f:
raw_data = json.load(f)
# Filter playlists by title
focused_playlists = []
for playlist in raw_data:
playlist_title = playlist.get("title", "")
if playlist_title in args.songlist_focus:
focused_playlists.append(playlist)
if focused_playlists:
# Flatten the focused playlists into songs
focused_songs = []
seen = set()
for playlist in focused_playlists:
if "songs" in playlist:
for song in playlist["songs"]:
if "artist" in song and "title" in song:
artist = song["artist"].strip()
title = song["title"].strip()
key = f"{artist.lower()}_{title.lower()}"
if key in seen:
continue
seen.add(key)
focused_songs.append(
{
"artist": artist,
"title": title,
"position": song.get("position", 0),
}
)
songlist = focused_songs
else:
print(f"⚠️ No playlists found matching: {', '.join(args.songlist_focus)}")
songlist = []
except (json.JSONDecodeError, FileNotFoundError) as e:
print(f"⚠️ Could not load songlist for report: {e}")
songlist = []
else:
# Load all songs from songlist
songlist_path = args.songlist_file if args.songlist_file else str(get_data_path_manager().get_songlist_path())
songlist = load_songlist(songlist_path)
if songlist:
# Load channel URLs
channel_file = args.file if args.file else str(get_data_path_manager().get_channels_txt_path())
if os.path.exists(channel_file):
with open(channel_file, "r", encoding='utf-8') as f:
channel_urls = [ channel_urls = [
line.strip() line.strip()
for line in f for line in f
if line.strip() and not line.strip().startswith("#") if line.strip() and not line.strip().startswith("#")
] ]
limit = args.limit if args.limit else None
print(f"📋 Analyzing {len(songlist)} songs against {len(channel_urls)} channels...") force_refresh_download_plan = (
args.force_download_plan if hasattr(args, "force_download_plan") else False
# Build download plan to get unmatched songs )
fuzzy_match = args.fuzzy_match if hasattr(args, "fuzzy_match") else False fuzzy_match = args.fuzzy_match if hasattr(args, "fuzzy_match") else False
fuzzy_threshold = ( fuzzy_threshold = (
args.fuzzy_threshold args.fuzzy_threshold
if hasattr(args, "fuzzy_threshold") if hasattr(args, "fuzzy_threshold")
else DEFAULT_FUZZY_THRESHOLD else DEFAULT_FUZZY_THRESHOLD
) )
success = downloader.download_songlist_across_channels(
try:
download_plan, unmatched = build_download_plan(
channel_urls, channel_urls,
songlist, limit=limit,
downloader.tracker, force_refresh_download_plan=force_refresh_download_plan,
downloader.yt_dlp_path,
fuzzy_match=fuzzy_match, fuzzy_match=fuzzy_match,
fuzzy_threshold=fuzzy_threshold, fuzzy_threshold=fuzzy_threshold,
) )
elif args.latest_per_channel:
if unmatched: # Use provided file or default to data/channels.txt
report_file = generate_unmatched_report(unmatched) channel_file = args.file if args.file else "data/channels.txt"
print(f"\n📋 Unmatched songs report generated successfully!") if not os.path.exists(channel_file):
print(f"📁 Report saved to: {report_file}")
print(f"📊 Summary: {len(download_plan)} songs found, {len(unmatched)} songs not found")
print(f"\n🔍 First 10 unmatched songs:")
for i, song in enumerate(unmatched[:10], 1):
print(f" {i:2d}. {song['artist']} - {song['title']}")
if len(unmatched) > 10:
print(f" ... and {len(unmatched) - 10} more songs")
else:
print(f"\n✅ All {len(songlist)} songs were found in the channels!")
except Exception as e:
print(f"❌ Error generating report: {e}")
else:
print(f"❌ Channel file not found: {channel_file}") print(f"❌ Channel file not found: {channel_file}")
sys.exit(1)
with open(channel_file, "r", encoding="utf-8") as f:
channel_urls = [
line.strip()
for line in f
if line.strip() and not line.strip().startswith("#")
]
limit = args.limit if args.limit else DEFAULT_LATEST_PER_CHANNEL_LIMIT
force_refresh_download_plan = (
args.force_download_plan if hasattr(args, "force_download_plan") else False
)
fuzzy_match = args.fuzzy_match if hasattr(args, "fuzzy_match") else False
fuzzy_threshold = (
args.fuzzy_threshold
if hasattr(args, "fuzzy_threshold")
else DEFAULT_FUZZY_THRESHOLD
)
success = downloader.download_latest_per_channel(
channel_urls,
limit=limit,
force_refresh_download_plan=force_refresh_download_plan,
fuzzy_match=fuzzy_match,
fuzzy_threshold=fuzzy_threshold,
)
elif args.url:
success = downloader.download_channel_videos(
args.url, force_refresh=args.refresh
)
else: else:
print("❌ No songlist available for report generation") parser.print_help()
sys.exit(1)
# Initialize success variable
success = False
downloader.tracker.force_save() downloader.tracker.force_save()
if success: if success:
print("\n🎤 All downloads completed successfully!") print("\n🎤 All downloads completed successfully!")
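Note on the focused-songlist branch above: the inline loop that filters playlists by title and flattens them into a de-duplicated song list can be read as one small helper. The following is a minimal sketch, not code from this commit; the function name `flatten_focused_playlists` is hypothetical, and the input shape (a list of playlists with `title` and `songs` keys) is assumed from the loop shown in the diff.

```python
from typing import Any, Dict, Iterable, List


def flatten_focused_playlists(
    playlists: Iterable[Dict[str, Any]], focus_titles: Iterable[str]
) -> List[Dict[str, Any]]:
    """Return de-duplicated songs from playlists whose title is in focus_titles.

    Hypothetical helper mirroring the inline loop in the diff above.
    """
    wanted = set(focus_titles)
    seen = set()
    songs: List[Dict[str, Any]] = []
    for playlist in playlists:
        if playlist.get("title", "") not in wanted:
            continue
        for song in playlist.get("songs", []):
            # Skip entries missing either field, as the original loop does.
            if "artist" not in song or "title" not in song:
                continue
            artist = song["artist"].strip()
            title = song["title"].strip()
            # Case-insensitive key so "Queen" and "queen" count as one song.
            key = f"{artist.lower()}_{title.lower()}"
            if key in seen:
                continue
            seen.add(key)
            songs.append(
                {"artist": artist, "title": title, "position": song.get("position", 0)}
            )
    return songs
```

Pulling the loop into a function like this would also let both the download path and the report path share one implementation, though whether that fits the project's layout is a judgment call for the maintainers.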


@ -36,7 +36,6 @@ DEFAULT_CONFIG = {
"folder_structure": { "folder_structure": {
"downloads_dir": "downloads", "downloads_dir": "downloads",
"logs_dir": "logs", "logs_dir": "logs",
"data_dir": "data",
"tracking_file": "data/karaoke_tracking.json", "tracking_file": "data/karaoke_tracking.json",
}, },
"logging": { "logging": {
@ -49,8 +48,9 @@ DEFAULT_CONFIG = {
"auto_detect_platform": True, "auto_detect_platform": True,
"yt_dlp_paths": { "yt_dlp_paths": {
"windows": "downloader/yt-dlp.exe", "windows": "downloader/yt-dlp.exe",
"macos": "downloader/yt-dlp_macos" "macos": "downloader/yt-dlp_macos",
} "linux": "downloader/yt-dlp",
},
}, },
"yt_dlp_path": "downloader/yt-dlp.exe", "yt_dlp_path": "downloader/yt-dlp.exe",
} }
@ -66,20 +66,23 @@ RESOLUTION_MAP = {
def detect_platform() -> str: def detect_platform() -> str:
"""Detect the current platform and return platform name.""" """Detect the current platform and return the appropriate platform key."""
system = platform.system().lower() system = platform.system().lower()
if system == "windows": if system == "windows":
return "windows" return "windows"
elif system == "darwin": elif system == "darwin":
return "macos" return "macos"
elif system == "linux":
return "linux"
else: else:
return "windows" # Default to Windows for other platforms # Default to windows for unknown platforms
return "windows"
def get_platform_yt_dlp_path(platform_paths: Dict[str, str]) -> str: def get_platform_yt_dlp_path(platform_paths: Dict[str, str]) -> str:
"""Get the appropriate yt-dlp path for the current platform.""" """Get the appropriate yt-dlp path for the current platform."""
platform_name = detect_platform() platform_key = detect_platform()
return platform_paths.get(platform_name, platform_paths.get("windows", "downloader/yt-dlp.exe")) return platform_paths.get(platform_key, platform_paths.get("windows", "downloader/yt-dlp.exe"))
@dataclass @dataclass
@ -136,7 +139,6 @@ class FolderStructure:
downloads_dir: str = "downloads" downloads_dir: str = "downloads"
logs_dir: str = "logs" logs_dir: str = "logs"
data_dir: str = "data"
tracking_file: str = "data/karaoke_tracking.json" tracking_file: str = "data/karaoke_tracking.json"
@ -167,21 +169,14 @@ class ConfigManager:
Manages application configuration with loading, validation, and caching. Manages application configuration with loading, validation, and caching.
""" """
def __init__(self, config_file: Union[str, Path] = "config/config.json", data_dir: Optional[str] = None): def __init__(self, config_file: Union[str, Path] = "data/config.json"):
""" """
Initialize the configuration manager. Initialize the configuration manager.
Args: Args:
config_file: Path to the configuration file config_file: Path to the configuration file
data_dir: Optional custom data directory path
""" """
# If config_file is relative and data_dir is provided, make it relative to data_dir
if data_dir and not Path(config_file).is_absolute():
self.config_file = Path(data_dir) / config_file
else:
self.config_file = Path(config_file) self.config_file = Path(config_file)
self._data_dir = data_dir
self._config: Optional[AppConfig] = None self._config: Optional[AppConfig] = None
self._last_modified: Optional[datetime] = None self._last_modified: Optional[datetime] = None
@ -342,35 +337,27 @@ class ConfigManager:
_config_manager: Optional[ConfigManager] = None _config_manager: Optional[ConfigManager] = None
def get_config_manager(config_file: Optional[Union[str, Path]] = None, data_dir: Optional[str] = None) -> ConfigManager: def get_config_manager() -> ConfigManager:
""" """
Get the global configuration manager instance. Get the global configuration manager instance.
Args:
config_file: Optional path to config file (default: "config.json" in root)
data_dir: Optional custom data directory path
Returns: Returns:
ConfigManager instance ConfigManager instance
""" """
global _config_manager global _config_manager
if _config_manager is None or config_file is not None or data_dir is not None: if _config_manager is None:
if config_file is None: _config_manager = ConfigManager()
config_file = "config/config.json"
_config_manager = ConfigManager(config_file, data_dir)
return _config_manager return _config_manager
def load_config(force_reload: bool = False, config_file: Optional[Union[str, Path]] = None, data_dir: Optional[str] = None) -> AppConfig: def load_config(force_reload: bool = False) -> AppConfig:
""" """
Load configuration using the global manager. Load configuration using the global manager.
Args: Args:
force_reload: Force reload even if file hasn't changed force_reload: Force reload even if file hasn't changed
config_file: Optional path to config file (default: "config.json" in root)
data_dir: Optional custom data directory path
Returns: Returns:
AppConfig instance AppConfig instance
""" """
return get_config_manager(config_file, data_dir).load_config(force_reload) return get_config_manager().load_config(force_reload)
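Note on the platform-detection hunk in this file: the side that includes Linux maps `platform.system()` to a key and then looks that key up in `yt_dlp_paths`, falling back to the Windows entry. A minimal standalone sketch of that behavior follows. The `system` override parameter is an assumption added here so the mapping can be exercised without changing machines; it is not part of the diff.

```python
import platform
from typing import Dict, Optional


def detect_platform(system: Optional[str] = None) -> str:
    """Map an OS name to the key used in yt_dlp_paths.

    `system` is a hypothetical override for testing; by default the
    value comes from platform.system().
    """
    name = (system or platform.system()).lower()
    if name == "darwin":
        return "macos"
    if name in ("windows", "linux"):
        return name
    # Unknown platforms fall back to the Windows key, as in the diff.
    return "windows"


def get_platform_yt_dlp_path(
    platform_paths: Dict[str, str], system: Optional[str] = None
) -> str:
    """Pick the yt-dlp binary path for the detected platform."""
    key = detect_platform(system)
    return platform_paths.get(
        key, platform_paths.get("windows", "downloader/yt-dlp.exe")
    )
```

One consequence worth noting: on the side of the diff without a `"linux"` entry, a Linux host resolves to the Windows binary path, which would likely fail at execution time rather than at configuration time.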


@ -1,184 +0,0 @@
"""
Data path management utilities for the karaoke downloader.
Provides centralized data directory path management and file path resolution.
"""
import os
from pathlib import Path
from typing import Optional
from .config_manager import get_config_manager
class DataPathManager:
"""
Manages data directory paths and provides utilities for resolving file paths
relative to the configured data directory.
"""
def __init__(self, data_dir: Optional[str] = None):
"""
Initialize the data path manager.
Args:
data_dir: Optional custom data directory path. If None, uses config.
"""
self._data_dir = data_dir
# If a custom data directory is provided, look for config.json in that directory
if data_dir:
config_file = Path(data_dir) / "config.json"
self._config_manager = get_config_manager(str(config_file))
else:
# Otherwise, use the default config.json in the root directory
self._config_manager = get_config_manager()
@property
def data_dir(self) -> Path:
"""
Get the configured data directory path.
Returns:
Path to the data directory
"""
if self._data_dir:
return Path(self._data_dir)
# Get from config
config = self._config_manager.get_config()
data_dir = getattr(config.folder_structure, 'data_dir', 'data')
return Path(data_dir)
def get_path(self, filename: str) -> Path:
"""
Get the full path to a file in the data directory.
Args:
filename: Name of the file (e.g., 'config.json', 'channels.json')
Returns:
Full path to the file
"""
return self.data_dir / filename
def get_channels_json_path(self) -> Path:
"""Get path to channels.json file."""
return self.get_path('channels.json')
def get_channels_txt_path(self) -> Path:
"""Get path to channels.txt file."""
return self.get_path('channels.txt')
def get_songlist_path(self) -> Path:
"""Get path to songList.json file."""
return self.get_path('songList.json')
def get_songlist_tracking_path(self) -> Path:
"""Get path to songlist_tracking.json file."""
return self.get_path('songlist_tracking.json')
def get_karaoke_tracking_path(self) -> Path:
"""Get path to karaoke_tracking.json file."""
return self.get_path('karaoke_tracking.json')
def get_server_duplicates_tracking_path(self) -> Path:
"""Get path to server_duplicates_tracking.json file."""
return self.get_path('server_duplicates_tracking.json')
def get_manual_videos_path(self) -> Path:
"""Get path to manual_videos.json file."""
return self.get_path('manual_videos.json')
def get_songs_path(self) -> Path:
"""Get path to songs.json file."""
return self.get_path('songs.json')
def get_channel_cache_dir(self) -> Path:
"""Get path to channel_cache directory."""
return self.get_path('channel_cache')
def get_channel_cache_path(self, channel_id: str) -> Path:
"""Get path to a specific channel cache file."""
return self.get_channel_cache_dir() / f"{channel_id}.json"
def get_download_plan_cache_path(self, plan_name: str, **kwargs) -> Path:
"""Get path to download plan cache file."""
# Create a hash from kwargs for unique cache files
import hashlib
if kwargs:
kwargs_str = str(sorted(kwargs.items()))
hash_suffix = hashlib.md5(kwargs_str.encode()).hexdigest()[:8]
plan_name = f"{plan_name}_{hash_suffix}"
return self.get_path(f"plan_latest_per_channel_{plan_name}.json")
def get_unmatched_report_path(self, timestamp: Optional[str] = None) -> Path:
"""Get path to unmatched songs report file."""
if timestamp:
return self.get_path(f"unmatched_songs_report_{timestamp}.json")
return self.get_path("unmatched_songs_report.json")
def ensure_data_dir_exists(self) -> None:
"""Ensure the data directory exists."""
self.data_dir.mkdir(parents=True, exist_ok=True)
def list_data_files(self) -> list:
"""List all files in the data directory."""
if not self.data_dir.exists():
return []
files = []
for file_path in self.data_dir.iterdir():
if file_path.is_file():
files.append(file_path.name)
return sorted(files)
def file_exists(self, filename: str) -> bool:
"""Check if a file exists in the data directory."""
return self.get_path(filename).exists()
# Global data path manager instance
_data_path_manager: Optional[DataPathManager] = None
def get_data_path_manager(data_dir: Optional[str] = None) -> DataPathManager:
"""
Get the global data path manager instance.
Args:
data_dir: Optional custom data directory path
Returns:
DataPathManager instance
"""
global _data_path_manager
if _data_path_manager is None or data_dir is not None:
_data_path_manager = DataPathManager(data_dir)
return _data_path_manager
def get_data_path(filename: str, data_dir: Optional[str] = None) -> Path:
"""
Get the full path to a file in the data directory.
Args:
filename: Name of the file
data_dir: Optional custom data directory path
Returns:
Full path to the file
"""
return get_data_path_manager(data_dir).get_path(filename)
def get_data_dir(data_dir: Optional[str] = None) -> Path:
"""
Get the configured data directory path.
Args:
data_dir: Optional custom data directory path
Returns:
Path to the data directory
"""
return get_data_path_manager(data_dir).data_dir
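Note on `get_download_plan_cache_path` in the file above: the cache filename is made unique per option set by hashing the sorted keyword arguments and keeping the first 8 hex digits. A minimal standalone sketch of that naming scheme follows; the free function `plan_cache_path` and its explicit `data_dir` argument are hypothetical, introduced only so the example runs without the rest of the class.

```python
import hashlib
from pathlib import Path
from typing import Any, Union


def plan_cache_path(data_dir: Union[str, Path], plan_name: str, **kwargs: Any) -> Path:
    """Build a plan-cache path whose name depends on the given options.

    Hypothetical standalone version of the method shown in the diff.
    """
    if kwargs:
        # Sorting by key makes the hash independent of argument order.
        kwargs_str = str(sorted(kwargs.items()))
        hash_suffix = hashlib.md5(kwargs_str.encode()).hexdigest()[:8]
        plan_name = f"{plan_name}_{hash_suffix}"
    return Path(data_dir) / f"plan_latest_per_channel_{plan_name}.json"
```

MD5 is used here only as a short fingerprint for file naming, not for security. Because the key is built from `str()` of the values, two options that print identically (for example `85` and `"85"` do not, but two equal floats do) share a cache file.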


@ -20,12 +20,6 @@ from karaoke_downloader.youtube_utils import (
execute_yt_dlp_command, execute_yt_dlp_command,
show_available_formats, show_available_formats,
) )
from karaoke_downloader.file_utils import (
cleanup_temp_files,
get_unique_filename,
is_valid_mp4_file,
sanitize_filename,
)
class DownloadPipeline: class DownloadPipeline:
@ -69,15 +63,9 @@ class DownloadPipeline:
True if successful, False otherwise True if successful, False otherwise
""" """
try: try:
# Step 1: Prepare file path and check for existing files # Step 1: Prepare file path
output_path, file_exists = get_unique_filename(self.downloads_dir, channel_name, artist, title) filename = sanitize_filename(artist, title)
output_path = self.downloads_dir / channel_name / filename
if file_exists:
print(f"⏭️ Skipping download - file already exists: {output_path.name}")
# Still add tags and track the existing file
if self._add_tags(output_path, artist, title, channel_name):
self._track_download(output_path, artist, title, video_id, channel_name)
return True
# Step 2: Download video # Step 2: Download video
if not self._download_video(video_id, output_path, artist, title, channel_name): if not self._download_video(video_id, output_path, artist, title, channel_name):
@ -226,10 +214,8 @@ class DownloadPipeline:
) -> bool: ) -> bool:
"""Step 3: Add ID3 tags to the downloaded file.""" """Step 3: Add ID3 tags to the downloaded file."""
try: try:
# Use the same artist/title as the filename for consistency
# Don't add "(Karaoke Version)" to the ID3 tag title
add_id3_tags( add_id3_tags(
output_path, f"{artist} - {title}", channel_name output_path, f"{artist} - {title} (Karaoke Version)", channel_name
) )
print(f"🏷️ Added ID3 tags: {artist} - {title}") print(f"🏷️ Added ID3 tags: {artist} - {title}")
return True return True
@ -297,10 +283,9 @@ class DownloadPipeline:
video_title = video.get("title", "") video_title = video.get("title", "")
# Extract artist and title from video title # Extract artist and title from video title
from karaoke_downloader.channel_parser import ChannelParser from karaoke_downloader.id3_utils import extract_artist_title
channel_parser = ChannelParser() artist, title = extract_artist_title(video_title)
artist, title = channel_parser.extract_artist_title(video_title, channel_name)
print(f" ({i}/{total}) Processing: {artist} - {title}") print(f" ({i}/{total}) Processing: {artist} - {title}")


@ -3,31 +3,19 @@ Download plan building utilities.
Handles pre-scanning channels and building download plans. Handles pre-scanning channels and building download plans.
""" """
import concurrent.futures
import hashlib
import json
import sys
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple
from karaoke_downloader.cache_manager import ( from karaoke_downloader.cache_manager import (
delete_plan_cache, delete_plan_cache,
get_download_plan_cache_file, get_download_plan_cache_file,
load_cached_plan, load_cached_plan,
save_plan_cache, save_plan_cache,
) )
# Import all fuzzy matching functions
from karaoke_downloader.fuzzy_matcher import ( from karaoke_downloader.fuzzy_matcher import (
create_song_key, create_song_key,
create_video_key, extract_artist_title,
get_similarity_function, get_similarity_function,
is_exact_match, is_exact_match,
is_fuzzy_match, is_fuzzy_match,
normalize_title,
) )
from karaoke_downloader.channel_parser import ChannelParser
from karaoke_downloader.data_path_manager import get_data_path_manager
from karaoke_downloader.youtube_utils import get_channel_info from karaoke_downloader.youtube_utils import get_channel_info
# Constants # Constants
@ -35,156 +23,6 @@ DEFAULT_FILENAME_LENGTH_LIMIT = 100
DEFAULT_ARTIST_LENGTH_LIMIT = 30 DEFAULT_ARTIST_LENGTH_LIMIT = 30
DEFAULT_TITLE_LENGTH_LIMIT = 60 DEFAULT_TITLE_LENGTH_LIMIT = 60
DEFAULT_FUZZY_THRESHOLD = 85 DEFAULT_FUZZY_THRESHOLD = 85
DEFAULT_DISPLAY_LIMIT = 10
def generate_unmatched_report(unmatched: List[Dict[str, Any]], report_path: str = None) -> str:
"""
Generate a detailed report of unmatched songs and save it to a file.
Args:
unmatched: List of unmatched songs from build_download_plan
report_path: Optional path to save the report (default: data/unmatched_songs_report.json)
Returns:
Path to the saved report file
"""
if report_path is None:
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
report_path = str(get_data_path_manager().get_unmatched_report_path(timestamp))
report_data = {
"generated_at": datetime.now().isoformat(),
"total_unmatched": len(unmatched),
"unmatched_songs": []
}
for song in unmatched:
report_data["unmatched_songs"].append({
"artist": song["artist"],
"title": song["title"],
"position": song.get("position", 0),
"search_key": create_song_key(song["artist"], song["title"])
})
# Sort by artist, then by title for easier reading
report_data["unmatched_songs"].sort(key=lambda x: (x["artist"].lower(), x["title"].lower()))
# Ensure the data directory exists
report_file = Path(report_path)
report_file.parent.mkdir(parents=True, exist_ok=True)
# Save the report
with open(report_file, 'w', encoding='utf-8') as f:
json.dump(report_data, f, indent=2, ensure_ascii=False)
return str(report_file)
def _scan_channel_for_matches(
channel_url,
channel_name,
channel_id,
song_keys,
song_lookup,
fuzzy_match,
fuzzy_threshold,
show_pagination,
yt_dlp_path,
tracker,
):
"""
Scan a single channel for matches (used in parallel processing).
Args:
channel_url: URL of the channel to scan
channel_name: Name of the channel
channel_id: ID of the channel
song_keys: Set of song keys to match against
song_lookup: Dictionary mapping song keys to song data
fuzzy_match: Whether to use fuzzy matching
fuzzy_threshold: Threshold for fuzzy matching
show_pagination: Whether to show pagination progress
yt_dlp_path: Path to yt-dlp executable
tracker: Tracking manager instance
Returns:
List of video matches found in this channel
"""
print(f"\n🚦 Scanning channel: {channel_name} ({channel_url})")
# Get channel info if not provided
if not channel_name or not channel_id:
channel_name, channel_id = get_channel_info(channel_url)
# Fetch video list from channel
available_videos = tracker.get_channel_video_list(
channel_url, yt_dlp_path=str(yt_dlp_path), force_refresh=False, show_pagination=show_pagination
)
print(f" 📊 Channel has {len(available_videos)} videos to scan")
video_matches = []
# Pre-process video titles for efficient matching
channel_parser = ChannelParser()
if fuzzy_match:
# For fuzzy matching, create normalized video keys
for video in available_videos:
v_artist, v_title = channel_parser.extract_artist_title(video["title"], channel_name)
video_key = create_song_key(v_artist, v_title)
# Find best match among remaining songs
best_match = None
best_score = 0
for song_key in song_keys:
if song_key in song_lookup: # Only check unmatched songs
score = get_similarity_function()(song_key, video_key)
if score >= fuzzy_threshold and score > best_score:
best_score = score
best_match = song_key
if best_match:
song = song_lookup[best_match]
video_matches.append(
{
"artist": song["artist"],
"title": song["title"],
"channel_name": channel_name,
"channel_url": channel_url,
"video_id": video["id"],
"video_title": video["title"],
"match_score": best_score,
}
)
# Remove matched song from future consideration
del song_lookup[best_match]
song_keys.remove(best_match)
else:
# For exact matching, use direct key comparison
for video in available_videos:
v_artist, v_title = channel_parser.extract_artist_title(video["title"], channel_name)
video_key = create_song_key(v_artist, v_title)
if video_key in song_keys:
song = song_lookup[video_key]
video_matches.append(
{
"artist": song["artist"],
"title": song["title"],
"channel_name": channel_name,
"channel_url": channel_url,
"video_id": video["id"],
"video_title": video["title"],
"match_score": 100,
}
)
# Remove matched song from future consideration
del song_lookup[video_key]
song_keys.remove(video_key)
print(f" ✅ Found {len(video_matches)} matches in {channel_name}")
return video_matches
def build_download_plan( def build_download_plan(
@ -194,9 +32,6 @@ def build_download_plan(
yt_dlp_path, yt_dlp_path,
fuzzy_match=False, fuzzy_match=False,
fuzzy_threshold=DEFAULT_FUZZY_THRESHOLD, fuzzy_threshold=DEFAULT_FUZZY_THRESHOLD,
show_pagination=False,
parallel_channels=False,
max_channel_workers=3,
): ):
""" """
For each song in undownloaded, scan all channels for a match. For each song in undownloaded, scan all channels for a match.
@ -217,120 +52,6 @@ def build_download_plan(
song_keys.add(key) song_keys.add(key)
song_lookup[key] = song song_lookup[key] = song
if parallel_channels:
print(f"🚀 Running parallel channel scanning with {max_channel_workers} workers.")
# Create a thread-safe copy of song data for parallel processing
import threading
song_keys_lock = threading.Lock()
song_lookup_lock = threading.Lock()
def scan_channel_safe(channel_url):
"""Thread-safe channel scanning function."""
print(f"\n🚦 Scanning channel: {channel_url}")
# Get channel info
channel_name, channel_id = get_channel_info(channel_url)
print(f" ✅ Channel info: {channel_name} (ID: {channel_id})")
# Fetch video list from channel
available_videos = tracker.get_channel_video_list(
channel_url, yt_dlp_path=str(yt_dlp_path), force_refresh=False, show_pagination=show_pagination
)
print(f" 📊 Channel has {len(available_videos)} videos to scan")
video_matches = []
# Pre-process video titles for efficient matching
channel_parser = ChannelParser()
if fuzzy_match:
# For fuzzy matching, create normalized video keys
for video in available_videos:
v_artist, v_title = channel_parser.extract_artist_title(video["title"], channel_name)
video_key = create_song_key(v_artist, v_title)
# Find best match among remaining songs (thread-safe)
best_match = None
best_score = 0
with song_keys_lock:
available_song_keys = list(song_keys) # Copy for iteration
for song_key in available_song_keys:
with song_lookup_lock:
if song_key in song_lookup: # Only check unmatched songs
score = get_similarity_function()(song_key, video_key)
if score >= fuzzy_threshold and score > best_score:
best_score = score
best_match = song_key
if best_match:
with song_lookup_lock:
if best_match in song_lookup: # Double-check it's still available
song = song_lookup[best_match]
video_matches.append(
{
"artist": song["artist"],
"title": song["title"],
"channel_name": channel_name,
"channel_url": channel_url,
"video_id": video["id"],
"video_title": video["title"],
"match_score": best_score,
}
)
# Remove matched song from future consideration
del song_lookup[best_match]
with song_keys_lock:
song_keys.discard(best_match)
else:
# For exact matching, use direct key comparison
for video in available_videos:
v_artist, v_title = channel_parser.extract_artist_title(video["title"], channel_name)
video_key = create_song_key(v_artist, v_title)
with song_lookup_lock:
if video_key in song_keys and video_key in song_lookup:
song = song_lookup[video_key]
video_matches.append(
{
"artist": song["artist"],
"title": song["title"],
"channel_name": channel_name,
"channel_url": channel_url,
"video_id": video["id"],
"video_title": video["title"],
"match_score": 100,
}
)
# Remove matched song from future consideration
del song_lookup[video_key]
with song_keys_lock:
song_keys.discard(video_key)
print(f" ✅ Found {len(video_matches)} matches in {channel_name}")
return video_matches
# Execute parallel channel scanning
with concurrent.futures.ThreadPoolExecutor(max_workers=max_channel_workers) as executor:
# Submit all channel scanning tasks
future_to_channel = {
executor.submit(scan_channel_safe, channel_url): channel_url
for channel_url in channel_urls
}
# Process results as they complete
for future in concurrent.futures.as_completed(future_to_channel):
channel_url = future_to_channel[future]
try:
video_matches = future.result()
plan.extend(video_matches)
channel_name, _ = get_channel_info(channel_url)
channel_match_counts[channel_name] = len(video_matches)
except Exception as e:
print(f"⚠️ Error processing channel {channel_url}: {e}")
channel_name, _ = get_channel_info(channel_url)
channel_match_counts[channel_name] = 0
else:
for i, channel_url in enumerate(channel_urls, 1): for i, channel_url in enumerate(channel_urls, 1):
print(f"\n🚦 Starting channel {i}/{len(channel_urls)}: {channel_url}") print(f"\n🚦 Starting channel {i}/{len(channel_urls)}: {channel_url}")
print(f" 🔍 Getting channel info...") print(f" 🔍 Getting channel info...")
@ -338,7 +59,7 @@ def build_download_plan(
print(f" ✅ Channel info: {channel_name} (ID: {channel_id})") print(f" ✅ Channel info: {channel_name} (ID: {channel_id})")
print(f" 🔍 Fetching video list from channel...") print(f" 🔍 Fetching video list from channel...")
available_videos = tracker.get_channel_video_list( available_videos = tracker.get_channel_video_list(
channel_url, yt_dlp_path=str(yt_dlp_path), force_refresh=False, show_pagination=show_pagination channel_url, yt_dlp_path=str(yt_dlp_path), force_refresh=False
) )
print( print(
f" 📊 Channel has {len(available_videos)} videos to scan against {len(undownloaded)} songlist songs" f" 📊 Channel has {len(available_videos)} videos to scan against {len(undownloaded)} songlist songs"
@ -347,11 +68,10 @@ def build_download_plan(
video_matches = [] # Initialize video_matches for this channel video_matches = [] # Initialize video_matches for this channel
# Pre-process video titles for efficient matching # Pre-process video titles for efficient matching
channel_parser = ChannelParser()
if fuzzy_match: if fuzzy_match:
# For fuzzy matching, create normalized video keys # For fuzzy matching, create normalized video keys
for video in available_videos: for video in available_videos:
v_artist, v_title = channel_parser.extract_artist_title(video["title"], channel_name) v_artist, v_title = extract_artist_title(video["title"])
video_key = create_song_key(v_artist, v_title) video_key = create_song_key(v_artist, v_title)
# Find best match among remaining songs # Find best match among remaining songs
@ -384,7 +104,7 @@ def build_download_plan(
else: else:
# For exact matching, use direct key comparison # For exact matching, use direct key comparison
for video in available_videos: for video in available_videos:
v_artist, v_title = channel_parser.extract_artist_title(video["title"], channel_name) v_artist, v_title = extract_artist_title(video["title"])
video_key = create_song_key(v_artist, v_title) video_key = create_song_key(v_artist, v_title)
if video_key in song_keys: if video_key in song_keys:
@ -423,13 +143,4 @@ def build_download_plan(
f" TOTAL: {sum(channel_match_counts.values())} matches across {len(channel_match_counts)} channels." f" TOTAL: {sum(channel_match_counts.values())} matches across {len(channel_match_counts)} channels."
) )
# Generate unmatched songs report if there are any
if unmatched:
try:
report_file = generate_unmatched_report(unmatched)
print(f"\n📋 Unmatched songs report saved to: {report_file}")
print(f"📋 Total unmatched songs: {len(unmatched)}")
except Exception as e:
print(f"⚠️ Could not generate unmatched songs report: {e}")
return plan, unmatched return plan, unmatched
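Note on the matching loop in `build_download_plan`: for each video, the code either looks up the video key directly (exact mode) or scores it against every remaining song key and keeps the best score at or above the threshold (fuzzy mode); a matched song is then removed so it cannot match again. The sketch below illustrates that control flow only. It makes several assumptions: `song_key` is a hypothetical stand-in for `create_song_key`, `difflib.SequenceMatcher` stands in for whatever `get_similarity_function()` returns in the project, and videos are assumed to arrive with artist and title already parsed.

```python
from difflib import SequenceMatcher
from typing import Any, Dict, List, Optional, Tuple


def song_key(artist: str, title: str) -> str:
    """Hypothetical normalized key; the project's create_song_key may differ."""
    return f"{artist.strip().lower()}_{title.strip().lower()}"


def similarity(a: str, b: str) -> float:
    """Stand-in scorer on a 0-100 scale (assumption, not the project's function)."""
    return SequenceMatcher(None, a, b).ratio() * 100


def match_videos(
    songs: List[Dict[str, Any]],
    videos: List[Dict[str, Any]],
    fuzzy_threshold: Optional[float] = None,
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Return (plan, unmatched). Exact matching when fuzzy_threshold is None."""
    remaining = {song_key(s["artist"], s["title"]): s for s in songs}
    plan: List[Dict[str, Any]] = []
    for video in videos:
        video_key = song_key(video["artist"], video["title"])
        best_key, best_score = None, 0.0
        if fuzzy_threshold is None:
            if video_key in remaining:
                best_key, best_score = video_key, 100.0
        else:
            for key in remaining:
                score = similarity(key, video_key)
                if score >= fuzzy_threshold and score > best_score:
                    best_key, best_score = key, score
        if best_key is not None:
            # Remove the song so later videos cannot claim it again.
            song = remaining.pop(best_key)
            plan.append(
                {
                    "artist": song["artist"],
                    "title": song["title"],
                    "video_id": video["id"],
                    "match_score": best_score,
                }
            )
    return plan, list(remaining.values())
```

Because each song is claimed by the first video that matches it, the order of channels and videos decides which upload wins when several would qualify.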

File diff suppressed because it is too large


@ -34,6 +34,7 @@ def sanitize_filename(
# Clean up title # Clean up title
safe_title = ( safe_title = (
title.replace("(From ", "") title.replace("(From ", "")
.replace(")", "")
.replace(" - ", " ") .replace(" - ", " ")
.replace(":", "") .replace(":", "")
) )
@ -53,18 +54,11 @@ def sanitize_filename(
) )
safe_artist = safe_artist.strip() safe_artist = safe_artist.strip()
# Create filename - handle empty artist case # Create filename
if not safe_artist or safe_artist.strip() == "":
# If no artist, just use the title
filename = f"{safe_title}.mp4"
else:
filename = f"{safe_artist} - {safe_title}.mp4" filename = f"{safe_artist} - {safe_title}.mp4"
# Limit filename length if needed # Limit filename length if needed
if len(filename) > max_length: if len(filename) > max_length:
if not safe_artist or safe_artist.strip() == "":
filename = f"{safe_title[:DEFAULT_TITLE_LENGTH_LIMIT]}.mp4"
else:
filename = f"{safe_artist[:DEFAULT_ARTIST_LENGTH_LIMIT]} - {safe_title[:DEFAULT_TITLE_LENGTH_LIMIT]}.mp4" filename = f"{safe_artist[:DEFAULT_ARTIST_LENGTH_LIMIT]} - {safe_title[:DEFAULT_TITLE_LENGTH_LIMIT]}.mp4"
return filename return filename
@ -87,14 +81,6 @@ def generate_possible_filenames(
safe_title = sanitize_title_for_filenames(title) safe_title = sanitize_title_for_filenames(title)
safe_artist = artist.replace("'", "").replace('"', "").strip() safe_artist = artist.replace("'", "").replace('"', "").strip()
# Handle empty artist case
if not safe_artist or safe_artist.strip() == "":
return [
f"{safe_title}.mp4", # Songlist mode (no artist)
f"{channel_name} - {safe_title}.mp4", # Latest-per-channel mode
f"{safe_title} (Karaoke Version).mp4", # Channel videos mode (no artist)
]
else:
return [ return [
f"{safe_artist} - {safe_title}.mp4", # Songlist mode f"{safe_artist} - {safe_title}.mp4", # Songlist mode
f"{channel_name} - {safe_title}.mp4", # Latest-per-channel mode f"{channel_name} - {safe_title}.mp4", # Latest-per-channel mode
@ -126,7 +112,6 @@ def check_file_exists_with_patterns(
) -> Tuple[bool, Optional[Path]]: ) -> Tuple[bool, Optional[Path]]:
""" """
Check if a file exists using multiple possible filename patterns. Check if a file exists using multiple possible filename patterns.
Also checks for files with (2), (3), etc. suffixes that yt-dlp might create.
Args: Args:
downloads_dir: Base downloads directory downloads_dir: Base downloads directory
@ -145,56 +130,15 @@ def check_file_exists_with_patterns(
# Apply length limits if needed # Apply length limits if needed
safe_artist = artist.replace("'", "").replace('"', "").strip() safe_artist = artist.replace("'", "").replace('"', "").strip()
safe_title = sanitize_title_for_filenames(title) safe_title = sanitize_title_for_filenames(title)
if not safe_artist or safe_artist.strip() == "":
filename = f"{safe_title[:DEFAULT_TITLE_LENGTH_LIMIT]}.mp4"
else:
filename = f"{safe_artist[:DEFAULT_ARTIST_LENGTH_LIMIT]} - {safe_title[:DEFAULT_TITLE_LENGTH_LIMIT]}.mp4" filename = f"{safe_artist[:DEFAULT_ARTIST_LENGTH_LIMIT]} - {safe_title[:DEFAULT_TITLE_LENGTH_LIMIT]}.mp4"
# Check for exact filename match
file_path = channel_dir / filename file_path = channel_dir / filename
if file_path.exists() and file_path.stat().st_size > 0: if file_path.exists() and file_path.stat().st_size > 0:
return True, file_path return True, file_path
# Check for files with (2), (3), etc. suffixes
base_name = filename.replace(".mp4", "")
for suffix in range(2, 10): # Check up to (9)
suffixed_filename = f"{base_name} ({suffix}).mp4"
suffixed_path = channel_dir / suffixed_filename
if suffixed_path.exists() and suffixed_path.stat().st_size > 0:
return True, suffixed_path
return False, None return False, None
def get_unique_filename(
downloads_dir: Path, channel_name: str, artist: str, title: str
) -> Tuple[Path, bool]:
"""
Get a unique filename for download, checking for existing files including duplicates.
Args:
downloads_dir: Base downloads directory
channel_name: Channel name
artist: Song artist
title: Song title
Returns:
Tuple of (file_path, is_existing) where is_existing indicates if a file already exists
"""
filename = sanitize_filename(artist, title)
channel_dir = downloads_dir / channel_name
file_path = channel_dir / filename
# Check if file already exists
exists, existing_path = check_file_exists_with_patterns(downloads_dir, channel_name, artist, title)
if exists and existing_path:
print(f"📁 File already exists: {existing_path.name}")
return existing_path, True
return file_path, False
def ensure_directory_exists(directory: Path) -> None: def ensure_directory_exists(directory: Path) -> None:
""" """
Ensure a directory exists, creating it if necessary. Ensure a directory exists, creating it if necessary.
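
For reference, the suffix-aware existence check that this hunk removes can be sketched on its own. This is a minimal sketch, not the project's code: `find_existing_download` and `max_suffix` are hypothetical names, and the `(2)` through `(9)` range simply mirrors the removed loop.

```python
from pathlib import Path
from typing import Optional, Tuple


def find_existing_download(
    channel_dir: Path, filename: str, max_suffix: int = 9
) -> Tuple[bool, Optional[Path]]:
    """Return (True, path) for the first non-empty file among
    `name.mp4`, `name (2).mp4`, ..., `name (max_suffix).mp4`."""
    stem = Path(filename).stem      # "Artist - Title"
    ext = Path(filename).suffix     # ".mp4"

    # Exact name first, then the numbered variants a downloader may create.
    candidates = [channel_dir / filename]
    candidates += [
        channel_dir / f"{stem} ({n}){ext}" for n in range(2, max_suffix + 1)
    ]

    for candidate in candidates:
        # Zero-byte files are treated as failed downloads, as in the hunk above.
        if candidate.is_file() and candidate.stat().st_size > 0:
            return True, candidate
    return False, None
```

Without the numbered candidates, a file saved as `Artist - Title (2).mp4` is not recognized as already downloaded.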

View File

@ -32,72 +32,10 @@ def normalize_title(title):
def extract_artist_title(video_title): def extract_artist_title(video_title):
""" """Extract artist and title from video title."""
Extract artist and title from video title.
This function handles multiple common video title formats found on YouTube karaoke channels:
1. "Artist - Title" format: "38 Special - Hold On Loosely"
2. "Title Karaoke | Artist Karaoke Version" format: "Hold On Loosely Karaoke | 38 Special Karaoke Version"
3. "Title Artist KARAOKE" format: "Hold On Loosely 38 Special KARAOKE"
Args:
video_title (str): The YouTube video title to parse
Returns:
tuple: (artist, title) where artist and title are strings. If parsing fails,
artist will be empty string and title will be the full video title.
Examples:
>>> extract_artist_title("38 Special - Hold On Loosely")
("38 Special", "Hold On Loosely")
>>> extract_artist_title("Hold On Loosely Karaoke | 38 Special Karaoke Version")
("38 Special", "Hold On Loosely")
>>> extract_artist_title("Unknown Format Video Title")
("", "Unknown Format Video Title")
"""
# Handle "Artist - Title" format
if " - " in video_title: if " - " in video_title:
parts = video_title.split(" - ", 1) parts = video_title.split(" - ", 1)
return parts[0].strip(), parts[1].strip() return parts[0].strip(), parts[1].strip()
# Handle "Title Karaoke | Artist Karaoke Version" format
if " | " in video_title and "karaoke" in video_title.lower():
parts = video_title.split(" | ", 1)
title_part = parts[0].strip()
artist_part = parts[1].strip()
# Clean up the parts
title = title_part.replace("Karaoke", "").strip()
artist = artist_part.replace("Karaoke Version", "").strip()
return artist, title
# Handle "Title Artist KARAOKE" format
if "karaoke" in video_title.lower():
# Try to find the artist by looking for common patterns
title_lower = video_title.lower()
# Look for patterns like "Title Artist KARAOKE"
# This is a simplified approach - we'll need to improve this
words = video_title.split()
if len(words) >= 3:
# Assume the last word before "KARAOKE" is part of the artist
for i, word in enumerate(words):
if "karaoke" in word.lower():
if i >= 2:
# Everything before the last word before KARAOKE is title
# Everything after is artist
title = " ".join(words[:i-1])
artist = " ".join(words[i-1:])
return artist, title
# If we can't parse it, return empty artist and full title
return "", video_title
# Default: return empty artist and full title
return "", video_title return "", video_title
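
The first two title formats listed in the removed docstring can be illustrated with a short standalone parser. This is a sketch under assumptions: `parse_video_title` is a hypothetical name, and it strips a trailing "Karaoke" or "Karaoke Version" with a regular expression rather than the `str.replace` calls used in the hunk.

```python
import re
from typing import Tuple

# Matches a trailing "Karaoke" or "Karaoke Version", case-insensitively.
_KARAOKE_TAIL = re.compile(r"\s*karaoke(\s+version)?\s*$", re.IGNORECASE)


def parse_video_title(video_title: str) -> Tuple[str, str]:
    """Return (artist, title); artist is "" when the format is not recognized."""
    # Format 1: "Artist - Title"
    if " - " in video_title:
        artist, title = video_title.split(" - ", 1)
        return artist.strip(), title.strip()

    # Format 2: "Title Karaoke | Artist Karaoke Version"
    if " | " in video_title and "karaoke" in video_title.lower():
        title_part, artist_part = video_title.split(" | ", 1)
        title = _KARAOKE_TAIL.sub("", title_part).strip()
        artist = _KARAOKE_TAIL.sub("", artist_part).strip()
        return artist, title

    # Fallback: unknown format, keep the full title and leave the artist empty.
    return "", video_title
```

With only the `" - "` branch, as on the right-hand side of this hunk, a pipe-separated title falls through to the empty-artist fallback.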

View File

@ -7,33 +7,17 @@ except ImportError:
MUTAGEN_AVAILABLE = False MUTAGEN_AVAILABLE = False
def clean_channel_name(channel_name: str) -> str: def extract_artist_title(video_title):
""" title = (
Clean channel name for ID3 tagging by removing @ symbol and ensuring it's alpha-only. video_title.replace("(Karaoke Version)", "").replace("(Karaoke)", "").strip()
)
Args: if " - " in title:
channel_name: Raw channel name (may contain @ symbol) parts = title.split(" - ", 1)
if len(parts) == 2:
Returns: artist = parts[0].strip()
Cleaned channel name suitable for ID3 tags song_title = parts[1].strip()
""" return artist, song_title
# Remove @ symbol if present return "Unknown Artist", title
if channel_name.startswith('@'):
channel_name = channel_name[1:]
# Remove any non-alphanumeric characters and convert to single word
# Keep only letters, numbers, and spaces, then take the first word
cleaned = re.sub(r'[^a-zA-Z0-9\s]', '', channel_name)
words = cleaned.split()
if words:
return words[0] # Return only the first word
return "Unknown"
# Import the enhanced extract_artist_title function from fuzzy_matcher.py
# This ensures consistent parsing across all modules and supports multiple video title formats
from karaoke_downloader.fuzzy_matcher import extract_artist_title
def add_id3_tags(file_path, video_title, channel_name): def add_id3_tags(file_path, video_title, channel_name):
@ -42,13 +26,12 @@ def add_id3_tags(file_path, video_title, channel_name):
return return
try: try:
artist, title = extract_artist_title(video_title) artist, title = extract_artist_title(video_title)
clean_channel = clean_channel_name(channel_name)
mp4 = MP4(str(file_path)) mp4 = MP4(str(file_path))
mp4["\xa9nam"] = title mp4["\xa9nam"] = title
mp4["\xa9ART"] = artist mp4["\xa9ART"] = artist
mp4["\xa9alb"] = clean_channel # Use clean channel name only, no suffix mp4["\xa9alb"] = f"{channel_name} Karaoke"
mp4["\xa9gen"] = "Karaoke" mp4["\xa9gen"] = "Karaoke"
mp4.save() mp4.save()
print(f"📝 Added ID3 tags: Artist='{artist}', Title='{title}', Album='{clean_channel}'") print(f"📝 Added ID3 tags: Artist='{artist}', Title='{title}'")
except Exception as e: except Exception as e:
print(f"⚠️ Could not add ID3 tags: {e}") print(f"⚠️ Could not add ID3 tags: {e}")
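
The left-hand `clean_channel_name` helper is interleaved with the replacement code in this view. Reassembled from those lines (indentation is assumed), it reads roughly as follows. Note that although its docstring says "alpha-only", the regular expression keeps digits as well.

```python
import re


def clean_channel_name(channel_name: str) -> str:
    """Strip a leading '@', drop punctuation, and keep only the first word."""
    if channel_name.startswith("@"):
        channel_name = channel_name[1:]

    # Keep letters, digits, and whitespace; everything else is removed.
    cleaned = re.sub(r"[^a-zA-Z0-9\s]", "", channel_name)
    words = cleaned.split()
    return words[0] if words else "Unknown"
```

A handle such as `@SingKingKaraoke` becomes the album tag `SingKingKaraoke`, whereas the right-hand side of the hunk writes `f"{channel_name} Karaoke"` and keeps the `@`.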

View File

@ -1,83 +0,0 @@
"""
Manual video manager for handling static video collections.
"""
import json
from pathlib import Path
from typing import Dict, List, Optional, Any
from karaoke_downloader.data_path_manager import get_data_path_manager
def load_manual_videos(manual_file: str = None) -> List[Dict[str, Any]]:
if manual_file is None:
manual_file = str(get_data_path_manager().get_manual_videos_path())
"""
Load manual videos from the JSON file.
Args:
manual_file: Path to manual videos JSON file
Returns:
List of video dictionaries
"""
manual_path = Path(manual_file)
if not manual_path.exists():
print(f"⚠️ Manual videos file not found: {manual_file}")
return []
try:
with open(manual_path, 'r', encoding='utf-8') as f:
data = json.load(f)
videos = data.get("videos", [])
print(f"📋 Loaded {len(videos)} manual videos from {manual_file}")
return videos
except Exception as e:
print(f"❌ Error loading manual videos: {e}")
return []
def get_manual_videos_for_channel(channel_name: str, manual_file: str = None) -> List[Dict[str, Any]]:
if manual_file is None:
manual_file = str(get_data_path_manager().get_manual_videos_path())
"""
Get manual videos for a specific channel.
Args:
channel_name: Channel name (should be "@ManualVideos")
manual_file: Path to manual videos JSON file
Returns:
List of video dictionaries
"""
if channel_name != "@ManualVideos":
return []
return load_manual_videos(manual_file)
def is_manual_channel(channel_url: str) -> bool:
"""
Check if a channel URL is a manual channel.
Args:
channel_url: Channel URL
Returns:
True if it's a manual channel
"""
return channel_url == "manual://static"
def get_manual_channel_info(channel_url: str) -> tuple[str, str]:
"""
Get channel info for manual channels.
Args:
channel_url: Channel URL
Returns:
Tuple of (channel_name, channel_id)
"""
if channel_url == "manual://static":
return "@ManualVideos", "manual"
return None, None

View File

@ -56,6 +56,14 @@ def update_resolution(resolution):
"include_console": True, "include_console": True,
"include_file": True, "include_file": True,
}, },
"platform_settings": {
"auto_detect_platform": True,
"yt_dlp_paths": {
"windows": "downloader/yt-dlp.exe",
"macos": "downloader/yt-dlp_macos",
"linux": "downloader/yt-dlp",
},
},
"yt_dlp_path": "downloader/yt-dlp.exe", "yt_dlp_path": "downloader/yt-dlp.exe",
} }
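
One way the new `platform_settings` block could be consumed is sketched below. `resolve_yt_dlp_path` and its parameters are hypothetical; only the dictionary keys and paths come from the hunk, and the fallback mirrors the top-level `yt_dlp_path` default.

```python
import platform
from typing import Any, Dict, Optional

PLATFORM_SETTINGS: Dict[str, Any] = {
    "auto_detect_platform": True,
    "yt_dlp_paths": {
        "windows": "downloader/yt-dlp.exe",
        "macos": "downloader/yt-dlp_macos",
        "linux": "downloader/yt-dlp",
    },
}

# platform.system() reports "Windows", "Darwin", or "Linux" on these systems.
_SYSTEM_TO_KEY = {"Windows": "windows", "Darwin": "macos", "Linux": "linux"}


def resolve_yt_dlp_path(
    settings: Dict[str, Any],
    fallback: str = "downloader/yt-dlp.exe",
    system_name: Optional[str] = None,
) -> str:
    """Pick the yt-dlp binary for the current OS, or `fallback` if unknown."""
    if not settings.get("auto_detect_platform", False):
        return fallback
    key = _SYSTEM_TO_KEY.get(system_name or platform.system())
    return settings.get("yt_dlp_paths", {}).get(key, fallback)
```

Passing `system_name` explicitly makes the lookup testable without depending on the host OS.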

View File

@ -7,40 +7,28 @@ import json
from datetime import datetime from datetime import datetime
from pathlib import Path from pathlib import Path
from karaoke_downloader.data_path_manager import get_data_path_manager
def load_server_songs(songs_path="data/songs.json"):
def load_server_songs(songs_path=None): """Load the list of songs already available on the server."""
if songs_path is None:
songs_path = str(get_data_path_manager().get_songs_path())
"""Load the list of songs already available on the server with format information."""
songs_file = Path(songs_path) songs_file = Path(songs_path)
if not songs_file.exists(): if not songs_file.exists():
print(f"⚠️ Server songs file not found: {songs_path}") print(f"⚠️ Server songs file not found: {songs_path}")
return {} return set()
try: try:
with open(songs_file, "r", encoding="utf-8") as f: with open(songs_file, "r", encoding="utf-8") as f:
data = json.load(f) data = json.load(f)
server_songs = {} server_songs = set()
for song in data: for song in data:
if "artist" in song and "title" in song and "path" in song: if "artist" in song and "title" in song:
artist = song["artist"].strip() artist = song["artist"].strip()
title = song["title"].strip() title = song["title"].strip()
path = song["path"].strip()
key = f"{artist.lower()}_{normalize_title(title)}" key = f"{artist.lower()}_{normalize_title(title)}"
server_songs[key] = { server_songs.add(key)
"artist": artist,
"title": title,
"path": path,
"is_mp3": path.lower().endswith('.mp3'),
"is_cdg": 'cdg' in path.lower(),
"is_mp4": path.lower().endswith('.mp4')
}
print(f"📋 Loaded {len(server_songs)} songs from server (songs.json)") print(f"📋 Loaded {len(server_songs)} songs from server (songs.json)")
return server_songs return server_songs
except (json.JSONDecodeError, FileNotFoundError) as e: except (json.JSONDecodeError, FileNotFoundError) as e:
print(f"⚠️ Could not load server songs: {e}") print(f"⚠️ Could not load server songs: {e}")
return {} return set()
def is_song_on_server(server_songs, artist, title): def is_song_on_server(server_songs, artist, title):
@ -49,24 +37,9 @@ def is_song_on_server(server_songs, artist, title):
return key in server_songs return key in server_songs
def should_skip_server_song(server_songs, artist, title):
"""Check if a song should be skipped because it's already available as MP4 on server.
Returns True if the song should be skipped (MP4 format), False if it should be downloaded (MP3/CDG format)."""
key = f"{artist.lower()}_{normalize_title(title)}"
if key not in server_songs:
return False # Not on server, so don't skip
song_info = server_songs[key]
# Skip if it's an MP4 file (video format)
# Don't skip if it's MP3 or in CDG folder (different format)
return song_info.get("is_mp4", False) and not song_info.get("is_cdg", False)
def load_server_duplicates_tracking( def load_server_duplicates_tracking(
tracking_path=None, tracking_path="data/server_duplicates_tracking.json",
): ):
if tracking_path is None:
tracking_path = str(get_data_path_manager().get_server_duplicates_tracking_path())
"""Load the tracking of songs found to be duplicates on the server.""" """Load the tracking of songs found to be duplicates on the server."""
tracking_file = Path(tracking_path) tracking_file = Path(tracking_path)
if not tracking_file.exists(): if not tracking_file.exists():
@ -80,10 +53,8 @@ def load_server_duplicates_tracking(
def save_server_duplicates_tracking( def save_server_duplicates_tracking(
tracking, tracking_path=None tracking, tracking_path="data/server_duplicates_tracking.json"
): ):
if tracking_path is None:
tracking_path = str(get_data_path_manager().get_server_duplicates_tracking_path())
"""Save the tracking of songs found to be duplicates on the server.""" """Save the tracking of songs found to be duplicates on the server."""
try: try:
with open(tracking_path, "w", encoding="utf-8") as f: with open(tracking_path, "w", encoding="utf-8") as f:
@ -115,9 +86,8 @@ def mark_song_as_server_duplicate(tracking, artist, title, video_title, channel_
def check_and_mark_server_duplicate( def check_and_mark_server_duplicate(
server_songs, server_duplicates_tracking, artist, title, video_title, channel_name server_songs, server_duplicates_tracking, artist, title, video_title, channel_name
): ):
"""Check if a song should be skipped because it's already available as MP4 on server and mark it as duplicate if so. """Check if a song is on server and mark it as duplicate if so. Returns True if it's a duplicate."""
Returns True if it should be skipped (MP4 format), False if it should be downloaded (MP3/CDG format).""" if is_song_on_server(server_songs, artist, title):
if should_skip_server_song(server_songs, artist, title):
if not is_song_marked_as_server_duplicate( if not is_song_marked_as_server_duplicate(
server_duplicates_tracking, artist, title server_duplicates_tracking, artist, title
): ):
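
The behavioral difference in this file is between "on the server in any format" and "on the server as MP4". A minimal sketch of the format-aware variant follows. `song_key` is a stand-in for the project's `normalize_title`-based key (not shown in full here), and `index_server_songs` is a hypothetical name.

```python
from typing import Any, Dict, Iterable


def song_key(artist: str, title: str) -> str:
    # Stand-in normalization: lowercase and collapse whitespace.
    return f"{artist.strip().lower()}_{' '.join(title.split()).lower()}"


def index_server_songs(songs: Iterable[Dict[str, str]]) -> Dict[str, Dict[str, Any]]:
    """Build a key -> format-info mapping from songs.json-style entries."""
    index: Dict[str, Dict[str, Any]] = {}
    for song in songs:
        if not all(k in song for k in ("artist", "title", "path")):
            continue  # Entries without a path carry no format information.
        path = song["path"].strip()
        index[song_key(song["artist"], song["title"])] = {
            "path": path,
            "is_mp4": path.lower().endswith(".mp4"),
            "is_cdg": "cdg" in path.lower(),
        }
    return index


def should_skip_server_song(index: Dict[str, Dict[str, Any]], artist: str, title: str) -> bool:
    """Skip only when the server copy is an MP4 outside any CDG path."""
    info = index.get(song_key(artist, title))
    if info is None:
        return False  # Not on the server, so it should be downloaded.
    return info["is_mp4"] and not info["is_cdg"]
```

With a plain set of keys, an MP3 or CDG copy on the server is enough to mark the song as a duplicate; the mapping keeps enough detail to still download an MP4 in that case.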

View File

@ -35,7 +35,6 @@ class SongValidator:
video_title: Optional[str] = None, video_title: Optional[str] = None,
server_songs: Optional[Dict[str, Any]] = None, server_songs: Optional[Dict[str, Any]] = None,
server_duplicates_tracking: Optional[Dict[str, Any]] = None, server_duplicates_tracking: Optional[Dict[str, Any]] = None,
force_download: bool = False,
) -> Tuple[bool, Optional[str], int]: ) -> Tuple[bool, Optional[str], int]:
""" """
Check if a song should be skipped based on multiple criteria. Check if a song should be skipped based on multiple criteria.
@ -54,15 +53,10 @@ class SongValidator:
video_title: YouTube video title (optional) video_title: YouTube video title (optional)
server_songs: Server songs data (optional) server_songs: Server songs data (optional)
server_duplicates_tracking: Server duplicates tracking (optional) server_duplicates_tracking: Server duplicates tracking (optional)
force_download: If True, bypass all validation checks and force download
Returns: Returns:
Tuple of (should_skip, reason, total_filtered) Tuple of (should_skip, reason, total_filtered)
""" """
# If force download is enabled, skip all validation checks
if force_download:
return False, None, 0
total_filtered = 0 total_filtered = 0
# Check 1: Already downloaded by this system # Check 1: Already downloaded by this system

View File

@ -1,265 +0,0 @@
import json
import os
from pathlib import Path
from typing import List, Dict, Any, Optional
from mutagen.mp4 import MP4
from karaoke_downloader.data_path_manager import get_data_path_manager
class SongListGenerator:
"""Utility class for generating song lists from MP4 files with ID3 tags."""
def __init__(self, songlist_path: str = None):
if songlist_path is None:
songlist_path = str(get_data_path_manager().get_songlist_path())
self.songlist_path = Path(songlist_path)
self.songlist_path.parent.mkdir(parents=True, exist_ok=True)
def read_existing_songlist(self) -> List[Dict[str, Any]]:
"""Read existing song list from JSON file."""
if self.songlist_path.exists():
try:
with open(self.songlist_path, 'r', encoding='utf-8') as f:
return json.load(f)
except (json.JSONDecodeError, IOError) as e:
print(f"⚠️ Warning: Could not read existing songlist: {e}")
return []
return []
def save_songlist(self, songlist: List[Dict[str, Any]]) -> None:
"""Save song list to JSON file."""
try:
with open(self.songlist_path, 'w', encoding='utf-8') as f:
json.dump(songlist, f, indent=2, ensure_ascii=False)
print(f"✅ Song list saved to {self.songlist_path}")
except IOError as e:
print(f"❌ Error saving song list: {e}")
raise
def extract_id3_tags(self, mp4_path: Path) -> Optional[Dict[str, str]]:
"""Extract ID3 tags from MP4 file."""
try:
mp4 = MP4(str(mp4_path))
# Extract artist and title from ID3 tags
artist = mp4.get("\xa9ART", ["Unknown Artist"])[0] if "\xa9ART" in mp4 else "Unknown Artist"
title = mp4.get("\xa9nam", ["Unknown Title"])[0] if "\xa9nam" in mp4 else "Unknown Title"
return {
"artist": artist,
"title": title
}
except Exception as e:
print(f"⚠️ Warning: Could not extract ID3 tags from {mp4_path.name}: {e}")
return None
def scan_directory_for_mp4_files(self, directory_path: str) -> List[Path]:
"""Scan directory for MP4 files."""
directory = Path(directory_path)
if not directory.exists():
raise FileNotFoundError(f"Directory not found: {directory_path}")
if not directory.is_dir():
raise ValueError(f"Path is not a directory: {directory_path}")
mp4_files = list(directory.glob("*.mp4"))
if not mp4_files:
print(f"⚠️ No MP4 files found in {directory_path}")
return []
print(f"📁 Found {len(mp4_files)} MP4 files in {directory.name}")
return sorted(mp4_files)
def generate_songlist_from_directory(self, directory_path: str, append: bool = True) -> Dict[str, Any]:
"""Generate a song list from MP4 files in a directory."""
directory = Path(directory_path)
directory_name = directory.name
# Scan for MP4 files
mp4_files = self.scan_directory_for_mp4_files(directory_path)
if not mp4_files:
return {}
# Extract ID3 tags and create songs list
songs = []
for index, mp4_file in enumerate(mp4_files, start=1):
id3_tags = self.extract_id3_tags(mp4_file)
if id3_tags:
song = {
"position": index,
"title": id3_tags["title"],
"artist": id3_tags["artist"]
}
songs.append(song)
print(f" {index:3d}. {id3_tags['artist']} - {id3_tags['title']}")
if not songs:
print("❌ No valid ID3 tags found in any MP4 files")
return {}
# Create the song list entry
songlist_entry = {
"title": directory_name,
"songs": songs
}
# Handle appending to existing song list
if append:
existing_songlist = self.read_existing_songlist()
# Check if a playlist with this title already exists
existing_index = None
for i, entry in enumerate(existing_songlist):
if entry.get("title") == directory_name:
existing_index = i
break
if existing_index is not None:
# Replace existing entry
print(f"🔄 Replacing existing playlist: {directory_name}")
existing_songlist[existing_index] = songlist_entry
else:
# Add new entry to the beginning of the list
print(f" Adding new playlist: {directory_name}")
existing_songlist.insert(0, songlist_entry)
self.save_songlist(existing_songlist)
else:
# Create new song list with just this entry
print(f"📝 Creating new song list with playlist: {directory_name}")
self.save_songlist([songlist_entry])
return songlist_entry
def generate_songlist_from_multiple_directories(self, directory_paths: List[str], append: bool = True) -> List[Dict[str, Any]]:
"""Generate song lists from multiple directories."""
results = []
errors = []
# Read existing song list once at the beginning
existing_songlist = self.read_existing_songlist() if append else []
for directory_path in directory_paths:
try:
print(f"\n📂 Processing directory: {directory_path}")
directory = Path(directory_path)
directory_name = directory.name
# Scan for MP4 files
mp4_files = self.scan_directory_for_mp4_files(directory_path)
if not mp4_files:
continue
# Extract ID3 tags and create songs list
songs = []
for index, mp4_file in enumerate(mp4_files, start=1):
id3_tags = self.extract_id3_tags(mp4_file)
if id3_tags:
song = {
"position": index,
"title": id3_tags["title"],
"artist": id3_tags["artist"]
}
songs.append(song)
print(f" {index:3d}. {id3_tags['artist']} - {id3_tags['title']}")
if not songs:
print("❌ No valid ID3 tags found in any MP4 files")
continue
# Create the song list entry
songlist_entry = {
"title": directory_name,
"songs": songs
}
# Check if a playlist with this title already exists
existing_index = None
for i, entry in enumerate(existing_songlist):
if entry.get("title") == directory_name:
existing_index = i
break
if existing_index is not None:
# Replace existing entry
print(f"🔄 Replacing existing playlist: {directory_name}")
existing_songlist[existing_index] = songlist_entry
else:
# Add new entry to the beginning of the list
print(f" Adding new playlist: {directory_name}")
existing_songlist.insert(0, songlist_entry)
results.append(songlist_entry)
except Exception as e:
error_msg = f"Error processing {directory_path}: {e}"
print(f"{error_msg}")
errors.append(error_msg)
# Save the final song list
if results:
if append:
# Save the updated existing song list
self.save_songlist(existing_songlist)
else:
# Create new song list with just the results
self.save_songlist(results)
# If there were any errors, raise an exception
if errors:
raise Exception(f"Failed to process {len(errors)} directories: {'; '.join(errors)}")
return results
def main():
"""CLI entry point for song list generation."""
import argparse
import sys
parser = argparse.ArgumentParser(
description="Generate song lists from MP4 files with ID3 tags",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
python -m karaoke_downloader.songlist_generator /path/to/mp4/directory
python -m karaoke_downloader.songlist_generator /path/to/dir1 /path/to/dir2 --no-append
python -m karaoke_downloader.songlist_generator /path/to/dir --songlist-path custom_songlist.json
"""
)
parser.add_argument(
"directories",
nargs="+",
help="Directory paths containing MP4 files with ID3 tags"
)
parser.add_argument(
"--no-append",
action="store_true",
help="Create a new song list instead of appending to existing one"
)
parser.add_argument(
"--songlist-path",
default=None,
help="Path to the song list JSON file (default: songList.json in the data directory)"
)
args = parser.parse_args()
try:
generator = SongListGenerator(args.songlist_path)
generator.generate_songlist_from_multiple_directories(
args.directories,
append=not args.no_append
)
print("\n✅ Song list generation completed successfully!")
except Exception as e:
print(f"\n❌ Error: {e}")
sys.exit(1)
if __name__ == "__main__":
main()
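
Both `generate_songlist_from_directory` and `generate_songlist_from_multiple_directories` in the removed file repeat the same replace-or-prepend step. Factored out, it could look like the sketch below; `upsert_playlist` is a hypothetical helper, not part of the removed module.

```python
from typing import Any, Dict, List


def upsert_playlist(songlist: List[Dict[str, Any]], entry: Dict[str, Any]) -> str:
    """Replace the playlist with the same title in place, or prepend a new one.

    Mutates `songlist` and returns "replaced" or "added".
    """
    for i, existing in enumerate(songlist):
        if existing.get("title") == entry["title"]:
            songlist[i] = entry  # Keep the playlist's original position.
            return "replaced"
    songlist.insert(0, entry)    # New playlists go to the front of the list.
    return "added"
```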

View File

@ -7,7 +7,6 @@ import json
from datetime import datetime from datetime import datetime
from pathlib import Path from pathlib import Path
from karaoke_downloader.data_path_manager import get_data_path_manager
from karaoke_downloader.server_manager import ( from karaoke_downloader.server_manager import (
check_and_mark_server_duplicate, check_and_mark_server_duplicate,
is_song_marked_as_server_duplicate, is_song_marked_as_server_duplicate,
@ -17,9 +16,7 @@ from karaoke_downloader.server_manager import (
) )
def load_songlist(songlist_path=None): def load_songlist(songlist_path="data/songList.json"):
if songlist_path is None:
songlist_path = str(get_data_path_manager().get_songlist_path())
songlist_file = Path(songlist_path) songlist_file = Path(songlist_path)
if not songlist_file.exists(): if not songlist_file.exists():
print(f"⚠️ Songlist file not found: {songlist_path}") print(f"⚠️ Songlist file not found: {songlist_path}")
@ -58,9 +55,7 @@ def normalize_title(title):
return " ".join(normalized.split()).lower() return " ".join(normalized.split()).lower()
def load_songlist_tracking(tracking_path=None): def load_songlist_tracking(tracking_path="data/songlist_tracking.json"):
if tracking_path is None:
tracking_path = str(get_data_path_manager().get_songlist_tracking_path())
tracking_file = Path(tracking_path) tracking_file = Path(tracking_path)
if not tracking_file.exists(): if not tracking_file.exists():
return {} return {}
@ -72,9 +67,7 @@ def load_songlist_tracking(tracking_path=None):
return {} return {}
def save_songlist_tracking(tracking, tracking_path=None): def save_songlist_tracking(tracking, tracking_path="data/songlist_tracking.json"):
if tracking_path is None:
tracking_path = str(get_data_path_manager().get_songlist_tracking_path())
try: try:
with open(tracking_path, "w", encoding="utf-8") as f: with open(tracking_path, "w", encoding="utf-8") as f:
json.dump(tracking, f, indent=2, ensure_ascii=False) json.dump(tracking, f, indent=2, ensure_ascii=False)

View File

@ -1,12 +1,10 @@
import json import threading
import os
import re
from datetime import datetime, timedelta
from enum import Enum from enum import Enum
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple
from karaoke_downloader.data_path_manager import get_data_path_manager import json
from datetime import datetime
from pathlib import Path
class SongStatus(str, Enum): class SongStatus(str, Enum):
NOT_DOWNLOADED = "NOT_DOWNLOADED" NOT_DOWNLOADED = "NOT_DOWNLOADED"
@ -27,133 +25,46 @@ class FormatType(str, Enum):
class TrackingManager: class TrackingManager:
def __init__( def __init__(
self, self,
tracking_file=None, tracking_file="data/karaoke_tracking.json",
cache_dir=None, cache_file="data/channel_cache.json",
): ):
if tracking_file is None:
tracking_file = str(get_data_path_manager().get_karaoke_tracking_path())
if cache_dir is None:
cache_dir = str(get_data_path_manager().get_channel_cache_dir())
self.tracking_file = Path(tracking_file) self.tracking_file = Path(tracking_file)
self.cache_dir = Path(cache_dir) self.cache_file = Path(cache_file)
self.data = {"playlists": {}, "songs": {}}
# Ensure cache directory exists self.cache = {}
self.cache_dir.mkdir(parents=True, exist_ok=True) self._lock = threading.Lock()
self._load()
self.data = self._load() self._load_cache()
print(f"📊 Tracking manager initialized with {len(self.data.get('songs', {}))} tracked songs")
def _load(self): def _load(self):
"""Load tracking data from JSON file."""
if self.tracking_file.exists(): if self.tracking_file.exists():
try: try:
with open(self.tracking_file, "r", encoding="utf-8") as f: with open(self.tracking_file, "r", encoding="utf-8") as f:
return json.load(f) self.data = json.load(f)
except json.JSONDecodeError: except Exception:
print(f"⚠️ Corrupted tracking file, creating new one") self.data = {"playlists": {}, "songs": {}}
return {"songs": {}, "playlists": {}, "last_updated": datetime.now().isoformat()}
def _save(self): def _save(self):
"""Save tracking data to JSON file.""" with self._lock:
self.data["last_updated"] = datetime.now().isoformat()
self.tracking_file.parent.mkdir(parents=True, exist_ok=True)
with open(self.tracking_file, "w", encoding="utf-8") as f: with open(self.tracking_file, "w", encoding="utf-8") as f:
json.dump(self.data, f, indent=2, ensure_ascii=False) json.dump(self.data, f, indent=2, ensure_ascii=False)
def force_save(self): def force_save(self):
"""Force save the tracking data."""
self._save() self._save()
def _get_channel_cache_file(self, channel_id: str) -> Path: def _load_cache(self):
"""Get the cache file path for a specific channel.""" if self.cache_file.exists():
# Sanitize channel ID for filename
safe_channel_id = re.sub(r'[<>:"/\\|?*]', '_', channel_id)
return self.cache_dir / f"{safe_channel_id}.json"
def _load_channel_cache(self, channel_id: str) -> List[Dict[str, str]]:
"""Load cache for a specific channel."""
cache_file = self._get_channel_cache_file(channel_id)
if cache_file.exists():
try: try:
with open(cache_file, 'r', encoding='utf-8') as f: with open(self.cache_file, "r", encoding="utf-8") as f:
data = json.load(f) self.cache = json.load(f)
return data.get('videos', []) except Exception:
except (json.JSONDecodeError, KeyError): self.cache = {}
print(f" ⚠️ Corrupted cache file for {channel_id}, will recreate")
return []
return []
def _save_channel_cache(self, channel_id: str, videos: List[Dict[str, str]]): def save_cache(self):
"""Save cache for a specific channel.""" with open(self.cache_file, "w", encoding="utf-8") as f:
cache_file = self._get_channel_cache_file(channel_id) json.dump(self.cache, f, indent=2, ensure_ascii=False)
data = {
'channel_id': channel_id,
'videos': videos,
'last_updated': datetime.now().isoformat(),
'video_count': len(videos)
}
with open(cache_file, 'w', encoding='utf-8') as f:
json.dump(data, f, indent=2, ensure_ascii=False)
def _clear_channel_cache(self, channel_id: str):
"""Clear cache for a specific channel."""
cache_file = self._get_channel_cache_file(channel_id)
if cache_file.exists():
cache_file.unlink()
print(f" 🗑️ Cleared cache file: {cache_file.name}")
def get_cache_info(self):
"""Get information about all channel cache files."""
cache_files = list(self.cache_dir.glob("*.json"))
total_videos = 0
cache_info = []
for cache_file in cache_files:
try:
with open(cache_file, 'r', encoding='utf-8') as f:
data = json.load(f)
video_count = len(data.get('videos', []))
total_videos += video_count
last_updated = data.get('last_updated', 'Unknown')
cache_info.append({
'channel': data.get('channel_id', cache_file.stem),
'videos': video_count,
'last_updated': last_updated,
'file': cache_file.name
})
except Exception as e:
print(f"⚠️ Error reading cache file {cache_file.name}: {e}")
return {
'total_channels': len(cache_files),
'total_videos': total_videos,
'channels': cache_info
}
def clear_channel_cache(self, channel_id=None):
"""Clear cache for a specific channel or all channels."""
if channel_id:
self._clear_channel_cache(channel_id)
print(f"🗑️ Cleared cache for channel: {channel_id}")
else:
# Clear all cache files
cache_files = list(self.cache_dir.glob("*.json"))
for cache_file in cache_files:
cache_file.unlink()
print(f"🗑️ Cleared all {len(cache_files)} channel cache files")
def set_cache_duration(self, hours):
"""Placeholder for cache duration logic"""
pass
def export_playlist_report(self, playlist_id):
"""Export a report for a specific playlist."""
pass
def get_statistics(self): def get_statistics(self):
"""Get statistics about tracked songs."""
total_songs = len(self.data["songs"]) total_songs = len(self.data["songs"])
downloaded_songs = sum( downloaded_songs = sum(
1 1
@ -191,13 +102,11 @@ class TrackingManager:
} }
def get_playlist_songs(self, playlist_id): def get_playlist_songs(self, playlist_id):
"""Get songs for a specific playlist."""
return [ return [
s for s in self.data["songs"].values() if s["playlist_id"] == playlist_id s for s in self.data["songs"].values() if s["playlist_id"] == playlist_id
] ]
def get_failed_songs(self, playlist_id=None): def get_failed_songs(self, playlist_id=None):
"""Get failed songs, optionally filtered by playlist."""
if playlist_id: if playlist_id:
return [ return [
s s
@ -209,7 +118,6 @@ class TrackingManager:
] ]
def get_partial_downloads(self, playlist_id=None): def get_partial_downloads(self, playlist_id=None):
"""Get partial downloads, optionally filtered by playlist."""
if playlist_id: if playlist_id:
return [ return [
s s
@@ -221,7 +129,7 @@ class TrackingManager:
         ]
     def cleanup_orphaned_files(self, downloads_dir):
-        """Remove tracking entries for files that no longer exist."""
+        # Remove tracking entries for files that no longer exist
         orphaned = []
         for song_id, song in list(self.data["songs"].items()):
             file_path = song.get("file_path")
@@ -231,17 +139,51 @@ class TrackingManager:
         self.force_save()
         return orphaned
+    def get_cache_info(self):
+        total_channels = len(self.cache)
+        total_cached_videos = sum(len(v) for v in self.cache.values())
+        cache_duration_hours = 24 # default
+        last_updated = None
+        return {
+            "total_channels": total_channels,
+            "total_cached_videos": total_cached_videos,
+            "cache_duration_hours": cache_duration_hours,
+            "last_updated": last_updated,
+        }
+    def clear_channel_cache(self, channel_id=None):
+        if channel_id is None or channel_id == "all":
+            self.cache = {}
+        else:
+            self.cache.pop(channel_id, None)
+        self.save_cache()
+    def set_cache_duration(self, hours):
+        # Placeholder for cache duration logic
+        pass
+    def export_playlist_report(self, playlist_id):
+        playlist = self.data["playlists"].get(playlist_id)
+        if not playlist:
+            return f"Playlist '{playlist_id}' not found."
+        songs = self.get_playlist_songs(playlist_id)
+        report = {"playlist": playlist, "songs": songs}
+        return json.dumps(report, indent=2, ensure_ascii=False)
     def is_song_downloaded(self, artist, title, channel_name=None, video_id=None):
         """
-        Check if a song has already been downloaded.
-        Returns True if the song exists in tracking with DOWNLOADED status.
+        Check if a song has already been downloaded by this system.
+        Returns True if the song exists in tracking with DOWNLOADED or CONVERTED status.
         """
         # If we have video_id and channel_name, try direct key lookup first (most efficient)
         if video_id and channel_name:
             song_key = f"{video_id}@{channel_name}"
             if song_key in self.data["songs"]:
                 song_data = self.data["songs"][song_key]
-                if song_data.get("status") == SongStatus.DOWNLOADED:
+                if song_data.get("status") in [
+                    SongStatus.DOWNLOADED,
+                    SongStatus.CONVERTED,
+                ]:
                     return True
         # Fallback to content search (for cases where we don't have video_id)
@@ -249,14 +191,19 @@ class TrackingManager:
             # Check if this song matches the artist and title
             if song_data.get("artist") == artist and song_data.get("title") == title:
                 # Check if it's marked as downloaded
-                if song_data.get("status") == SongStatus.DOWNLOADED:
+                if song_data.get("status") in [
+                    SongStatus.DOWNLOADED,
+                    SongStatus.CONVERTED,
+                ]:
                     return True
             # Also check the video title field which might contain the song info
             video_title = song_data.get("video_title", "")
             if video_title and artist in video_title and title in video_title:
-                if song_data.get("status") == SongStatus.DOWNLOADED:
+                if song_data.get("status") in [
+                    SongStatus.DOWNLOADED,
+                    SongStatus.CONVERTED,
+                ]:
                     return True
         return False
     def is_file_exists(self, file_path):
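The hunk above repeats the same "DOWNLOADED or CONVERTED" membership test three times. As a rough sketch only (not part of this commit), the check could be factored into one helper. The `SongStatus` class below is a hypothetical stand-in with string constants, and `COMPLETED_STATUSES` and `is_completed` are illustrative names; the project's real `SongStatus` may be defined differently.

```python
class SongStatus:
    """Hypothetical stand-in for the project's SongStatus constants."""
    DOWNLOADED = "DOWNLOADED"
    CONVERTED = "CONVERTED"
    FAILED = "FAILED"


# Statuses that count as "already downloaded" after this change.
COMPLETED_STATUSES = frozenset({SongStatus.DOWNLOADED, SongStatus.CONVERTED})


def is_completed(song_data: dict) -> bool:
    """Return True when a tracked entry carries a completed status."""
    # A missing "status" key yields None, which is never in the set.
    return song_data.get("status") in COMPLETED_STATUSES
```

Each of the three branches in `is_song_downloaded` could then call `is_completed(song_data)`, so adding another completed status later would be a one-line change.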
@ -336,359 +283,114 @@ class TrackingManager:
self._save() self._save()
def get_channel_video_list( def get_channel_video_list(
self, channel_url, yt_dlp_path="downloader/yt-dlp.exe", force_refresh=False, show_pagination=False self, channel_url, yt_dlp_path=None, force_refresh=False
): ):
""" """
Return a list of videos (dicts with 'title' and 'id') for the channel, using cache if available unless force_refresh is True. Return a list of videos (dicts with 'title' and 'id') for the channel, using cache if available unless force_refresh is True.
Args:
channel_url: YouTube channel URL
yt_dlp_path: Path to yt-dlp executable
force_refresh: Force refresh cache even if available
show_pagination: Show page-by-page progress (slower but more detailed)
""" """
# Use platform-aware path if none provided
if yt_dlp_path is None:
from karaoke_downloader.config_manager import load_config
config = load_config()
yt_dlp_path = config.yt_dlp_path
channel_name, channel_id = None, None channel_name, channel_id = None, None
# Check if this is a manual channel
from karaoke_downloader.manual_video_manager import is_manual_channel, get_manual_channel_info, get_manual_videos_for_channel
if is_manual_channel(channel_url):
channel_name, channel_id = get_manual_channel_info(channel_url)
if channel_name and channel_id:
print(f" 📋 Loading manual videos for {channel_name}")
manual_videos = get_manual_videos_for_channel(channel_name)
# Convert to the expected format
videos = []
for video in manual_videos:
videos.append({
"title": video.get("title", ""),
"id": video.get("id", ""),
"url": video.get("url", "")
})
print(f" ✅ Loaded {len(videos)} manual videos")
return videos
else:
print(f" ❌ Could not get manual channel info for: {channel_url}")
return []
# Regular YouTube channel processing
from karaoke_downloader.youtube_utils import get_channel_info from karaoke_downloader.youtube_utils import get_channel_info
channel_name, channel_id = get_channel_info(channel_url) channel_name, channel_id = get_channel_info(channel_url)
if not channel_id: # Check if cache has the old flat structure or new nested structure
print(f" ❌ Could not extract channel ID from URL: {channel_url}") cache_data = None
return [] cache_key = None
print(f" 🔍 Channel: {channel_name} (ID: {channel_id})") # Try nested structure first (new format)
if "channels" in self.cache:
# Check if we have cached data for this channel # Try multiple possible cache keys in nested structure
if not force_refresh: possible_keys = [
cached_videos = self._load_channel_cache(channel_id) channel_id, # The extracted channel ID
if cached_videos: channel_url, # The full URL
# Validate that the cached data has proper video IDs channel_name, # The extracted channel name
corrupted = False
# Check if any video IDs look like titles instead of proper YouTube IDs
for video in cached_videos[:20]: # Check first 20 videos
video_id = video.get("id", "")
# More comprehensive validation - YouTube IDs should be 11 characters and contain only alphanumeric, hyphens, and underscores
if video_id and (
len(video_id) != 11 or
not video_id.replace('-', '').replace('_', '').isalnum() or
" " in video_id or
"Lyrics" in video_id or
"KARAOKE" in video_id.upper() or
"Vocal" in video_id or
"Guide" in video_id
):
print(f" ⚠️ Detected corrupted video ID in cache: '{video_id}'")
corrupted = True
break
if corrupted:
print(f" 🧹 Clearing corrupted cache for {channel_id}")
self._clear_channel_cache(channel_id)
force_refresh = True
else:
print(f" 📋 Using cached video list ({len(cached_videos)} videos)")
return cached_videos
# Choose fetch method based on show_pagination flag
if show_pagination:
return self._fetch_videos_with_pagination(channel_url, channel_id, yt_dlp_path)
else:
return self._fetch_videos_flat_playlist(channel_url, channel_id, yt_dlp_path)
def _fetch_videos_with_pagination(self, channel_url, channel_id, yt_dlp_path):
"""Fetch videos showing page-by-page progress."""
print(f" 🌐 Fetching video list from YouTube (page-by-page mode)...")
print(f" 📡 Channel URL: {channel_url}")
import subprocess
all_videos = []
page = 1
videos_per_page = 200 # YouTube/yt-dlp supports up to 200 videos per page, reducing API calls and errors
while True:
print(f" 📄 Fetching page {page}...")
# Fetch one page at a time
cmd = [
yt_dlp_path,
"--flat-playlist",
"--print",
"%(title)s|%(id)s|%(url)s",
"--playlist-start",
str((page - 1) * videos_per_page + 1),
"--playlist-end",
str(page * videos_per_page),
channel_url,
] ]
try: for key in possible_keys:
# Increased timeout to 180 seconds for larger pages (200 videos) if key and key in self.cache["channels"]:
result = subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=180) cache_data = self.cache["channels"][key]["videos"]
lines = result.stdout.strip().splitlines() cache_key = key
# Save raw output for debugging (for each page)
raw_output_file = self._get_channel_cache_file(channel_id).parent / f"{channel_id}_raw_output_page{page}.txt"
try:
with open(raw_output_file, 'w', encoding='utf-8') as f:
f.write(f"# Raw yt-dlp output for {channel_id} - Page {page}\n")
f.write(f"# Channel URL: {channel_url}\n")
f.write(f"# Command: {' '.join(cmd)}\n")
f.write(f"# Timestamp: {datetime.now().isoformat()}\n")
f.write(f"# Total lines: {len(lines)}\n")
f.write("#" * 80 + "\n\n")
for i, line in enumerate(lines, 1):
f.write(f"{i:6d}: {line}\n")
print(f" 💾 Saved raw output to: {raw_output_file.name}")
except Exception as e:
print(f" ⚠️ Could not save raw output: {e}")
if not lines:
print(f" ✅ No more videos found on page {page}")
break break
print(f" 📊 Page {page}: Found {len(lines)} videos") # Try flat structure (old format) as fallback
if cache_data is None:
possible_keys = [
channel_id, # The extracted channel ID
channel_url, # The full URL
channel_name, # The extracted channel name
]
page_videos = [] for key in possible_keys:
invalid_count = 0 if key and key in self.cache:
cache_data = self.cache[key]
cache_key = key
break
for line in lines: if not cache_key:
if not line.strip(): cache_key = channel_id or channel_url # Use as fallback for new entries
continue
# More robust parsing that handles titles with | characters print(f" 🔍 Trying cache keys: {possible_keys}")
# Extract video ID directly from the URL that yt-dlp provides print(f" 🔍 Selected cache key: '{cache_key}'")
# Find the URL and extract video ID from it if not force_refresh and cache_data is not None:
url_match = re.search(r'https://www\.youtube\.com/watch\?v=([a-zA-Z0-9_-]{11})', line) print(
if not url_match: f" 📋 Using cached video list ({len(cache_data)} videos)"
continue )
# Convert old cache format to new format if needed
# Extract video ID directly from the URL converted_videos = []
video_id = url_match.group(1) for video in cache_data:
if "video_id" in video and "id" not in video:
# Extract title (everything before the video ID in the line) # Convert old format to new format
title = line[:line.find(video_id)].rstrip('|').strip() converted_videos.append({
"title": video["title"],
# Validate video ID "id": video["video_id"]
if video_id and ( })
len(video_id) == 11 and
video_id.replace('-', '').replace('_', '').isalnum() and
" " not in video_id and
"Lyrics" not in video_id and
"KARAOKE" not in video_id.upper() and
"Vocal" not in video_id and
"Guide" not in video_id
):
page_videos.append({"title": title, "id": video_id})
else: else:
invalid_count += 1 # Already in new format
if invalid_count <= 3: # Show first 3 invalid IDs per page converted_videos.append(video)
print(f" ⚠️ Invalid ID: '{video_id}' for '{title[:50]}...'") return converted_videos
else:
if invalid_count > 3: print(f" ❌ Cache miss for all keys")
print(f" ⚠️ ... and {invalid_count - 3} more invalid IDs on this page")
all_videos.extend(page_videos)
print(f" ✅ Page {page}: Added {len(page_videos)} valid videos (total: {len(all_videos)})")
# If we got fewer videos than expected, we're probably at the end
if len(lines) < videos_per_page:
print(f" 🏁 Reached end of channel (last page had {len(lines)} videos)")
break
page += 1
# Safety check to prevent infinite loops
if page > 50: # Max 50 pages (10,000 videos with 200 per page)
print(f" ⚠️ Reached maximum page limit (50 pages), stopping")
break
except subprocess.TimeoutExpired:
print(f" ⚠️ Page {page} timed out, stopping")
break
except subprocess.CalledProcessError as e:
print(f" ❌ Error fetching page {page}: {e}")
break
except KeyboardInterrupt:
print(f" ⏹️ User interrupted, stopping at page {page}")
break
if not all_videos:
print(f" ❌ No valid videos found")
return []
print(f" 🎉 Channel download complete!")
print(f" 📊 Total videos fetched: {len(all_videos)}")
# Save to individual channel cache file
self._save_channel_cache(channel_id, all_videos)
print(f" 💾 Saved cache to: {self._get_channel_cache_file(channel_id).name}")
return all_videos
def _fetch_videos_flat_playlist(self, channel_url, channel_id, yt_dlp_path):
"""Fetch all videos using flat playlist (faster but less detailed progress)."""
# Fetch with yt-dlp # Fetch with yt-dlp
print(f" 🌐 Fetching video list from YouTube (this may take a while)...") print(f" 🌐 Fetching video list from YouTube (this may take a while)...")
print(f" 📡 Channel URL: {channel_url}")
import subprocess import subprocess
from karaoke_downloader.youtube_utils import _parse_yt_dlp_command from karaoke_downloader.youtube_utils import _parse_yt_dlp_command
# First, let's get the total count to show progress
count_cmd = _parse_yt_dlp_command(yt_dlp_path) + [
"--flat-playlist",
"--print",
"%(title)s",
"--playlist-end",
"1", # Just get first video to test
channel_url,
]
try:
print(f" 🔍 Testing channel access...")
test_result = subprocess.run(count_cmd, capture_output=True, text=True, timeout=30)
if test_result.returncode == 0:
print(f" ✅ Channel is accessible")
else:
print(f" ⚠️ Channel test failed: {test_result.stderr}")
except subprocess.TimeoutExpired:
print(f" ⚠️ Channel test timed out")
except Exception as e:
print(f" ⚠️ Channel test error: {e}")
# Now fetch all videos with progress indicators
cmd = _parse_yt_dlp_command(yt_dlp_path) + [ cmd = _parse_yt_dlp_command(yt_dlp_path) + [
"--flat-playlist", "--flat-playlist",
"--print", "--print",
"%(title)s|%(id)s|%(url)s", "%(title)s|%(id)s|%(url)s",
"--verbose", # Add verbose output to see what's happening
channel_url, channel_url,
] ]
try: try:
print(f" 🔧 Running yt-dlp command: {' '.join(cmd)}") result = subprocess.run(cmd, capture_output=True, text=True, check=True)
print(f" 📥 Starting video list download...")
# Use a timeout and show progress
result = subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=300)
lines = result.stdout.strip().splitlines() lines = result.stdout.strip().splitlines()
# Save raw output for debugging
raw_output_file = self._get_channel_cache_file(channel_id).parent / f"{channel_id}_raw_output.txt"
try:
with open(raw_output_file, 'w', encoding='utf-8') as f:
f.write(f"# Raw yt-dlp output for {channel_id}\n")
f.write(f"# Channel URL: {channel_url}\n")
f.write(f"# Command: {' '.join(cmd)}\n")
f.write(f"# Timestamp: {datetime.now().isoformat()}\n")
f.write(f"# Total lines: {len(lines)}\n")
f.write("#" * 80 + "\n\n")
for i, line in enumerate(lines, 1):
f.write(f"{i:6d}: {line}\n")
print(f" 💾 Saved raw output to: {raw_output_file.name}")
except Exception as e:
print(f" ⚠️ Could not save raw output: {e}")
print(f" 📄 Raw output lines: {len(lines)}")
print(f" 📊 Download completed successfully!")
# Show some sample lines to understand the format
if lines:
print(f" 📋 Sample output format:")
for i, line in enumerate(lines[:3]):
print(f" Line {i+1}: {line[:100]}...")
if len(lines) > 3:
print(f" ... and {len(lines) - 3} more lines")
videos = [] videos = []
invalid_count = 0 for line in lines:
parts = line.split("|")
print(f" 🔍 Processing {len(lines)} video entries...") if len(parts) >= 2:
title, video_id = parts[0].strip(), parts[1].strip()
for i, line in enumerate(lines):
if i % 1000 == 0 and i > 0: # Progress indicator every 1000 lines
print(f" 📊 Processing line {i}/{len(lines)}... ({i/len(lines)*100:.1f}%)")
# More robust parsing that handles titles with | characters
# Extract video ID directly from the URL that yt-dlp provides
# Find the URL and extract video ID from it
url_match = re.search(r'https://www\.youtube\.com/watch\?v=([a-zA-Z0-9_-]{11})', line)
if not url_match:
invalid_count += 1
if invalid_count <= 5:
print(f" ⚠️ Skipping line with no URL: '{line[:100]}...'")
elif invalid_count == 6:
print(f" ⚠️ ... and {len(lines) - i - 1} more invalid lines")
continue
# Extract video ID directly from the URL
video_id = url_match.group(1)
# Extract title (everything before the video ID in the line)
title = line[:line.find(video_id)].rstrip('|').strip()
# Validate video ID
if video_id and (
len(video_id) == 11 and
video_id.replace('-', '').replace('_', '').isalnum() and
" " not in video_id and
"Lyrics" not in video_id and
"KARAOKE" not in video_id.upper() and
"Vocal" not in video_id and
"Guide" not in video_id
):
videos.append({"title": title, "id": video_id}) videos.append({"title": title, "id": video_id})
else:
invalid_count += 1
if invalid_count <= 5: # Only show first 5 invalid IDs
print(f" ⚠️ Skipping invalid video ID: '{video_id}' for title: '{title[:50]}...'")
elif invalid_count == 6:
print(f" ⚠️ ... and {len(lines) - i - 1} more invalid IDs")
if not videos: # Save in nested structure format
print(f" ❌ No valid videos found after parsing") if "channels" not in self.cache:
return [] self.cache["channels"] = {}
print(f" ✅ Parsed {len(videos)} valid videos from YouTube") self.cache["channels"][cache_key] = {
print(f" ⚠️ Skipped {invalid_count} invalid video IDs") "videos": videos,
"last_updated": datetime.now().isoformat(),
# Save to individual channel cache file "channel_name": channel_name,
self._save_channel_cache(channel_id, videos) "channel_id": channel_id
print(f" 💾 Saved cache to: {self._get_channel_cache_file(channel_id).name}") }
self.save_cache()
return videos return videos
except subprocess.TimeoutExpired:
print(f"❌ yt-dlp timed out after 5 minutes - channel may be too large")
return []
except subprocess.CalledProcessError as e: except subprocess.CalledProcessError as e:
print(f"❌ yt-dlp failed to fetch playlist for cache: {e}") print(f"❌ yt-dlp failed to fetch playlist for cache: {e}")
print(f" 📄 stderr: {e.stderr}")
return [] return []
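The new side of this hunk parses each `%(title)s|%(id)s|%(url)s` line with `line.split("|")`, which mis-assigns fields when a title itself contains `|` (the removed code worked around this with a URL regex). A minimal sketch of one alternative, assuming the id and URL fields never contain `|`, is to split from the right. `parse_flat_playlist_line` is a hypothetical helper name, not something in the repository.

```python
from typing import Dict, Optional


def parse_flat_playlist_line(line: str) -> Optional[Dict[str, str]]:
    """Parse one 'title|id|url' line; the title may itself contain '|'."""
    # Split from the right: under the stated assumption, only the last two
    # separators are field boundaries, so extra '|' stay in the title.
    parts = line.rsplit("|", 2)
    if len(parts) != 3:
        return None
    title, video_id, url = (part.strip() for part in parts)
    if not video_id:
        return None
    return {"title": title, "id": video_id, "url": url}
```

Lines that do not have at least two separators are rejected rather than guessed at, which keeps malformed entries out of the cache.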


@@ -107,10 +107,6 @@ def download_single_video(
     video_url = f"https://www.youtube.com/watch?v={video_id}"
-    # Debug: Show the video_id and URL being used
-    print(f"🔍 DEBUG: video_id = '{video_id}'")
-    print(f"🔍 DEBUG: video_url = '{video_url}'")
     # Build command using centralized utility
     cmd = build_yt_dlp_command(yt_dlp_path, video_url, output_path, config)
@@ -259,7 +255,7 @@ def execute_download_plan(
         video_id = item["video_id"]
         video_title = item["video_title"]
-        print(f"\n⬇️ Downloading {downloaded_count + 1} of {total_to_download}:")
+        print(f"\n⬇️ Downloading {len(download_plan) - idx} of {total_to_download}:")
         print(f" 📋 Songlist: {artist} - {title}")
         print(f" 🎬 Video: {video_title} ({channel_name})")
         if "match_score" in item:


@@ -23,9 +23,15 @@ def _parse_yt_dlp_command(yt_dlp_path: str) -> List[str]:
 def get_channel_info(
-    channel_url: str, yt_dlp_path: str = "downloader/yt-dlp.exe"
+    channel_url: str, yt_dlp_path: str = None
 ) -> tuple[str, str]:
     """Get channel information using yt-dlp. Returns (channel_name, channel_id)."""
+    # Use platform-aware path if none provided
+    if yt_dlp_path is None:
+        from karaoke_downloader.config_manager import load_config
+        config = load_config()
+        yt_dlp_path = config.yt_dlp_path
     try:
         # Extract channel name from URL for now (faster than calling yt-dlp)
         if "/@" in channel_url:
@@ -52,9 +58,15 @@ def get_channel_info(
 def get_playlist_info(
-    playlist_url: str, yt_dlp_path: str = "downloader/yt-dlp.exe"
+    playlist_url: str, yt_dlp_path: str = None
 ) -> List[Dict[str, Any]]:
     """Get playlist information using yt-dlp."""
+    # Use platform-aware path if none provided
+    if yt_dlp_path is None:
+        from karaoke_downloader.config_manager import load_config
+        config = load_config()
+        yt_dlp_path = config.yt_dlp_path
     try:
         cmd = _parse_yt_dlp_command(yt_dlp_path) + ["--dump-json", "--flat-playlist", playlist_url]
         result = subprocess.run(cmd, capture_output=True, text=True, check=True)
@@ -88,7 +100,7 @@ def build_yt_dlp_command(
     Returns:
         List of command arguments for subprocess.run
     """
-    cmd = _parse_yt_dlp_command(yt_dlp_path) + [
+    cmd = _parse_yt_dlp_command(str(yt_dlp_path)) + [
         "--no-check-certificates",
         "--ignore-errors",
         "--no-warnings",
@@ -129,7 +141,7 @@ def execute_yt_dlp_command(
 def show_available_formats(
-    video_url: str, yt_dlp_path: str = "downloader/yt-dlp.exe", timeout: int = 30
+    video_url: str, yt_dlp_path: str = None, timeout: int = 30
 ) -> None:
     """
     Show available formats for a video (debugging utility).
@@ -139,8 +151,14 @@ def show_available_formats(
         yt_dlp_path: Path to yt-dlp executable
         timeout: Timeout in seconds
     """
+    # Use platform-aware path if none provided
+    if yt_dlp_path is None:
+        from karaoke_downloader.config_manager import load_config
+        config = load_config()
+        yt_dlp_path = config.yt_dlp_path
     print(f"🔍 Checking available formats for: {video_url}")
-    format_cmd = _parse_yt_dlp_command(yt_dlp_path) + ["--list-formats", video_url]
+    format_cmd = _parse_yt_dlp_command(str(yt_dlp_path)) + ["--list-formats", video_url]
     try:
         format_result = subprocess.run(
             format_cmd, capture_output=True, text=True, timeout=timeout
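Three functions in this file now repeat the same "if `yt_dlp_path` is None, load it from config" block. The sketch below shows how that fallback might be centralized. It is an illustration under assumptions, not the commit's approach: the commit reads `config.yt_dlp_path`, whereas this hypothetical `resolve_yt_dlp_path` helper derives a default from the platform, using the `downloader/` directory and binary names that appear elsewhere in this diff.

```python
import platform
from pathlib import Path
from typing import Optional, Union

# Binary names per platform, as listed in setup_platform.py in this diff.
_BINARY_BY_SYSTEM = {
    "windows": "yt-dlp.exe",
    "darwin": "yt-dlp_macos",
    "linux": "yt-dlp",
}


def resolve_yt_dlp_path(
    yt_dlp_path: Optional[Union[str, Path]] = None,
    system: Optional[str] = None,
) -> str:
    """Return an explicit path unchanged, else a platform-specific default."""
    if yt_dlp_path is not None:
        return str(yt_dlp_path)
    # `system` is injectable for testing; by default ask the running OS.
    system_name = (system or platform.system()).lower()
    binary = _BINARY_BY_SYSTEM.get(system_name, "yt-dlp")
    return (Path("downloader") / binary).as_posix()
```

Returning a `str` also mirrors the `str(yt_dlp_path)` conversions added in `build_yt_dlp_command` and `show_available_formats`, so callers can pass either a `Path` or a string.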


@ -1,220 +0,0 @@
#!/usr/bin/env python3
"""
macOS setup script for Karaoke Video Downloader.
This script helps users set up yt-dlp and FFmpeg on macOS.
"""
import os
import sys
import subprocess
from pathlib import Path
def check_ffmpeg():
"""Check if FFmpeg is installed."""
try:
result = subprocess.run(["ffmpeg", "-version"], capture_output=True, text=True, timeout=10)
return result.returncode == 0
except (subprocess.TimeoutExpired, FileNotFoundError):
return False
def check_yt_dlp():
"""Check if yt-dlp is installed via pip or binary."""
# Check pip installation
try:
result = subprocess.run([sys.executable, "-m", "yt_dlp", "--version"],
capture_output=True, text=True, timeout=10)
if result.returncode == 0:
return True
except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
pass
# Check binary file
binary_path = Path("downloader/yt-dlp_macos")
if binary_path.exists():
try:
result = subprocess.run([str(binary_path), "--version"],
capture_output=True, text=True, timeout=10)
return result.returncode == 0
except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
pass
return False
def install_ffmpeg():
"""Install FFmpeg via Homebrew."""
print("🎬 Installing FFmpeg...")
# Check if Homebrew is installed
try:
subprocess.run(["brew", "--version"], capture_output=True, check=True)
except (subprocess.CalledProcessError, FileNotFoundError):
print("❌ Homebrew is not installed. Please install Homebrew first:")
print(" /bin/bash -c \"$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)\"")
return False
try:
print("🍺 Installing FFmpeg via Homebrew...")
result = subprocess.run(["brew", "install", "ffmpeg"],
capture_output=True, text=True, check=True)
print("✅ FFmpeg installed successfully!")
return True
except subprocess.CalledProcessError as e:
print(f"❌ Failed to install FFmpeg: {e}")
return False
def download_yt_dlp_binary():
"""Download yt-dlp binary for macOS."""
print("📥 Downloading yt-dlp binary for macOS...")
# Create downloader directory if it doesn't exist
downloader_dir = Path("downloader")
downloader_dir.mkdir(exist_ok=True)
# Download yt-dlp binary
binary_path = downloader_dir / "yt-dlp_macos"
url = "https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp_macos"
try:
print(f"📡 Downloading from: {url}")
result = subprocess.run(["curl", "-L", "-o", str(binary_path), url],
capture_output=True, text=True, check=True)
# Make it executable
binary_path.chmod(0o755)
print(f"✅ yt-dlp binary downloaded to: {binary_path}")
# Test the binary
test_result = subprocess.run([str(binary_path), "--version"],
capture_output=True, text=True, timeout=10)
if test_result.returncode == 0:
version = test_result.stdout.strip()
print(f"✅ Binary test successful! Version: {version}")
return True
else:
print(f"❌ Binary test failed: {test_result.stderr}")
return False
except subprocess.CalledProcessError as e:
print(f"❌ Failed to download yt-dlp binary: {e}")
return False
except Exception as e:
print(f"❌ Error downloading binary: {e}")
return False
def install_yt_dlp():
"""Install yt-dlp via pip."""
print("📦 Installing yt-dlp...")
try:
result = subprocess.run([sys.executable, "-m", "pip", "install", "yt-dlp"],
capture_output=True, text=True, check=True)
print("✅ yt-dlp installed successfully!")
return True
except subprocess.CalledProcessError as e:
print(f"❌ Failed to install yt-dlp: {e}")
return False
def test_installation():
"""Test the installation."""
print("\n🧪 Testing installation...")
# Test FFmpeg
if check_ffmpeg():
print("✅ FFmpeg is working!")
else:
print("❌ FFmpeg is not working")
return False
# Test yt-dlp
if check_yt_dlp():
print("✅ yt-dlp is working!")
else:
print("❌ yt-dlp is not working")
return False
return True
def main():
print("🍎 macOS Setup for Karaoke Video Downloader")
print("=" * 50)
# Check current status
print("🔍 Checking current installation...")
ffmpeg_installed = check_ffmpeg()
yt_dlp_installed = check_yt_dlp()
print(f"FFmpeg: {'✅ Installed' if ffmpeg_installed else '❌ Not installed'}")
print(f"yt-dlp: {'✅ Installed' if yt_dlp_installed else '❌ Not installed'}")
if ffmpeg_installed and yt_dlp_installed:
print("\n🎉 Everything is already installed and working!")
return
# Install missing components
print("\n🚀 Installing missing components...")
# Install FFmpeg if needed
if not ffmpeg_installed:
print("\n🎬 FFmpeg Installation Options:")
print("1. Install via Homebrew (recommended)")
print("2. Download from ffmpeg.org")
print("3. Skip FFmpeg installation")
choice = input("\nChoose an option (1-3): ").strip()
if choice == "1":
if not install_ffmpeg():
print("❌ FFmpeg installation failed")
return
elif choice == "2":
print("📥 Please download FFmpeg from: https://ffmpeg.org/download.html")
print(" Extract and add to your PATH, then run this script again.")
return
elif choice == "3":
print("⚠️ FFmpeg is required for video processing. Some features may not work.")
else:
print("❌ Invalid choice")
return
# Install yt-dlp if needed
if not yt_dlp_installed:
print("\n📦 yt-dlp Installation Options:")
print("1. Install via pip (recommended)")
print("2. Download binary file")
print("3. Skip yt-dlp installation")
choice = input("\nChoose an option (1-3): ").strip()
if choice == "1":
if not install_yt_dlp():
print("❌ yt-dlp installation failed")
return
elif choice == "2":
if not download_yt_dlp_binary():
print("❌ yt-dlp binary download failed")
return
elif choice == "3":
print("❌ yt-dlp is required for video downloading.")
return
else:
print("❌ Invalid choice")
return
# Test installation
if test_installation():
print("\n🎉 Setup completed successfully!")
print("You can now use the Karaoke Video Downloader on macOS.")
print("Run: python download_karaoke.py --help")
else:
print("\n❌ Setup failed. Please check the error messages above.")
if __name__ == "__main__":
main()

288
setup_platform.py Normal file

@ -0,0 +1,288 @@
#!/usr/bin/env python3
"""
Platform setup script for Karaoke Video Downloader.
This script helps users download the correct yt-dlp binary for their platform.
"""
import os
import platform
import sys
import urllib.request
import zipfile
import tarfile
from pathlib import Path
def detect_platform():
"""Detect the current platform and return platform info."""
system = platform.system().lower()
machine = platform.machine().lower()
if system == "windows":
return "windows", "yt-dlp.exe"
elif system == "darwin":
return "macos", "yt-dlp_macos"
elif system == "linux":
return "linux", "yt-dlp"
else:
return "unknown", "yt-dlp"
def get_download_url(platform_name):
"""Get the download URL for yt-dlp based on platform."""
base_url = "https://github.com/yt-dlp/yt-dlp/releases/latest/download"
if platform_name == "windows":
return f"{base_url}/yt-dlp.exe"
elif platform_name == "macos":
return f"{base_url}/yt-dlp_macos"
elif platform_name == "linux":
return f"{base_url}/yt-dlp"
else:
raise ValueError(f"Unsupported platform: {platform_name}")
def install_via_pip():
    """Install yt-dlp via pip."""
    # Import before the try block so the except clause can always resolve
    # subprocess.CalledProcessError.
    import subprocess
    print("📦 Installing yt-dlp via pip...")
    try:
        subprocess.run([sys.executable, "-m", "pip", "install", "yt-dlp"],
                       capture_output=True, text=True, check=True)
        print("✅ yt-dlp installed successfully via pip!")
        return True
    except subprocess.CalledProcessError as e:
        print(f"❌ Failed to install yt-dlp via pip: {e}")
        return False
def check_ffmpeg():
    """Check if FFmpeg is installed and available."""
    # Import before the try block so the except clause can always resolve
    # subprocess.TimeoutExpired.
    import subprocess
    try:
        result = subprocess.run(["ffmpeg", "-version"], capture_output=True, text=True, timeout=10)
        return result.returncode == 0
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
def install_ffmpeg():
"""Install FFmpeg based on platform."""
import subprocess
platform_name, _ = detect_platform()
print("🎬 Installing FFmpeg...")
if platform_name == "macos":
# Try using Homebrew first
try:
print("🍺 Attempting to install FFmpeg via Homebrew...")
result = subprocess.run(["brew", "install", "ffmpeg"],
capture_output=True, text=True, check=True)
print("✅ FFmpeg installed successfully via Homebrew!")
return True
except (subprocess.CalledProcessError, FileNotFoundError):
print("⚠️ Homebrew not found or failed. Trying alternative methods...")
# Try using MacPorts
try:
print("🍎 Attempting to install FFmpeg via MacPorts...")
result = subprocess.run(["sudo", "port", "install", "ffmpeg"],
capture_output=True, text=True, check=True)
print("✅ FFmpeg installed successfully via MacPorts!")
return True
except (subprocess.CalledProcessError, FileNotFoundError):
print("❌ Could not install FFmpeg automatically.")
print("Please install FFmpeg manually:")
print("1. Install Homebrew: https://brew.sh/")
print("2. Run: brew install ffmpeg")
print("3. Or download from: https://ffmpeg.org/download.html")
return False
elif platform_name == "linux":
try:
print("🐧 Attempting to install FFmpeg via package manager...")
# Try apt (Ubuntu/Debian)
try:
result = subprocess.run(["sudo", "apt", "update"], capture_output=True, text=True, check=True)
result = subprocess.run(["sudo", "apt", "install", "-y", "ffmpeg"],
capture_output=True, text=True, check=True)
print("✅ FFmpeg installed successfully via apt!")
return True
except subprocess.CalledProcessError:
# Try yum (CentOS/RHEL)
try:
result = subprocess.run(["sudo", "yum", "install", "-y", "ffmpeg"],
capture_output=True, text=True, check=True)
print("✅ FFmpeg installed successfully via yum!")
return True
except subprocess.CalledProcessError:
print("❌ Could not install FFmpeg automatically.")
print("Please install FFmpeg manually for your Linux distribution.")
return False
except FileNotFoundError:
print("❌ Could not install FFmpeg automatically.")
print("Please install FFmpeg manually for your Linux distribution.")
return False
elif platform_name == "windows":
print("❌ FFmpeg installation not automated for Windows.")
print("Please install FFmpeg manually:")
print("1. Download from: https://ffmpeg.org/download.html")
print("2. Extract to a folder and add to PATH")
print("3. Or use Chocolatey: choco install ffmpeg")
return False
return False
def download_file(url, destination):
"""Download a file from URL to destination."""
print(f"📥 Downloading from: {url}")
print(f"📁 Saving to: {destination}")
try:
urllib.request.urlretrieve(url, destination)
print("✅ Download completed successfully!")
return True
except Exception as e:
print(f"❌ Download failed: {e}")
return False
def make_executable(file_path):
"""Make a file executable (for Unix-like systems)."""
try:
os.chmod(file_path, 0o755)
print(f"🔧 Made {file_path} executable")
except Exception as e:
print(f"⚠️ Could not make file executable: {e}")
def main():
print("🎤 Karaoke Video Downloader - Platform Setup")
print("=" * 50)
# Detect platform
platform_name, binary_name = detect_platform()
print(f"🖥️ Detected platform: {platform_name}")
print(f"📦 Binary name: {binary_name}")
# Create downloader directory if it doesn't exist
downloader_dir = Path("downloader")
downloader_dir.mkdir(exist_ok=True)
# Check if binary already exists
binary_path = downloader_dir / binary_name
if binary_path.exists():
print(f"{binary_name} already exists in downloader/ directory")
response = input("Do you want to re-download it? (y/N): ").strip().lower()
if response != 'y':
print("Setup completed!")
return
# Offer installation options
print(f"\n🔧 Installation options for {platform_name}:")
print("1. Download binary file (recommended for most users)")
print("2. Install via pip (alternative method)")
choice = input("Choose installation method (1 or 2): ").strip()
if choice == "2":
# Install via pip
if install_via_pip():
print(f"\n✅ yt-dlp installed successfully!")
# Test the installation
print(f"\n🧪 Testing yt-dlp installation...")
try:
import subprocess
result = subprocess.run([sys.executable, "-m", "yt_dlp", "--version"],
capture_output=True, text=True, timeout=10)
if result.returncode == 0:
version = result.stdout.strip()
print(f"✅ yt-dlp is working! Version: {version}")
else:
print(f"⚠️ yt-dlp test failed: {result.stderr}")
except Exception as e:
print(f"⚠️ Could not test yt-dlp: {e}")
# Check and install FFmpeg
print(f"\n🎬 Checking FFmpeg installation...")
if check_ffmpeg():
print(f"✅ FFmpeg is already installed and working!")
else:
print(f"⚠️ FFmpeg not found. Installing...")
if install_ffmpeg():
print(f"✅ FFmpeg installed successfully!")
else:
print(f"⚠️ FFmpeg installation failed. The tool will still work but may be slower.")
print(f"\n🎉 Setup completed successfully!")
print(f"📦 yt-dlp installed via pip")
print(f"🖥️ Platform: {platform_name}")
print(f"\n🎉 You're ready to use the Karaoke Video Downloader!")
print(f"Run: python download_karaoke.py --help")
return
else:
print("❌ Pip installation failed. Trying binary download...")
# Download binary file
try:
download_url = get_download_url(platform_name)
except ValueError as e:
print(f"{e}")
print("Please manually download yt-dlp for your platform from:")
print("https://github.com/yt-dlp/yt-dlp/releases/latest")
return
# Download the binary
print(f"\n🚀 Downloading yt-dlp for {platform_name}...")
if download_file(download_url, binary_path):
# Make executable on Unix-like systems
if platform_name in ["macos", "linux"]:
make_executable(binary_path)
print(f"\n✅ yt-dlp binary downloaded successfully!")
print(f"📁 yt-dlp binary location: {binary_path}")
print(f"🖥️ Platform: {platform_name}")
# Test the binary
print(f"\n🧪 Testing yt-dlp installation...")
try:
import subprocess
result = subprocess.run([str(binary_path), "--version"],
capture_output=True, text=True, timeout=10)
if result.returncode == 0:
version = result.stdout.strip()
print(f"✅ yt-dlp is working! Version: {version}")
else:
print(f"⚠️ yt-dlp test failed: {result.stderr}")
except Exception as e:
print(f"⚠️ Could not test yt-dlp: {e}")
# Check and install FFmpeg
print(f"\n🎬 Checking FFmpeg installation...")
if check_ffmpeg():
print(f"✅ FFmpeg is already installed and working!")
else:
print(f"⚠️ FFmpeg not found. Installing...")
if install_ffmpeg():
print(f"✅ FFmpeg installed successfully!")
else:
print(f"⚠️ FFmpeg installation failed. The tool will still work but may be slower.")
print(f"\n🎉 Setup completed successfully!")
print(f"📁 yt-dlp binary location: {binary_path}")
print(f"🖥️ Platform: {platform_name}")
print(f"\n🎉 You're ready to use the Karaoke Video Downloader!")
print(f"Run: python download_karaoke.py --help")
else:
print(f"\n❌ Setup failed. Please manually download yt-dlp for {platform_name}")
print(f"Download URL: {download_url}")
print(f"Save to: {binary_path}")
if __name__ == "__main__":
main()
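
The Linux branch of `install_ffmpeg()` above falls through from one package manager to the next using nested `try`/`except` blocks. A minimal sketch of an alternative, assuming it is acceptable to probe `PATH` before running anything: look each tool up with `shutil.which` and return the first install command whose tool exists. `CANDIDATES` and `first_available_command` are hypothetical names that do not appear in the repository, and the `which` parameter exists only so the lookup can be stubbed.

```python
import shutil
from typing import Callable, List, Optional

# Hypothetical table of install commands, mirroring the ones used in install_ffmpeg().
CANDIDATES: List[List[str]] = [
    ["brew", "install", "ffmpeg"],
    ["sudo", "port", "install", "ffmpeg"],
    ["sudo", "apt", "install", "-y", "ffmpeg"],
    ["sudo", "yum", "install", "-y", "ffmpeg"],
]


def first_available_command(
    candidates: List[List[str]],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[List[str]]:
    """Return the first command whose package manager is found on PATH."""
    for command in candidates:
        # For "sudo <tool> ..." the tool to probe is the second element.
        tool = command[1] if command[0] == "sudo" else command[0]
        if which(tool) is not None:
            return command
    return None


if __name__ == "__main__":
    print(first_available_command(CANDIDATES))
```

The sketch only selects a command; running it with `subprocess.run(..., check=True)` and handling `CalledProcessError` would still be needed, and it does not check that `sudo` itself is present.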

@ -1,198 +0,0 @@
#!/usr/bin/env python3
"""
Helper script to add manual videos to the manual videos collection.
"""
import json
import re
from pathlib import Path
from typing import Dict, List, Optional
from karaoke_downloader.data_path_manager import get_data_path_manager
def extract_video_id(url: str) -> Optional[str]:
"""Extract video ID from YouTube URL."""
patterns = [
r'(?:youtube\.com/watch\?v=|youtu\.be/|youtube\.com/embed/)([a-zA-Z0-9_-]{11})',
r'youtube\.com/watch\?.*v=([a-zA-Z0-9_-]{11})'
]
for pattern in patterns:
match = re.search(pattern, url)
if match:
return match.group(1)
return None
def add_manual_video(title: str, url: str, manual_file: Optional[str] = None) -> bool:
    """
    Add a manual video to the collection.
    Args:
        title: Video title (e.g., "Artist - Song (Karaoke Version)")
        url: YouTube URL
        manual_file: Path to manual videos JSON file (defaults to the path
            reported by the data path manager)
    Returns:
        True if the video was added, False otherwise
    """
    if manual_file is None:
        manual_file = str(get_data_path_manager().get_manual_videos_path())
    manual_path = Path(manual_file)
    # Load existing data or create new
    if manual_path.exists():
        with open(manual_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
    else:
        data = {
            "channel_name": "@ManualVideos",
            "channel_url": "manual://static",
            "description": "Manual collection of individual karaoke videos",
            "videos": [],
            "parsing_rules": {
                "format": "artist_title_separator",
                "separator": " - ",
                "artist_first": True,
                "title_cleanup": {
                    "remove_suffix": {
                        "suffixes": ["(Karaoke)", "(Karaoke Version)", "(Karaoke Version) Lyrics"]
                    }
                }
            }
        }
    # Extract video ID
    video_id = extract_video_id(url)
    if not video_id:
        print(f"❌ Could not extract video ID from URL: {url}")
        return False
    # Check if video already exists
    existing_ids = [video.get("id") for video in data["videos"]]
    if video_id in existing_ids:
        print(f"⚠️ Video already exists: {title}")
        return False
    # Add new video
    new_video = {
        "title": title,
        "url": url,
        "id": video_id,
        "upload_date": "2024-01-01",  # Default date
        "duration": 180,  # Default duration
        "view_count": 1000  # Default view count
    }
    data["videos"].append(new_video)
    # Save updated data
    manual_path.parent.mkdir(parents=True, exist_ok=True)
    with open(manual_path, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    print(f"✅ Added video: {title}")
    print(f" URL: {url}")
    print(f" ID: {video_id}")
    return True
def list_manual_videos(manual_file: Optional[str] = None) -> None:
    """List all manual videos."""
    if manual_file is None:
        manual_file = str(get_data_path_manager().get_manual_videos_path())
    manual_path = Path(manual_file)
    if not manual_path.exists():
        print("❌ No manual videos file found")
        return
    with open(manual_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    print(f"📋 Manual Videos ({len(data['videos'])} videos):")
    print("=" * 60)
    for i, video in enumerate(data['videos'], 1):
        print(f"{i:2d}. {video['title']}")
        print(f" URL: {video['url']}")
        print(f" ID: {video['id']}")
        print()
def remove_manual_video(video_id: str, manual_file: Optional[str] = None) -> bool:
    """Remove a manual video by ID."""
    if manual_file is None:
        manual_file = str(get_data_path_manager().get_manual_videos_path())
    manual_path = Path(manual_file)
    if not manual_path.exists():
        print("❌ No manual videos file found")
        return False
    with open(manual_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    # Find and remove video
    for i, video in enumerate(data['videos']):
        if video['id'] == video_id:
            removed_video = data['videos'].pop(i)
            with open(manual_path, 'w', encoding='utf-8') as f:
                json.dump(data, f, indent=2, ensure_ascii=False)
            print(f"✅ Removed video: {removed_video['title']}")
            return True
    print(f"❌ Video with ID '{video_id}' not found")
    return False
def main():
"""Interactive mode for adding manual videos."""
print("🎤 Manual Video Manager")
print("=" * 30)
print("1. Add video")
print("2. List videos")
print("3. Remove video")
print("4. Exit")
while True:
choice = input("\nSelect option (1-4): ").strip()
if choice == "1":
title = input("Enter video title (e.g., 'Artist - Song (Karaoke Version)'): ").strip()
url = input("Enter YouTube URL: ").strip()
if title and url:
add_manual_video(title, url)
else:
print("❌ Title and URL are required")
elif choice == "2":
list_manual_videos()
elif choice == "3":
video_id = input("Enter video ID to remove: ").strip()
if video_id:
remove_manual_video(video_id)
else:
print("❌ Video ID is required")
elif choice == "4":
print("👋 Goodbye!")
break
else:
print("❌ Invalid option")
if __name__ == "__main__":
import sys
if len(sys.argv) > 1:
# Command line mode
if sys.argv[1] == "add" and len(sys.argv) >= 4:
add_manual_video(sys.argv[2], sys.argv[3])
elif sys.argv[1] == "list":
list_manual_videos()
elif sys.argv[1] == "remove" and len(sys.argv) >= 3:
remove_manual_video(sys.argv[2])
else:
print("Usage:")
print(" python add_manual_video.py add 'Title' 'URL'")
print(" python add_manual_video.py list")
print(" python add_manual_video.py remove VIDEO_ID")
else:
# Interactive mode
main()
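
`extract_video_id()` above matches the ID with two regular expressions. A minimal sketch of the same idea built on `urllib.parse`, assuming the input is a full URL with a scheme (the regex version also accepts scheme-less strings, which this one does not). `video_id_from_url` is a hypothetical name, and the set of accepted hosts is an assumption rather than something the script defines.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# YouTube video IDs handled by the script are 11 characters from this alphabet.
_ID_RE = re.compile(r"^[A-Za-z0-9_-]{11}$")


def video_id_from_url(url: str) -> Optional[str]:
    """Return the 11-character video ID, or None if the URL is not recognised."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    candidate = None
    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host in ("youtube.com", "m.youtube.com"):
        if parsed.path == "/watch":
            # parse_qs copes with "v" appearing anywhere in the query string.
            candidate = parse_qs(parsed.query).get("v", [None])[0]
        elif parsed.path.startswith("/embed/"):
            candidate = parsed.path.split("/")[2]

    if candidate and _ID_RE.match(candidate):
        return candidate
    return None


if __name__ == "__main__":
    print(video_id_from_url("https://youtu.be/dQw4w9WgXcQ?t=10"))
```

Parsing the query string avoids depending on where `v=` sits among other parameters; the trade-off is that hosts outside the accepted set are rejected outright.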

@ -1,127 +0,0 @@
#!/usr/bin/env python3
"""
Script to build channel cache from raw yt-dlp output file.
This uses the fixed parsing logic to handle titles with | characters.
"""
import json
import re
from datetime import datetime
from pathlib import Path
from karaoke_downloader.data_path_manager import get_data_path_manager
def parse_raw_output_file(raw_file_path):
"""Parse the raw output file and extract valid videos."""
videos = []
invalid_count = 0
print(f"🔍 Parsing raw output file: {raw_file_path}")
with open(raw_file_path, 'r', encoding='utf-8') as f:
lines = f.readlines()
# Skip header lines (lines starting with #)
data_lines = [line for line in lines if not line.strip().startswith('#') and line.strip()]
print(f"📄 Found {len(data_lines)} data lines to process")
for i, line in enumerate(data_lines):
if i % 1000 == 0 and i > 0: # Progress indicator every 1000 lines
print(f"📊 Processing line {i}/{len(data_lines)}... ({i/len(data_lines)*100:.1f}%)")
# Remove line number prefix (e.g., " 1234: ")
line = re.sub(r'^\s*\d+:\s*', '', line.strip())
# More robust parsing that handles titles with | characters
# Extract video ID directly from the URL that yt-dlp provides
# Find the URL and extract video ID from it
url_match = re.search(r'https://www\.youtube\.com/watch\?v=([a-zA-Z0-9_-]{11})', line)
if not url_match:
invalid_count += 1
if invalid_count <= 5:
print(f"⚠️ Skipping line with no URL: '{line[:100]}...'")
elif invalid_count == 6:
print("⚠️ ... suppressing further warnings for lines with no URL")
continue
# Extract video ID directly from the URL
video_id = url_match.group(1)
# Extract title (everything before the video ID in the line)
title = line[:line.find(video_id)].rstrip('|').strip()
# Validate video ID
if video_id and (
len(video_id) == 11 and
video_id.replace('-', '').replace('_', '').isalnum() and
" " not in video_id and
"Lyrics" not in video_id and
"KARAOKE" not in video_id.upper() and
"Vocal" not in video_id and
"Guide" not in video_id
):
videos.append({"title": title, "id": video_id})
else:
invalid_count += 1
if invalid_count <= 5: # Only show first 5 invalid IDs
print(f"⚠️ Skipping invalid video ID: '{video_id}' for title: '{title[:50]}...'")
elif invalid_count == 6:
print("⚠️ ... suppressing further warnings for invalid video IDs")
print(f"✅ Parsed {len(videos)} valid videos from raw output")
print(f"⚠️ Skipped {invalid_count} invalid video IDs")
return videos
def save_cache_file(channel_id, videos, cache_dir=None):
    """Save the parsed videos to a cache file."""
    if cache_dir is None:
        cache_dir = str(get_data_path_manager().get_channel_cache_dir())
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Sanitize channel ID for filename
    safe_channel_id = re.sub(r'[<>:"/\\|?*]', '_', channel_id)
    cache_file = cache_dir / f"{safe_channel_id}.json"
    data = {
        'channel_id': channel_id,
        'videos': videos,
        'last_updated': datetime.now().isoformat(),
        'video_count': len(videos)
    }
    with open(cache_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    print(f"💾 Saved cache to: {cache_file.name}")
    return cache_file
def main():
"""Main function to build cache from raw output."""
data_path_manager = get_data_path_manager()
raw_file_path = data_path_manager.get_channel_cache_dir() / "@VocalStarKaraoke_raw_output.txt"
if not raw_file_path.exists():
print(f"❌ Raw output file not found: {raw_file_path}")
return
# Parse the raw output file
videos = parse_raw_output_file(raw_file_path)
if not videos:
print("❌ No valid videos found")
return
# Save to cache file
channel_id = "@VocalStarKaraoke"
cache_file = save_cache_file(channel_id, videos)
print(f"🎉 Cache build complete!")
print(f"📊 Total videos in cache: {len(videos)}")
print(f"📁 Cache file: {cache_file}")
if __name__ == "__main__":
main()
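
The per-line logic in `parse_raw_output_file()` can be tried in isolation. The sketch below repeats its steps (strip the line-number prefix, locate the watch URL, slice the title off the front) on a single line. The sample line format `Title|VIDEO_ID|URL` is an assumption made for illustration; the actual layout of the raw output file is not shown in this diff, and `parse_line` is a hypothetical helper.

```python
import re
from typing import Dict, Optional

URL_RE = re.compile(r"https://www\.youtube\.com/watch\?v=([a-zA-Z0-9_-]{11})")


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Extract title and video ID from one raw line, or None if no URL is present."""
    # Drop a leading "1234: " style line-number prefix.
    line = re.sub(r"^\s*\d+:\s*", "", line.strip())
    match = URL_RE.search(line)
    if match is None:
        return None
    video_id = match.group(1)
    # Everything before the first occurrence of the ID is treated as the title.
    title = line[: line.find(video_id)].rstrip("|").strip()
    return {"title": title, "id": video_id}


if __name__ == "__main__":
    sample = (
        "  12: Artist - Song | Karaoke Version|abcdefghijk|"
        "https://www.youtube.com/watch?v=abcdefghijk"
    )
    print(parse_line(sample))
```

Because the title is cut at the first occurrence of the ID, a line that carries the ID only inside the URL would leave the URL prefix in the title; the slice works cleanly only when the ID appears as its own field before the URL.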

@ -1,164 +0,0 @@
#!/usr/bin/env python3
"""
Utility script to identify and clean up duplicate files with (2), (3) suffixes.
This helps clean up files that were created before the duplicate prevention was implemented.
"""
import json
import re
from pathlib import Path
from typing import Dict, List, Tuple
def find_duplicate_files(downloads_dir: str = "downloads") -> Dict[str, List[Tuple[Path, int]]]:
"""
Find duplicate files with (2), (3), etc. suffixes in the downloads directory.
Args:
downloads_dir: Path to downloads directory
Returns:
Dictionary mapping base filenames to lists of (file path, suffix number) tuples
"""
downloads_path = Path(downloads_dir)
if not downloads_path.exists():
print(f"❌ Downloads directory not found: {downloads_dir}")
return {}
duplicates = {}
# Scan all MP4 files in the downloads directory
for mp4_file in downloads_path.rglob("*.mp4"):
filename = mp4_file.name
# Check if this is a duplicate file with (2), (3), etc.
match = re.match(r'^(.+?)\s*\((\d+)\)\.mp4$', filename)
if match:
base_name = match.group(1)
suffix_num = int(match.group(2))
if base_name not in duplicates:
duplicates[base_name] = []
duplicates[base_name].append((mp4_file, suffix_num))
# Sort duplicates by suffix number
for base_name in duplicates:
duplicates[base_name].sort(key=lambda x: x[1])
return duplicates
def analyze_duplicates(duplicates: Dict[str, List[Tuple[Path, int]]]) -> None:
"""
Analyze and display information about found duplicates.
Args:
duplicates: Dictionary of duplicate files
"""
if not duplicates:
print("✅ No duplicate files found!")
return
print(f"🔍 Found {len(duplicates)} sets of duplicate files:")
print()
total_duplicates = 0
for base_name, files in duplicates.items():
print(f"📁 {base_name}")
for file_path, suffix in files:
file_size = file_path.stat().st_size / (1024 * 1024) # MB
print(f" ({suffix}) {file_path.name} - {file_size:.1f} MB")
print()
total_duplicates += len(files) - 1 # -1 because the lowest-numbered copy is kept
print(f"📊 Summary: {len(duplicates)} base files with {total_duplicates} duplicate files")
def cleanup_duplicates(duplicates: Dict[str, List[Tuple[Path, int]]], dry_run: bool = True) -> None:
"""
Clean up duplicate files, keeping only the first occurrence.
Args:
duplicates: Dictionary of duplicate files
dry_run: If True, only show what would be deleted without actually deleting
"""
if not duplicates:
print("✅ No duplicates to clean up!")
return
mode = "DRY RUN" if dry_run else "ACTUAL CLEANUP"
print(f"🧹 Starting {mode}...")
print()
total_deleted = 0
total_size_freed = 0
for base_name, files in duplicates.items():
print(f"📁 Processing: {base_name}")
# Keep the first file (lowest suffix number), delete the rest
files_to_delete = files[1:] # Skip the first file
for file_path, suffix in files_to_delete:
file_size = file_path.stat().st_size / (1024 * 1024) # MB
if dry_run:
print(f" 🗑️ Would delete: {file_path.name} ({file_size:.1f} MB)")
else:
try:
file_path.unlink()
print(f" ✅ Deleted: {file_path.name} ({file_size:.1f} MB)")
total_deleted += 1
total_size_freed += file_size
except Exception as e:
print(f" ❌ Failed to delete {file_path.name}: {e}")
print()
if dry_run:
print(f"📊 DRY RUN SUMMARY: Would delete {len([f for files in duplicates.values() for f in files[1:]])} files")
else:
print(f"📊 CLEANUP SUMMARY: Deleted {total_deleted} files, freed {total_size_freed:.1f} MB")
def main():
"""Main function to run the duplicate file cleanup."""
print("🎵 Karaoke Video Downloader - Duplicate File Cleanup")
print("=" * 50)
print()
# Find duplicates
duplicates = find_duplicate_files()
if not duplicates:
print("✅ No duplicate files found!")
return
# Analyze duplicates
analyze_duplicates(duplicates)
print()
# Ask user what to do
while True:
print("Options:")
print("1. Dry run (show what would be deleted)")
print("2. Actually delete duplicate files")
print("3. Exit without doing anything")
choice = input("\nEnter your choice (1-3): ").strip()
if choice == "1":
cleanup_duplicates(duplicates, dry_run=True)
break
elif choice == "2":
confirm = input("⚠️ Are you sure you want to delete duplicate files? (yes/no): ").strip().lower()
if confirm in ["yes", "y"]:
cleanup_duplicates(duplicates, dry_run=False)
else:
print("❌ Cleanup cancelled.")
break
elif choice == "3":
print("❌ Exiting without cleanup.")
break
else:
print("❌ Invalid choice. Please enter 1, 2, or 3.")
if __name__ == "__main__":
main()
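
The grouping step in `find_duplicate_files()` can be exercised without touching the filesystem. The sketch below applies the same regular expression to a plain list of filenames; `group_numbered_copies` is a hypothetical name, and working on strings instead of `Path` objects is a simplification for the example.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Same pattern as find_duplicate_files(): "<base> (<n>).mp4".
COPY_RE = re.compile(r"^(.+?)\s*\((\d+)\)\.mp4$")


def group_numbered_copies(filenames: Iterable[str]) -> Dict[str, List[Tuple[str, int]]]:
    """Group "(n)"-suffixed files by base name, sorted by suffix number."""
    groups: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for name in filenames:
        match = COPY_RE.match(name)
        if match:
            groups[match.group(1)].append((name, int(match.group(2))))
    return {base: sorted(items, key=lambda item: item[1]) for base, items in groups.items()}


if __name__ == "__main__":
    names = ["Song (3).mp4", "Song (2).mp4", "Song.mp4"]
    print(group_numbered_copies(names))
```

Two properties of the pattern are worth keeping in mind. The unsuffixed file (`Song.mp4`) never enters a group, so the cleanup keeps the lowest-numbered copy alongside it. A title that legitimately ends in a parenthesised number, such as a year, would also match and be treated as a numbered copy.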

@ -1,465 +0,0 @@
#!/usr/bin/env python3
r"""
Fix artist name formatting for Let's Sing Karaoke channel.
This script specifically targets the "Last Name, First Name" format and converts it to
"First Name Last Name" format in ID3 tags. It only processes entries where there is exactly one comma
followed by exactly 2 words, to avoid affecting multi-artist entries.
Usage:
python fix_artist_name_format.py --preview # Show what would be changed
python fix_artist_name_format.py --apply # Actually make the changes
python fix_artist_name_format.py --external "D:\Karaoke\Karaoke\MP4\Let's Sing Karaoke" # Use external directory
"""
import json
import os
import re
import shutil
import argparse
from pathlib import Path
from typing import Dict, List, Tuple, Optional
# Try to import mutagen for ID3 tag manipulation
try:
from mutagen.mp4 import MP4
MUTAGEN_AVAILABLE = True
except ImportError:
MUTAGEN_AVAILABLE = False
print("⚠️ mutagen not available - install with: pip install mutagen")
def is_lastname_firstname_format(artist_name: str) -> bool:
"""
Check if artist name is in "Last Name, First Name" format.
Args:
artist_name: The artist name to check
Returns:
True if the name matches "Last Name, First Name" format with exactly 2 words after comma
"""
if ',' not in artist_name:
return False
# Split by comma
parts = artist_name.split(',', 1)
if len(parts) != 2:
return False
last_name = parts[0].strip()
first_name_part = parts[1].strip()
# Check if there are exactly 2 words after the comma
words_after_comma = first_name_part.split()
if len(words_after_comma) != 2:
return False
# Additional check: make sure it's not a multi-artist entry
# If there are more than 4 words total in the artist name, it might be multi-artist
total_words = len(artist_name.split())
if total_words > 4: # Last, First Name (4 words max for single artist)
return False
return True
def convert_to_firstname_lastname(artist_name: str) -> str:
"""
Convert "Last Name, First Name" to "First Name Last Name".
Args:
artist_name: Artist name in "Last Name, First Name" format
Returns:
Artist name in "First Name Last Name" format
"""
parts = artist_name.split(',', 1)
last_name = parts[0].strip()
first_name_part = parts[1].strip()
# Split the first name part into words
words = first_name_part.split()
if len(words) == 2:
first_name = words[0]
middle_name = words[1]
return f"{first_name} {middle_name} {last_name}"
else:
# Fallback - just reverse the parts
return f"{first_name_part} {last_name}"
def extract_artist_title_from_filename(filename: str) -> Tuple[str, str]:
"""
Extract artist and title from a filename.
Args:
filename: MP4 filename (without extension)
Returns:
Tuple of (artist, title)
"""
# Remove .mp4 extension
if filename.endswith('.mp4'):
filename = filename[:-4]
# Look for " - " separator
if " - " in filename:
parts = filename.split(" - ", 1)
return parts[0].strip(), parts[1].strip()
return "", filename
def update_id3_tags(file_path: str, new_artist: str, apply_changes: bool = False) -> bool:
"""
Update the ID3 tags in an MP4 file.
Args:
file_path: Path to the MP4 file
new_artist: New artist name to set
apply_changes: Whether to actually apply changes or just preview
Returns:
True if successful, False otherwise
"""
if not MUTAGEN_AVAILABLE:
print(f"⚠️ mutagen not available - cannot update ID3 tags for {file_path}")
return False
try:
mp4 = MP4(file_path)
if apply_changes:
# Update the artist tag
mp4["\xa9ART"] = new_artist
mp4.save()
print(f"📝 Updated ID3 tag: {os.path.basename(file_path)} → Artist: '{new_artist}'")
else:
# Just preview what would be changed
current_artist = mp4.get("\xa9ART", ["Unknown"])[0] if "\xa9ART" in mp4 else "Unknown"
print(f"📝 Would update ID3 tag: {os.path.basename(file_path)} → Artist: '{current_artist}' → '{new_artist}'")
return True
except Exception as e:
print(f"❌ Failed to update ID3 tags for {file_path}: {e}")
return False
def scan_external_directory(directory_path: str) -> List[Dict]:
"""
Scan external directory for MP4 files with "Last Name, First Name" format in ID3 tags.
Args:
directory_path: Path to the external directory
Returns:
List of files that need ID3 tag updates
"""
if not os.path.exists(directory_path):
print(f"❌ Directory not found: {directory_path}")
return []
if not MUTAGEN_AVAILABLE:
print("❌ mutagen not available - cannot scan ID3 tags")
return []
files_to_update = []
# Scan for MP4 files
for file_path in Path(directory_path).glob("*.mp4"):
try:
mp4 = MP4(str(file_path))
current_artist = mp4.get("\xa9ART", ["Unknown"])[0] if "\xa9ART" in mp4 else "Unknown"
if current_artist and is_lastname_firstname_format(current_artist):
new_artist = convert_to_firstname_lastname(current_artist)
files_to_update.append({
'file_path': str(file_path),
'filename': file_path.name,
'old_artist': current_artist,
'new_artist': new_artist
})
except Exception as e:
print(f"⚠️ Could not read ID3 tags from {file_path.name}: {e}")
return files_to_update
def update_tracking_file(tracking_file: str, channel_name: str = "Let's Sing Karaoke", apply_changes: bool = False) -> Tuple[int, List[Dict]]:
"""
Update the karaoke tracking file to fix artist name formatting.
Args:
tracking_file: Path to the tracking JSON file
channel_name: Channel name to target (default: Let's Sing Karaoke)
apply_changes: Whether to actually apply changes or just preview
Returns:
Tuple of (number of changes made, list of changed entries)
"""
if not os.path.exists(tracking_file):
print(f"❌ Tracking file not found: {tracking_file}")
return 0, []
# Load the tracking data
with open(tracking_file, 'r', encoding='utf-8') as f:
data = json.load(f)
changes_made = 0
changed_entries = []
# Process songs
for song_key, song_data in data.get('songs', {}).items():
if song_data.get('channel_name') != channel_name:
continue
artist = song_data.get('artist', '')
if not artist or not is_lastname_firstname_format(artist):
continue
# Convert the artist name
new_artist = convert_to_firstname_lastname(artist)
if apply_changes:
# Update the tracking data
song_data['artist'] = new_artist
# Update the video title if it exists and contains the old artist name
video_title = song_data.get('video_title', '')
if video_title and artist in video_title:
song_data['video_title'] = video_title.replace(artist, new_artist)
# Update the file path if it exists
file_path = song_data.get('file_path', '')
if file_path and artist in file_path:
song_data['file_path'] = file_path.replace(artist, new_artist)
changes_made += 1
changed_entries.append({
'song_key': song_key,
'old_artist': artist,
'new_artist': new_artist,
'title': song_data.get('title', ''),
'file_path': song_data.get('file_path', '')
})
print(f"🔄 {'Updated' if apply_changes else 'Would update'}: '{artist}' → '{new_artist}' ({song_data.get('title', '')})")
# Save the updated data
if apply_changes and changes_made > 0:
# Create backup
backup_file = f"{tracking_file}.backup"
shutil.copy2(tracking_file, backup_file)
print(f"💾 Created backup: {backup_file}")
# Save updated file
with open(tracking_file, 'w', encoding='utf-8') as f:
json.dump(data, f, indent=2, ensure_ascii=False)
print(f"💾 Updated tracking file: {tracking_file}")
return changes_made, changed_entries
def update_songlist_tracking(songlist_file: str, channel_name: str = "Let's Sing Karaoke", apply_changes: bool = False) -> Tuple[int, List[Dict]]:
"""
Update the songlist tracking file to fix artist name formatting.
Args:
songlist_file: Path to the songlist tracking JSON file
channel_name: Channel name to target (default: Let's Sing Karaoke)
apply_changes: Whether to actually apply changes or just preview
Returns:
Tuple of (number of changes made, list of changed entries)
"""
if not os.path.exists(songlist_file):
print(f"❌ Songlist tracking file not found: {songlist_file}")
return 0, []
# Load the songlist data
with open(songlist_file, 'r', encoding='utf-8') as f:
data = json.load(f)
changes_made = 0
changed_entries = []
# Process songlist entries
for song_key, song_data in data.items():
artist = song_data.get('artist', '')
if not artist or not is_lastname_firstname_format(artist):
continue
# Convert the artist name
new_artist = convert_to_firstname_lastname(artist)
if apply_changes:
# Update the songlist data
song_data['artist'] = new_artist
changes_made += 1
changed_entries.append({
'song_key': song_key,
'old_artist': artist,
'new_artist': new_artist,
'title': song_data.get('title', '')
})
print(f"🔄 {'Updated' if apply_changes else 'Would update'} songlist: '{artist}' → '{new_artist}' ({song_data.get('title', '')})")
# Save the updated data
if apply_changes and changes_made > 0:
# Create backup
backup_file = f"{songlist_file}.backup"
shutil.copy2(songlist_file, backup_file)
print(f"💾 Created backup: {backup_file}")
# Save updated file
with open(songlist_file, 'w', encoding='utf-8') as f:
json.dump(data, f, indent=2, ensure_ascii=False)
print(f"💾 Updated songlist file: {songlist_file}")
return changes_made, changed_entries
def update_id3_tags_for_files(files_to_update: List[Dict], apply_changes: bool = False) -> int:
"""
Update ID3 tags for a list of files.
Args:
files_to_update: List of files to update
apply_changes: Whether to actually apply changes or just preview
Returns:
Number of files successfully updated
"""
updated_count = 0
for file_info in files_to_update:
file_path = file_info['file_path']
new_artist = file_info['new_artist']
if update_id3_tags(file_path, new_artist, apply_changes):
updated_count += 1
return updated_count
def main():
"""Main function to run the artist name fix script."""
parser = argparse.ArgumentParser(description="Fix artist name formatting in ID3 tags for Let's Sing Karaoke")
parser.add_argument('--preview', action='store_true', help='Show what would be changed without making changes')
parser.add_argument('--apply', action='store_true', help='Actually apply the changes')
parser.add_argument('--external', type=str, help='Path to external karaoke directory')
args = parser.parse_args()
# Default to preview mode if no action specified
if not args.preview and not args.apply:
args.preview = True
print("🎤 Artist Name Format Fix Script (ID3 Tags Only)")
print("=" * 60)
print("This script will fix 'Last Name, First Name' format to 'First Name Last Name'")
print("Only targeting Let's Sing Karaoke channel to avoid affecting other channels.")
print("Focusing on ID3 tags only - filenames will not be changed.")
print()
if not MUTAGEN_AVAILABLE:
print("❌ mutagen library not available!")
print("Please install it with: pip install mutagen")
return
if args.preview:
print("🔍 PREVIEW MODE - No changes will be made")
else:
print("⚡ APPLY MODE - Changes will be made")
print()
# File paths
tracking_file = "data/karaoke_tracking.json"
songlist_file = "data/songlist_tracking.json"
# Process external directory if specified
if args.external:
print(f"📁 Scanning external directory: {args.external}")
external_files = scan_external_directory(args.external)
if external_files:
print(f"\n📋 Found {len(external_files)} files with 'Last Name, First Name' format in ID3 tags:")
for file_info in external_files:
print(f"{file_info['filename']}: '{file_info['old_artist']}' → '{file_info['new_artist']}'")
if args.apply:
print(f"\n📝 Updating ID3 tags in external files...")
updated_count = update_id3_tags_for_files(external_files, apply_changes=True)
print(f"✅ Updated ID3 tags in {updated_count} external files")
else:
print(f"\n📝 Would update ID3 tags in {len(external_files)} external files")
else:
print("✅ No files with 'Last Name, First Name' format found in ID3 tags")
# Process tracking files (only if they exist in current project)
if os.path.exists(tracking_file):
print(f"\n📊 Processing karaoke tracking file...")
tracking_changes, tracking_entries = update_tracking_file(tracking_file, apply_changes=args.apply)
else:
print(f"\n⚠️ Tracking file not found: {tracking_file}")
tracking_changes, tracking_entries = 0, []
if os.path.exists(songlist_file):
print(f"\n📊 Processing songlist tracking file...")
songlist_changes, songlist_entries = update_songlist_tracking(songlist_file, apply_changes=args.apply)
else:
print(f"\n⚠️ Songlist tracking file not found: {songlist_file}")
songlist_changes = 0
# Process local downloads directory ID3 tags
downloads_dir = "downloads"
local_id3_updates = 0
if os.path.exists(downloads_dir) and tracking_changes > 0:
print(f"\n📝 Processing ID3 tags in local downloads directory...")
# Scan local downloads for files that need ID3 tag updates
local_files = []
for entry in tracking_entries:
file_path = entry.get('file_path', '')
if file_path and os.path.exists(file_path.replace('\\', '/')):
local_files.append({
'file_path': file_path.replace('\\', '/'),
'filename': os.path.basename(file_path),
'old_artist': entry['old_artist'],
'new_artist': entry['new_artist']
})
if local_files:
local_id3_updates = update_id3_tags_for_files(local_files, apply_changes=args.apply)
total_changes = tracking_changes + songlist_changes
print("\n" + "=" * 60)
print("📋 Summary:")
print(f" • Tracking file changes: {tracking_changes}")
print(f" • Songlist file changes: {songlist_changes}")
print(f" • Local ID3 tag updates: {local_id3_updates}")
print(f" • Total changes: {total_changes}")
if args.external:
external_count = len(external_files)
print(f" • External ID3 tag updates: {external_count}")
if total_changes > 0 or (args.external and external_count > 0):
if args.apply:
print("\n✅ Artist name formatting in ID3 tags has been fixed!")
print("💾 Backups have been created for the modified tracking files (MP4 files are updated in place).")
print("🔄 You may need to re-run your karaoke downloader to update any cached data.")
else:
print("\n🔍 Preview complete. Use --apply to make these changes.")
else:
print("\n✅ No changes needed! All artist names are already in the correct format.")
if __name__ == "__main__":
main()
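
The detection and conversion rules in `is_lastname_firstname_format()` and `convert_to_firstname_lastname()` can be combined into one small function for experimentation. This is a sketch of the rule as written above (exactly two words after a single split on the first comma, at most four words in total), not a replacement for the script. `normalize_artist` and its `allowed_given_words` parameter are hypothetical; the parameter is there only to show how a looser rule behaves.

```python
from typing import Tuple


def normalize_artist(name: str, allowed_given_words: Tuple[int, ...] = (2,)) -> str:
    """Turn "Last, First Middle" into "First Middle Last" when the rule applies."""
    if "," not in name:
        return name
    last, given = (part.strip() for part in name.split(",", 1))
    # The given-name part must have an allowed number of words.
    if len(given.split()) not in allowed_given_words:
        return name
    # Long names are assumed to be multi-artist entries and left alone.
    if len(name.split()) > 4:
        return name
    return f"{given} {last}"


if __name__ == "__main__":
    for artist in ["Jepsen, Carly Rae", "Mars, Bruno", "Earth, Wind & Fire"]:
        print(artist, "->", normalize_artist(artist))
```

With the default rule, a two-part name such as "Mars, Bruno" is left unchanged because only one word follows the comma; passing `allowed_given_words=(1, 2)` converts it. The rule is purely positional, so a stage name like "Tyler, The Creator" is also rewritten, which is a reason to review preview output before applying changes.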

@ -1,295 +0,0 @@
#!/usr/bin/env python3
r"""
Fix artist name formatting for Let's Sing Karaoke channel.
This script specifically targets the "Last Name, First Name" format and converts it to
"First Name Last Name" format in ID3 tags. It only processes entries where there is exactly one comma
followed by exactly 2 words, to avoid affecting multi-artist entries.
Usage:
python fix_artist_name_format_simple.py --preview # Show what would be changed
python fix_artist_name_format_simple.py --apply # Actually make the changes
python fix_artist_name_format_simple.py --external "D:\Karaoke\Karaoke\MP4\Let's Sing Karaoke" # Use external directory
"""
import json
import os
import re
import shutil
import argparse
from pathlib import Path
from typing import Dict, List, Tuple, Optional
# Try to import mutagen for ID3 tag manipulation
try:
from mutagen.mp4 import MP4
MUTAGEN_AVAILABLE = True
except ImportError:
MUTAGEN_AVAILABLE = False
print("WARNING: mutagen not available - install with: pip install mutagen")

def is_lastname_firstname_format(artist_name: str) -> bool:
    """
    Check if artist name is in "Last Name, First Name" format.

    Args:
        artist_name: The artist name to check

    Returns:
        True if the name matches "Last Name, First Name" format with exactly 1 or 2 words after comma
    """
    if ',' not in artist_name:
        return False
    # Split by comma
    parts = artist_name.split(',', 1)
    if len(parts) != 2:
        return False
    last_name = parts[0].strip()
    first_name_part = parts[1].strip()
    # Check if there are exactly 1 or 2 words after the comma
    words_after_comma = first_name_part.split()
    if len(words_after_comma) not in [1, 2]:
        return False
    # Additional check: make sure it's not a multi-artist entry
    # If there are more than 4 words total in the artist name, it might be multi-artist
    total_words = len(artist_name.split())
    if total_words > 4:  # Last, First Name (4 words max for single artist)
        return False
    return True

def convert_lastname_firstname(artist_name: str) -> str:
    """
    Convert "Last Name, First Name" to "First Name Last Name".

    Args:
        artist_name: The artist name to convert

    Returns:
        The converted artist name
    """
    if ',' not in artist_name:
        return artist_name
    parts = artist_name.split(',', 1)
    if len(parts) != 2:
        return artist_name
    last_name = parts[0].strip()
    first_name = parts[1].strip()
    return f"{first_name} {last_name}"

def process_artist_name(artist_name: str) -> str:
    """
    Process an artist name, handling both single artists and multiple artists separated by "&".

    Args:
        artist_name: The artist name to process

    Returns:
        The processed artist name
    """
    if '&' in artist_name:
        # Split by "&" and process each artist individually
        artists = [artist.strip() for artist in artist_name.split('&')]
        processed_artists = []
        for artist in artists:
            if is_lastname_firstname_format(artist):
                processed_artist = convert_lastname_firstname(artist)
                processed_artists.append(processed_artist)
            else:
                processed_artists.append(artist)
        # Rejoin with "&"
        return ' & '.join(processed_artists)
    else:
        # Single artist
        if is_lastname_firstname_format(artist_name):
            return convert_lastname_firstname(artist_name)
        else:
            return artist_name

def update_id3_tags(file_path: str, new_artist: str, apply_changes: bool = False) -> bool:
    """
    Update the ID3 tags in an MP4 file.

    Args:
        file_path: Path to the MP4 file
        new_artist: New artist name to set
        apply_changes: Whether to actually apply changes or just preview

    Returns:
        True if successful, False otherwise
    """
    if not MUTAGEN_AVAILABLE:
        print(f"WARNING: mutagen not available - cannot update ID3 tags for {file_path}")
        return False
    try:
        mp4 = MP4(file_path)
        if apply_changes:
            # Update the artist tag (MP4 text tags hold a list of values)
            mp4["\xa9ART"] = [new_artist]
            mp4.save()
            print(f"UPDATED ID3 tag: {os.path.basename(file_path)} -> Artist: '{new_artist}'")
        else:
            # Just preview what would be changed; fall back to "Unknown" if the tag is missing or empty
            current_artist = (mp4.get("\xa9ART") or ["Unknown"])[0]
            print(f"WOULD UPDATE ID3 tag: {os.path.basename(file_path)} -> Artist: '{current_artist}' -> '{new_artist}'")
        return True
    except Exception as e:
        print(f"ERROR: Failed to update ID3 tags for {file_path}: {e}")
        return False

def scan_external_directory(directory_path: str, debug: bool = False) -> List[Dict]:
    """
    Scan external directory for MP4 files with "Last Name, First Name" format in ID3 tags.

    Args:
        directory_path: Path to the external directory
        debug: Whether to show debug information

    Returns:
        List of files that need ID3 tag updates
    """
    if not os.path.exists(directory_path):
        print(f"ERROR: Directory not found: {directory_path}")
        return []
    if not MUTAGEN_AVAILABLE:
        print("ERROR: mutagen not available - cannot scan ID3 tags")
        return []

    files_to_update = []
    total_files = 0
    files_with_artist_tags = 0

    # Scan for MP4 files (top level of the directory only)
    for file_path in Path(directory_path).glob("*.mp4"):
        total_files += 1
        try:
            mp4 = MP4(str(file_path))
            # Fall back to "Unknown" if the artist tag is missing or empty
            current_artist = (mp4.get("\xa9ART") or ["Unknown"])[0]
            if current_artist != "Unknown":
                files_with_artist_tags += 1
                if debug:
                    print(f"DEBUG: {file_path.name} -> Artist: '{current_artist}'")
                # Process the artist name to handle multiple artists
                processed_artist = process_artist_name(current_artist)
                if processed_artist != current_artist:
                    files_to_update.append({
                        'file_path': str(file_path),
                        'filename': file_path.name,
                        'old_artist': current_artist,
                        'new_artist': processed_artist
                    })
                    if debug:
                        print(f"DEBUG: MATCH FOUND - {file_path.name}: '{current_artist}' -> '{processed_artist}'")
        except Exception as e:
            if debug:
                print(f"WARNING: Could not read ID3 tags from {file_path.name}: {e}")

    print(f"INFO: Scanned {total_files} MP4 files, {files_with_artist_tags} had artist tags, {len(files_to_update)} need updates")
    return files_to_update

def update_id3_tags_for_files(files_to_update: List[Dict], apply_changes: bool = False) -> int:
    """
    Update ID3 tags for a list of files.

    Args:
        files_to_update: List of files to update
        apply_changes: Whether to actually apply changes or just preview

    Returns:
        Number of files successfully updated
    """
    updated_count = 0
    for file_info in files_to_update:
        file_path = file_info['file_path']
        new_artist = file_info['new_artist']
        if update_id3_tags(file_path, new_artist, apply_changes):
            updated_count += 1
    return updated_count

def main():
    """Main function to run the artist name fix script."""
    parser = argparse.ArgumentParser(description="Fix artist name formatting in ID3 tags for Let's Sing Karaoke")
    # --preview and --apply are mutually exclusive so the mode banner always matches what actually runs
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument('--preview', action='store_true', help='Show what would be changed without making changes')
    mode.add_argument('--apply', action='store_true', help='Actually apply the changes')
    parser.add_argument('--external', type=str, help='Path to external karaoke directory')
    parser.add_argument('--debug', action='store_true', help='Show debug information')
    args = parser.parse_args()

    # Default to preview mode if no action specified
    if not args.preview and not args.apply:
        args.preview = True

    print("Artist Name Format Fix Script (ID3 Tags Only)")
    print("=" * 60)
    print("This script will fix 'Last Name, First Name' format to 'First Name Last Name'")
    print("Only targeting Let's Sing Karaoke channel to avoid affecting other channels.")
    print("Focusing on ID3 tags only - filenames will not be changed.")
    print()

    if not MUTAGEN_AVAILABLE:
        print("ERROR: mutagen library not available!")
        print("Please install it with: pip install mutagen")
        return

    if args.preview:
        print("PREVIEW MODE - No changes will be made")
    else:
        print("APPLY MODE - Changes will be made")
    print()

    # Process external directory if specified
    if args.external:
        print(f"Scanning external directory: {args.external}")
        external_files = scan_external_directory(args.external, debug=args.debug)
        if external_files:
            print(f"\nFound {len(external_files)} files with 'Last Name, First Name' format in ID3 tags:")
            for file_info in external_files:
                print(f" * {file_info['filename']}: '{file_info['old_artist']}' -> '{file_info['new_artist']}'")
            if args.apply:
                print(f"\nUpdating ID3 tags in external files...")
                updated_count = update_id3_tags_for_files(external_files, apply_changes=True)
                print(f"SUCCESS: Updated ID3 tags in {updated_count} external files")
            else:
                print(f"\nWould update ID3 tags in {len(external_files)} external files")
        else:
            print("SUCCESS: No files with 'Last Name, First Name' format found in ID3 tags")

    print("\n" + "=" * 60)
    print("Summary complete.")


if __name__ == "__main__":
    main()
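
The conversion rule in the script above splits on `&` first and then flips each part that looks like "Last, First". The sketch below restates that rule in one condensed function so its behavior on a few inputs is easy to check. The name `flip_name` and the sample artists are illustrative only, and unlike the script this version also trims surrounding whitespace on single-artist input.

```python
def flip_name(artist: str) -> str:
    """Condensed restatement of the rule: flip "Last, First" in each '&'-separated part."""
    flipped = []
    for part in (p.strip() for p in artist.split("&")):
        last, sep, first = part.partition(",")
        # Flip only when a comma is present, one or two words follow it,
        # and the part has at most four words overall.
        if sep and len(first.split()) in (1, 2) and len(part.split()) <= 4:
            part = f"{first.strip()} {last.strip()}"
        flipped.append(part)
    return " & ".join(flipped)


if __name__ == "__main__":
    for name in ("Swift, Taylor", "John, Elton & Dee, Kiki", "The Beatles", "Earth, Wind & Fire"):
        print(f"{name!r} -> {flip_name(name)!r}")
```

The last sample is worth noting: because the split on `&` happens before the comma check, a band name such as "Earth, Wind & Fire" has its left half treated as "Last, First" and comes out as "Wind Earth & Fire". Running with `--preview` first is the practical guard against such cases.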

@ -1,151 +0,0 @@
#!/usr/bin/env python3
"""
Script to reset karaoke tracking and re-download files with the new channel parser.
This script will:
1. Reset the karaoke_tracking.json to remove all downloaded entries
2. Optionally delete the downloaded files
3. Allow you to re-download with the new channel parser system
"""
import json
import os
import shutil
from pathlib import Path
from typing import List, Dict, Any
from karaoke_downloader.data_path_manager import get_data_path_manager

def reset_karaoke_tracking(tracking_file: str = None) -> None:
    """Reset the karaoke tracking file to empty state."""
    # Fall back to the configured tracking path when none is given
    if tracking_file is None:
        tracking_file = str(get_data_path_manager().get_karaoke_tracking_path())
    print(f"Resetting {tracking_file}...")

    # Create backup of current tracking
    backup_file = f"{tracking_file}.backup"
    if os.path.exists(tracking_file):
        shutil.copy2(tracking_file, backup_file)
        print(f"Created backup: {backup_file}")

    # Reset to empty state
    empty_tracking = {
        "playlists": {},
        "songs": {}
    }
    with open(tracking_file, 'w', encoding='utf-8') as f:
        json.dump(empty_tracking, f, indent=2, ensure_ascii=False)
    print(f"✅ Reset {tracking_file} to empty state")

def delete_downloaded_files(downloads_dir: str = "downloads") -> None:
    """Delete all downloaded files and folders."""
    if not os.path.exists(downloads_dir):
        print(f"Downloads directory {downloads_dir} does not exist.")
        return
    print(f"Deleting all files in {downloads_dir}...")
    try:
        shutil.rmtree(downloads_dir)
        print(f"✅ Deleted {downloads_dir} directory")
    except Exception as e:
        print(f"❌ Error deleting {downloads_dir}: {e}")

def show_download_stats(tracking_file: str = None) -> None:
    """Show statistics about current downloads."""
    # Fall back to the configured tracking path when none is given
    if tracking_file is None:
        tracking_file = str(get_data_path_manager().get_karaoke_tracking_path())
    if not os.path.exists(tracking_file):
        print("No tracking file found.")
        return
    with open(tracking_file, 'r', encoding='utf-8') as f:
        tracking = json.load(f)
    songs = tracking.get("songs", {})
    total_songs = len(songs)
    if total_songs == 0:
        print("No songs in tracking file.")
        return

    # Count by status
    status_counts = {}
    channel_counts = {}
    for song_id, song_data in songs.items():
        status = song_data.get("status", "UNKNOWN")
        channel = song_data.get("channel_name", "UNKNOWN")
        status_counts[status] = status_counts.get(status, 0) + 1
        channel_counts[channel] = channel_counts.get(channel, 0) + 1

    print(f"\n📊 Current Download Statistics:")
    print(f"Total songs: {total_songs}")
    print(f"\nBy Status:")
    for status, count in status_counts.items():
        print(f" {status}: {count}")
    print(f"\nBy Channel:")
    for channel, count in channel_counts.items():
        print(f" {channel}: {count}")

def main():
    """Main function to handle reset and re-download process."""
    print("🔄 Karaoke Download Reset and Re-download Tool")
    print("=" * 50)

    # Show current stats
    print("\nCurrent download statistics:")
    show_download_stats()

    # Ask user what they want to do
    print("\nOptions:")
    print("1. Reset tracking only (keep files)")
    print("2. Reset tracking and delete all downloaded files")
    print("3. Show current stats only")
    print("4. Exit")
    choice = input("\nEnter your choice (1-4): ").strip()

    if choice == "1":
        print("\n🔄 Resetting tracking only...")
        reset_karaoke_tracking()
        print("\n✅ Tracking reset complete!")
        print("You can now re-download files with the new channel parser system.")
        print("\nTo re-download, run:")
        print("python download_karaoke.py --file data/channels.txt --limit 50")
    elif choice == "2":
        print("\n🔄 Resetting tracking and deleting files...")
        confirm = input("Are you sure you want to delete ALL downloaded files? (yes/no): ").strip().lower()
        if confirm == "yes":
            reset_karaoke_tracking()
            delete_downloaded_files()
            print("\n✅ Reset complete! All tracking and files have been removed.")
            print("You can now re-download files with the new channel parser system.")
            print("\nTo re-download, run:")
            print("python download_karaoke.py --file data/channels.txt --limit 50")
        else:
            print("Operation cancelled.")
    elif choice == "3":
        print("\n📊 Current statistics:")
        show_download_stats()
    elif choice == "4":
        print("Exiting...")
    else:
        print("Invalid choice. Please enter 1, 2, 3, or 4.")


if __name__ == "__main__":
    main()
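
The reset script above always writes its backup to `<tracking_file>.backup`, so running the reset a second time replaces the first backup with the already-emptied file. A minimal sketch of one way around that, assuming a timestamp in the backup name is acceptable: the function name `reset_with_timestamped_backup`, the `stamp` parameter, and the suffix format are all hypothetical and not part of the script.

```python
import json
import shutil
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional

EMPTY_TRACKING = {"playlists": {}, "songs": {}}


def reset_with_timestamped_backup(tracking_file: Path, stamp: Optional[str] = None) -> Optional[Path]:
    """Back up tracking_file under a unique name, then overwrite it with an empty structure.

    Returns the backup path, or None if there was no file to back up.
    """
    backup = None
    if tracking_file.exists():
        # A per-run stamp keeps earlier backups from being overwritten.
        stamp = stamp or datetime.now().strftime("%Y%m%d-%H%M%S")
        backup = tracking_file.with_name(f"{tracking_file.name}.{stamp}.backup")
        shutil.copy2(tracking_file, backup)
    with open(tracking_file, "w", encoding="utf-8") as f:
        json.dump(EMPTY_TRACKING, f, indent=2, ensure_ascii=False)
    return backup


if __name__ == "__main__":
    # Demonstrate on a throwaway file instead of the real tracking data.
    with tempfile.TemporaryDirectory() as tmp:
        tracking = Path(tmp) / "karaoke_tracking.json"
        tracking.write_text(json.dumps({"playlists": {}, "songs": {"abc": {"status": "DOWNLOADED"}}}), encoding="utf-8")
        backup = reset_with_timestamped_backup(tracking)
        print("backup:", backup.name)
        print("tracking now:", json.loads(tracking.read_text(encoding="utf-8")))
```

Two resets within the same second would still share a stamp; passing an explicit `stamp` (or adding a counter) would cover that case.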