Compare commits

...

31 Commits

Author SHA1 Message Date
1b6ac6454b Signed-off-by: Matt Bruce <mbrucedogs@gmail.com> 2025-08-11 09:01:31 -05:00
e34c43a8f4 Signed-off-by: Matt Bruce <mbrucedogs@gmail.com> 2025-08-11 09:00:46 -05:00
6a796d8571 Signed-off-by: Matt Bruce <mbrucedogs@gmail.com> 2025-08-10 10:28:29 -05:00
b0eb76930a Merge branch 'develop' of ssh://git@192.168.1.128:220/mbrucedogs/KaraokeVideoDownloader.git into develop 2025-08-05 16:31:03 -05:00
157f3a171b Signed-off-by: mbrucedogs <mbrucedogs@gmail.com> 2025-08-05 16:30:20 -05:00
eb3642d652 mac support Signed-off-by: Matt Bruce <mbrucedogs@gmail.com> 2025-08-05 16:11:29 -05:00
a82c9741a5 Signed-off-by: mbrucedogs <mbrucedogs@gmail.com> 2025-08-05 15:38:39 -05:00
50b402ddec Signed-off-by: mbrucedogs <mbrucedogs@gmail.com> 2025-07-29 09:07:31 -05:00
9f0787d00a Signed-off-by: mbrucedogs <mbrucedogs@gmail.com> 2025-07-29 08:56:25 -05:00
409e66780c Signed-off-by: mbrucedogs <mbrucedogs@gmail.com> 2025-07-29 08:49:43 -05:00
42e7a6a09c Signed-off-by: mbrucedogs <mbrucedogs@gmail.com> 2025-07-29 08:45:12 -05:00
ec95b24a69 Signed-off-by: mbrucedogs <mbrucedogs@gmail.com> 2025-07-28 15:44:46 -05:00
21f8348419 Signed-off-by: mbrucedogs <mbrucedogs@gmail.com> 2025-07-28 14:23:19 -05:00
d18ac54476 Signed-off-by: mbrucedogs <mbrucedogs@gmail.com> 2025-07-28 14:09:07 -05:00
c48c1d3696 Signed-off-by: mbrucedogs <mbrucedogs@gmail.com> 2025-07-28 13:47:36 -05:00
273a748a1a Signed-off-by: mbrucedogs <mbrucedogs@gmail.com> 2025-07-28 12:02:50 -05:00
5f3b00a39a Signed-off-by: mbrucedogs <mbrucedogs@gmail.com> 2025-07-28 12:02:42 -05:00
24a6a37efd Signed-off-by: mbrucedogs <mbrucedogs@gmail.com> 2025-07-28 09:30:09 -05:00
c864af7794 Signed-off-by: mbrucedogs <mbrucedogs@gmail.com> 2025-07-28 08:09:47 -05:00
613b64601a Signed-off-by: mbrucedogs <mbrucedogs@gmail.com> 2025-07-28 07:51:40 -05:00
981f92ce95 Signed-off-by: mbrucedogs <mbrucedogs@gmail.com> 2025-07-28 05:42:07 -05:00
8dbc2fb8fd Signed-off-by: mbrucedogs <mbrucedogs@gmail.com> 2025-07-27 22:52:17 -05:00
81b3d2d88c Signed-off-by: mbrucedogs <mbrucedogs@gmail.com> 2025-07-27 22:49:35 -05:00
95a49bf39e Signed-off-by: mbrucedogs <mbrucedogs@gmail.com> 2025-07-27 22:02:18 -05:00
c8f02ac3b4 Signed-off-by: mbrucedogs <mbrucedogs@gmail.com> 2025-07-27 21:58:52 -05:00
f914d54067 Signed-off-by: mbrucedogs <mbrucedogs@gmail.com> 2025-07-27 20:33:26 -05:00
ea07188739 Signed-off-by: mbrucedogs <mbrucedogs@gmail.com> 2025-07-27 19:47:05 -05:00
2c63bf809b Signed-off-by: mbrucedogs <mbrucedogs@gmail.com> 2025-07-27 12:38:35 -05:00
7090fad1fd Signed-off-by: mbrucedogs <mbrucedogs@gmail.com> 2025-07-27 12:01:39 -05:00
c78be7a7ad Signed-off-by: mbrucedogs <mbrucedogs@gmail.com> 2025-07-27 11:40:57 -05:00
e6b2c9443c Signed-off-by: mbrucedogs <mbrucedogs@gmail.com> 2025-07-27 10:56:19 -05:00
59 changed files with 444314 additions and 179720 deletions

.gitignore vendored

@@ -14,9 +14,6 @@ logs/
*.log
# Tracking and cache files
karaoke_tracking.json
karaoke_tracking.json.backup
songlist_tracking.json
*.cache
# yt-dlp temporary files

PRD.md

@@ -1,8 +1,8 @@
# 🎤 Karaoke Video Downloader PRD (v3.3)
# 🎤 Karaoke Video Downloader PRD (v3.4.4)
## ✅ Overview
A Python-based Windows CLI tool to download karaoke videos from YouTube channels/playlists using `yt-dlp.exe`, with advanced tracking, songlist prioritization, and flexible configuration. The codebase has been comprehensively refactored into a modular architecture with centralized utilities for improved maintainability, error handling, and code reuse.
A Python-based cross-platform CLI tool to download karaoke videos from YouTube channels/playlists using `yt-dlp`, with advanced tracking, songlist prioritization, and flexible configuration. Supports Windows and macOS with automatic platform detection. The codebase has been comprehensively refactored into a modular architecture with centralized utilities for improved maintainability, error handling, and code reuse.
---
@@ -63,9 +63,9 @@ The codebase has been refactored into focused modules with centralized utilities
---
## ⚙️ Platform & Stack
- **Platform:** Windows
- **Platform:** Windows, macOS
- **Interface:** Command-line (CLI)
- **Tech Stack:** Python 3.7+, yt-dlp.exe, mutagen (for ID3 tagging)
- **Tech Stack:** Python 3.7+, yt-dlp (platform-specific binary), mutagen (for ID3 tagging)
---
@@ -101,6 +101,7 @@ python download_karaoke.py --clear-cache SingKingKaraoke
- ✅ Songlist integration: prioritize and track custom songlists
- ✅ Songlist-only mode: download only songs from the songlist
- ✅ Songlist focus mode: download only songs from specific playlists by title
- ✅ Force download mode: bypass all existing file checks and re-download songs regardless of server duplicates or existing files
- ✅ Global songlist tracking to avoid duplicates across channels
- ✅ ID3 tagging for artist/title in MP4 files (mutagen)
- ✅ Real-time progress and detailed logging
@@ -122,6 +123,8 @@ python download_karaoke.py --clear-cache SingKingKaraoke
- ✅ **Centralized file operations**: Single source of truth for filename sanitization, file validation, and path operations
- ✅ **Centralized song validation**: Unified logic for checking if songs should be downloaded across all modules
- ✅ **Enhanced configuration management**: Structured configuration with dataclasses, type safety, and validation
- ✅ **Manual video collection**: Static video collection system for managing individual karaoke videos that don't belong to regular channels. Use `--manual` to download from `data/manual_videos.json`.
- ✅ **Channel-specific parsing rules**: JSON-based configuration for parsing video titles from different YouTube channels, with support for various title formats and cleanup rules.
---
@@ -149,19 +152,34 @@ KaroakeVideoDownloader/
│ ├── check_resolution.py # Resolution checker utility
│ ├── resolution_cli.py # Resolution config CLI
│ └── tracking_cli.py # Tracking management CLI
├── data/ # All config, tracking, cache, and songlist files
│ ├── config.json
├── config/ # Configuration files
│ └── config.json # Main configuration file
├── data/ # All tracking, cache, and songlist files
│ ├── karaoke_tracking.json
│ ├── songlist_tracking.json
│ ├── channel_cache.json
│ ├── channels.txt
│ ├── channels.json # Channel configuration with parsing rules
│ ├── manual_videos.json # Manual video collection
│ └── songList.json
├── utilities/ # Utility scripts and tools
│ ├── add_manual_video.py # Manual video management
│ ├── build_cache_from_raw.py # Cache building utility
│ ├── cleanup_duplicate_files.py # File cleanup utilities
│ ├── cleanup_recent_tracking.py # Tracking cleanup utilities
│ ├── deduplicate_songlist_tracking.py # Data deduplication
│ ├── fix_artist_name_format.py # Data cleanup utilities
│ ├── fix_artist_name_format_simple.py
│ ├── fix_code_quality.py # Development tools
│ ├── reset_and_redownload.py # Maintenance utilities
│ └── songlist_report.py # Reporting utilities
├── downloads/ # All video output
│ └── [ChannelName]/ # Per-channel folders
├── logs/ # Download logs
├── downloader/yt-dlp.exe # yt-dlp binary
├── tests/ # Diagnostic and test scripts
│ └── test_installation.py
├── downloader/yt-dlp.exe # yt-dlp binary (Windows)
├── downloader/yt-dlp_macos # yt-dlp binary (macOS)
├── src/tests/ # Test scripts
│ ├── test_macos.py # macOS setup and functionality tests
│ └── test_platform.py # Platform detection tests
├── download_karaoke.py # Main entry point (thin wrapper)
├── README.md
├── PRD.md
@@ -176,6 +194,8 @@ KaroakeVideoDownloader/
- `--songlist-priority`: Prioritize songlist songs in download queue
- `--songlist-only`: Download only songs from the songlist
- `--songlist-focus <PLAYLIST_TITLE1> <PLAYLIST_TITLE2>...`: Focus on specific playlists by title (e.g., `--songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100"`)
- `--songlist-file <FILE_PATH>`: Custom songlist file path to use with --songlist-focus (default: data/songList.json)
- `--force`: **Force download from channels, bypassing all existing file checks and re-downloading if necessary**
- `--songlist-status`: Show songlist download progress
- `--limit <N>`: Limit number of downloads (enables fast mode with early exit)
- `--resolution <720p|1080p|...>`: Override resolution
@@ -188,7 +208,11 @@ KaroakeVideoDownloader/
- `--fuzzy-match`: **Enable fuzzy matching for songlist-to-video matching (uses rapidfuzz if available)**
- `--fuzzy-threshold <N>`: **Fuzzy match threshold (0-100, default 85)**
- `--parallel`: **Enable parallel downloads for improved speed**
- `--workers <N>`: **Number of parallel download workers (1-10, default: 3)**
- `--workers <N>`: **Number of parallel download workers (1-10, default: 3, only used with --parallel)**
- `--manual`: **Download from manual videos collection (data/manual_videos.json)**
- `--channel-focus <CHANNEL_NAME>`: **Download from a specific channel by name (e.g., 'SingKingKaraoke')**
- `--all-videos`: **Download all videos from channel (not just songlist matches), skipping existing files and songs in songs.json**
- `--dry-run`: **Build download plan and show what would be downloaded without actually downloading anything**
---
@@ -199,6 +223,8 @@ KaroakeVideoDownloader/
- **ID3 Tagging:** Artist/title extracted from video title and embedded in MP4 files.
- **Cleanup:** Extra files from yt-dlp (e.g., `.info.json`) are automatically removed after download.
- **Reset/Clear:** Use `--reset-channel` to reset all tracking and files for a channel (optionally including songlist songs with `--reset-songlist`). Use `--clear-cache` to clear cached video lists for a channel or all channels.
- **Channel-Specific Parsing:** Uses `data/channels.json` to define parsing rules for each YouTube channel, handling different video title formats (e.g., "Artist - Title", "Artist Title", "Title | Artist", etc.).
- **Manual Video Collection:** Static video management system using `data/manual_videos.json` for individual karaoke videos that don't belong to regular channels. Accessible via `--manual` parameter.
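The ID3 tagging behavior described above can be sketched with mutagen's MP4 interface; `build_mp4_tags` and `tag_mp4` are illustrative names for this sketch, not the project's actual helpers:

```python
# iTunes-style metadata atoms mutagen uses for MP4 files
ARTIST_ATOM = "\xa9ART"
TITLE_ATOM = "\xa9nam"

def build_mp4_tags(artist: str, title: str) -> dict:
    """Build the tag mapping; mutagen's MP4 API expects list values."""
    return {ARTIST_ATOM: [artist], TITLE_ATOM: [title]}

def tag_mp4(path: str, artist: str, title: str) -> None:
    """Embed artist/title into an MP4 file in place."""
    from mutagen.mp4 import MP4  # third-party: pip install mutagen
    video = MP4(path)
    for key, value in build_mp4_tags(artist, title).items():
        video[key] = value
    video.save()
```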
## 🔧 Refactoring Improvements (v3.3)
The codebase has been comprehensively refactored to improve maintainability and reduce code duplication. Recent improvements have enhanced reliability, performance, and code organization:
@@ -252,7 +278,7 @@ The codebase has been comprehensively refactored to improve maintainability and
### **New Parallel Download System (v3.4)**
- **Parallel downloader module:** `parallel_downloader.py` provides thread-safe concurrent download management
- **Configurable concurrency:** Use `--parallel --workers N` to enable parallel downloads with N workers (1-10)
- **Configurable concurrency:** Use `--parallel` to enable parallel downloads with 3 workers by default, or `--parallel --workers N` for custom worker count (1-10)
- **Thread-safe operations:** All tracking, caching, and progress operations are thread-safe
- **Real-time progress tracking:** Shows active downloads, completion status, and overall progress
- **Automatic retry mechanism:** Failed downloads are automatically retried with reduced concurrency
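A thread-safe worker pool of the kind described above can be sketched with `concurrent.futures`; this is a minimal stand-in for the project's `ParallelDownloader`, with hypothetical names, not its actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_parallel(download_plan, download_one, workers=3):
    """Run downloads concurrently; `download_one` handles one plan entry.

    Returns (succeeded, failed) lists. Worker count is clamped to the
    documented 1-10 range.
    """
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max(1, min(workers, 10))) as pool:
        futures = {pool.submit(download_one, item): item for item in download_plan}
        for future in as_completed(futures):
            item = futures[future]
            try:
                future.result()  # re-raises any worker exception
                succeeded.append(item)
            except Exception:
                failed.append(item)  # candidates for the retry pass
    return succeeded, failed
```

Failed entries could then be retried with a smaller pool, matching the "reduced concurrency" retry behavior noted above.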
@@ -268,8 +294,337 @@ The codebase has been comprehensively refactored to improve maintainability and
- [ ] Download scheduling and retry logic
- [ ] More granular status reporting
- [x] **Parallel downloads for improved speed** ✅ **COMPLETED**
- [x] **Enhanced fuzzy matching with improved video title parsing** ✅ **COMPLETED**
- [x] **Consolidated extract_artist_title function** ✅ **COMPLETED**
- [x] **Duplicate file prevention and filename consistency** ✅ **COMPLETED**
- [ ] Unit tests for all modules
- [ ] Integration tests for end-to-end workflows
- [ ] Plugin system for custom file operations
- [ ] Advanced configuration UI
- [ ] Real-time download progress visualization
## 🔧 Recent Bug Fixes & Improvements (v3.4.1)
### **Enhanced Fuzzy Matching (v3.4.1)**
- **Improved `extract_artist_title` function**: Enhanced to handle multiple video title formats beyond simple "Artist - Title" patterns
- **"Title Karaoke | Artist Karaoke Version" format**: Correctly parses titles like "Hold On Loosely Karaoke | 38 Special Karaoke Version"
- **"Title Artist KARAOKE" format**: Handles titles ending with "KARAOKE" and attempts to extract artist information
- **Fallback handling**: Returns empty artist and full title for unparseable formats
- **Consolidated function usage**: Removed duplicate `extract_artist_title` implementations across modules
- **Single source of truth**: All modules now import from `fuzzy_matcher.py`
- **Consistent parsing**: Eliminated inconsistencies between different parsing implementations
- **Better maintainability**: Changes to parsing logic only need to be made in one place
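The title formats listed above could be handled by parsing logic along these lines; a simplified sketch, not the full `fuzzy_matcher.py` implementation:

```python
def extract_artist_title(video_title: str) -> tuple:
    """Parse (artist, title) from a video title, covering the formats above."""
    # "Title Karaoke | Artist Karaoke Version"
    if "|" in video_title:
        left, right = (part.strip() for part in video_title.split("|", 1))
        title = left.removesuffix("Karaoke").strip()
        artist = right.removesuffix("Karaoke Version").strip()
        if artist and title:
            return artist, title
    # "Artist - Title"
    if " - " in video_title:
        artist, title = (part.strip() for part in video_title.split(" - ", 1))
        return artist, title
    # Fallback: unparseable formats keep the full title with an empty artist
    return "", video_title.strip()
```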
### **Fixed Import Conflicts**
- **Resolved import conflict in `download_planner.py`**: Updated to use the enhanced `extract_artist_title` from `fuzzy_matcher.py` instead of the simpler version from `id3_utils.py`
- **Updated `id3_utils.py`**: Now imports `extract_artist_title` from `fuzzy_matcher.py` for consistency
### **Enhanced --limit Parameter**
- **Fixed limit application**: The `--limit` parameter now correctly applies to the scanning phase, not just the download execution
- **Improved performance**: When using `--limit N`, only the first N songs are scanned against channels, significantly reducing processing time for large songlists
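Applying the limit during scanning rather than after amounts to truncating the songlist iterator before matching; a sketch with hypothetical names:

```python
from itertools import islice

def scan_with_limit(songs, matcher, limit=None):
    """Match only the first `limit` songs against channel videos.

    `matcher` returns a plan entry or None; with limit=None the full
    songlist is scanned, preserving the old behavior.
    """
    plan = []
    for song in islice(songs, limit):
        entry = matcher(song)
        if entry is not None:
            plan.append(entry)
    return plan
```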
### **Benefits of Recent Improvements**
- **Better matching accuracy**: Enhanced fuzzy matching can now handle a wider variety of video title formats commonly found on YouTube karaoke channels
- **Reduced false negatives**: Songs that previously couldn't be matched due to title format differences now have a higher chance of being found
- **Consistent behavior**: All parts of the system use the same parsing logic, eliminating edge cases where different modules would parse the same title differently
- **Improved performance**: The `--limit` parameter now works as expected, providing faster processing for targeted downloads
- **Cleaner codebase**: Eliminated duplicate code and import conflicts, making the system more maintainable
## 🔧 Recent Bug Fixes & Improvements (v3.4.2)
### **Duplicate File Prevention & Filename Consistency**
- **Enhanced file existence checking**: `check_file_exists_with_patterns()` now detects files with `(2)`, `(3)`, etc. suffixes that yt-dlp creates
- **Automatic duplicate prevention**: Download pipeline skips downloads when files already exist (including duplicates)
- **Updated yt-dlp configuration**: Set `"nooverwrites": false` to prevent yt-dlp from creating duplicate files with suffixes
- **Cleanup utility**: `utilities/cleanup_duplicate_files.py` provides interactive cleanup of existing duplicate files
- **Filename vs ID3 tag consistency**: Removed "(Karaoke Version)" suffix from ID3 tags to match filenames exactly
- **Unified parsing**: Both filename generation and ID3 tagging use the same artist/title extraction logic
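The suffix-aware existence check described above boils down to matching `name.mp4` against `name (N).mp4` variants; a sketch of that matching, with the directory wrapper assuming the helper name used above:

```python
import re

def matches_with_suffix(existing_names, filename):
    """True if `filename` or a yt-dlp duplicate like 'name (2).mp4' appears
    in `existing_names` (an iterable of file names; extension assumed)."""
    dot = filename.rfind(".")
    stem, ext = filename[:dot], filename[dot:]
    pattern = re.compile(re.escape(stem) + r"( \(\d+\))?" + re.escape(ext))
    return any(pattern.fullmatch(name) for name in existing_names)

def check_file_exists_with_patterns(directory, filename):
    """Directory-level wrapper over the name matcher (sketch)."""
    from pathlib import Path
    return matches_with_suffix((p.name for p in Path(directory).iterdir()), filename)
```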
### **Benefits of Duplicate Prevention**
- **No more duplicate files**: Eliminates `(2)`, `(3)` suffix files that waste disk space
- **Consistent metadata**: Filename and ID3 tag use identical artist/title format
- **Efficient disk usage**: Prevents unnecessary downloads of existing files
- **Clear file identification**: Consistent naming across all file operations
## 🛠️ Maintenance
### **Regular Cleanup**
- Run the cleanup utility periodically to remove any duplicate files
- Monitor downloads for any new duplicate creation (should be rare with fixes)
### **Configuration**
- Keep `"nooverwrites": false` in `config/config.json`
- This prevents yt-dlp from creating duplicate files
### **Monitoring**
- Check logs for "⏭️ Skipping download - file already exists" messages
- These indicate the duplicate prevention is working correctly
## 🔧 Recent Bug Fixes & Improvements (v3.4.3)
### **Manual Video Collection System**
- **New `--manual` parameter**: Simple access to manual video collection via `python download_karaoke.py --manual --limit 5`
- **Static video management**: `data/manual_videos.json` stores individual karaoke videos that don't belong to regular channels
- **Helper script**: `add_manual_video.py` provides easy management of manual video entries
- **Full integration**: Manual videos work with all existing features (songlist matching, fuzzy matching, parallel downloads, etc.)
- **No yt-dlp dependency**: Manual videos bypass YouTube API calls for video listing, using static data instead
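Loading the static collection could look like the sketch below; the `{"videos": [...]}` shape and field names are assumptions about the file's schema, not confirmed by the source:

```python
import json
from pathlib import Path

def parse_manual_videos(raw: str) -> list:
    """Parse the manual collection JSON into a list of video entries."""
    data = json.loads(raw)
    return data.get("videos", [])

def load_manual_videos(path="data/manual_videos.json") -> list:
    """Read and parse the static manual video collection."""
    return parse_manual_videos(Path(path).read_text(encoding="utf-8"))
```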
### **Channel-Specific Parsing Rules**
- **JSON-based configuration**: `data/channels.json` replaces `data/channels.txt` with structured channel configuration
- **Parsing rules per channel**: Each channel can define custom parsing rules for video titles
- **Multiple format support**: Handles various title formats like "Artist - Title", "Artist Title", "Title | Artist", etc.
- **Suffix cleanup**: Automatic removal of common karaoke-related suffixes
- **Multi-artist support**: Parsing for titles with multiple artists separated by specific delimiters
- **Backward compatibility**: Still supports legacy `data/channels.txt` format
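A per-channel rule entry might drive parsing like this; the `strip_suffixes`/`separator` keys are a hypothetical schema for illustration, not the actual `channels.json` format:

```python
def parse_with_rules(video_title: str, rules: dict) -> tuple:
    """Apply channel-specific rules: strip configured suffixes, then split
    artist from title on the channel's separator."""
    title = video_title
    for suffix in rules.get("strip_suffixes", []):
        if title.endswith(suffix):
            title = title[: -len(suffix)].strip()
    sep = rules.get("separator", " - ")
    if sep in title:
        artist, song = (part.strip() for part in title.split(sep, 1))
        return artist, song
    return "", title  # no separator: fall back to empty artist
```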
### **Benefits of New Features**
- **Flexible video management**: Easy addition of individual karaoke videos without creating new channels
- **Accurate parsing**: Channel-specific rules ensure correct artist/title extraction for ID3 tags and filenames
- **Consistent metadata**: Proper parsing prevents filename and ID3 tag inconsistencies
- **Easy maintenance**: Simple JSON structure for managing both channels and manual videos
- **Full feature compatibility**: Manual videos work seamlessly with existing download modes and features
## 📚 Documentation Standards
### **Documentation Location**
- **All changes, refactoring, and improvements should be documented in the PRD.md and README.md files**
- **Do NOT create separate .md files for documenting changes, refactoring, or improvements**
- **Use the existing sections in PRD.md and README.md to track all project evolution**
### **Where to Document Changes**
- **PRD.md**: Technical details, architecture changes, bug fixes, and implementation specifics
- **README.md**: User-facing features, usage instructions, and high-level improvements
- **CHANGELOG.md**: Version-specific release notes and change summaries
### **Documentation Requirements**
- **All new features must be documented in both PRD.md and README.md**
- **All refactoring efforts must be documented in the appropriate sections**
- **All bug fixes must be documented with technical details**
- **Version numbers and dates should be clearly marked**
- **Benefits and improvements should be explicitly stated**
### **Maintenance Responsibility**
- **Keep PRD.md and README.md synchronized with code changes**
- **Update documentation immediately when implementing new features**
- **Remove outdated information and consolidate related changes**
- **Ensure all CLI options and features are documented in both files**
## 🔧 Recent Bug Fixes & Improvements (v3.4.4)
### **All Videos Download Mode**
- **New `--all-videos` parameter**: Download all videos from a channel, not just songlist matches
- **Smart MP3/MP4 detection**: Automatically detects if you have MP3 versions in songs.json and downloads MP4 video versions
- **Existing file skipping**: Skips videos that already exist on the filesystem
- **Progress tracking**: Shows clear progress with "Downloading X/Y videos" format
- **Parallel processing support**: Works with `--parallel --workers N` for faster downloads
- **Channel focus integration**: Works with `--channel-focus` to target specific channels
- **Limit support**: Works with `--limit N` to control download batch size
### **Smart Songlist Integration**
- **MP4 version detection**: Checks if MP4 version already exists in songs.json before downloading
- **MP3 upgrade path**: Downloads MP4 video versions when only MP3 versions exist in songlist
- **Duplicate prevention**: Skips downloads when MP4 versions already exist
- **Efficient filtering**: Only processes videos that need to be downloaded
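The MP4-version check above can be sketched as a filter over songs.json entries; the `format` field and entry shape are assumptions for illustration:

```python
def needs_mp4_download(artist: str, title: str, server_songs: list) -> bool:
    """True when no MP4 entry for this song exists in songs.json, i.e. the
    song is new or only an MP3 version is tracked."""
    key = (artist.lower(), title.lower())
    formats = {
        song.get("format", "").lower()
        for song in server_songs
        if (song.get("artist", "").lower(), song.get("title", "").lower()) == key
    }
    return "mp4" not in formats
```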
### **Benefits of All Videos Mode**
- **Complete channel downloads**: Download entire channels without songlist restrictions
- **Automatic format upgrading**: Upgrade MP3 collections to MP4 video versions
- **Efficient processing**: Only downloads videos that don't already exist
- **Flexible control**: Use with limits, parallel processing, and channel targeting
- **Clear progress feedback**: Real-time progress tracking for large downloads
## 🔧 Recent Bug Fixes & Improvements (v3.4.5)
### **Unified Download Workflow Architecture**
- **Unified execution pipeline**: All download modes now use the same execution workflow, eliminating inconsistencies and broken pipelines
- **Consistent behavior**: All modes (--channel-focus, --all-videos, --songlist-only, --latest-per-channel) use identical download execution, progress tracking, and error handling
- **Centralized download logic**: Single `execute_unified_download_workflow()` method handles all download execution
- **Automatic parallel support**: All download modes automatically support `--parallel --workers N` without additional implementation
- **Unified cache management**: Consistent progress tracking and resume functionality across all modes
### **Architecture Pattern for New Download Modes**
When adding new download modes in the future, follow this pattern to ensure consistency:
#### **1. Download Plan Building (Mode-Specific)**
Each download mode should build a download plan (list of videos to download) with this structure:
```python
download_plan = [
    {
        "video_id": "video_id",
        "artist": "artist_name",
        "title": "song_title",
        "filename": "sanitized_filename.mp4",
        "channel_name": "channel_name",
        "video_title": "original_video_title",
        "force_download": False
    }
]
```
#### **2. Unified Execution (Shared)**
All modes should use the unified execution workflow:
```python
downloaded_count, success = self.execute_unified_download_workflow(
    download_plan=download_plan,
    cache_file=cache_file,   # Optional, for progress tracking
    limit=limit,             # Optional, for limiting downloads
    show_progress=True,      # Optional, for progress display
)
```
#### **3. Execution Method Selection (Automatic)**
The unified workflow automatically chooses execution method based on settings:
- **Sequential**: Uses `DownloadPipeline` for single-threaded downloads
- **Parallel**: Uses `ParallelDownloader` when `--parallel` is enabled
#### **4. Required Implementation Pattern**
```python
def download_new_mode(self, ...):
    """New download mode implementation."""
    # 1. Build download plan (mode-specific logic)
    download_plan = []
    for video in videos_to_download:
        download_plan.append({
            "video_id": video["id"],
            "artist": artist,
            "title": title,
            "filename": filename,
            "channel_name": channel_name,
            "video_title": video["title"],
            "force_download": force_download
        })

    # 2. Create cache file (optional, for progress tracking)
    cache_file = get_download_plan_cache_file("new_mode", **plan_kwargs)
    save_plan_cache(cache_file, download_plan, [])

    # 3. Use unified execution workflow
    downloaded_count, success = self.execute_unified_download_workflow(
        download_plan=download_plan,
        cache_file=cache_file,
        limit=limit,
        show_progress=True,
    )
    return success
```
### **Benefits of Unified Architecture**
- **Consistency**: All modes behave identically for execution, progress tracking, and error handling
- **Maintainability**: Changes to download execution only need to be made in one place
- **Reliability**: Eliminates broken pipelines and inconsistent behavior between modes
- **Extensibility**: New modes automatically get all existing features (parallel downloads, progress tracking, etc.)
- **Testing**: Easier to test since all modes use the same execution logic
### **What Was Fixed**
- **Broken Pipeline**: Previously, different modes used different execution paths, leading to inconsistencies
- **Missing Method**: Added missing `download_latest_per_channel()` method that was referenced in CLI but not implemented
- **Code Duplication**: Eliminated duplicate download execution logic across different modes
- **Inconsistent Behavior**: All modes now have identical progress tracking, error handling, and cache management
### **Future Development Guidelines**
1. **NEVER implement custom download execution logic** in new download modes
2. **ALWAYS use `execute_unified_download_workflow()`** for download execution
3. **Focus on download plan building** - that's where mode-specific logic belongs
4. **Use the standard download plan structure** for consistency
5. **Implement cache file handling** for progress tracking and resume functionality
6. **Test with both sequential and parallel modes** to ensure compatibility
---
## 🚀 Future Enhancements
- [ ] Web UI for easier management
- [ ] More advanced song matching (multi-language)
- [ ] Download scheduling and retry logic
- [ ] More granular status reporting
- [x] **Parallel downloads for improved speed** ✅ **COMPLETED**
- [x] **Enhanced fuzzy matching with improved video title parsing** ✅ **COMPLETED**
- [x] **Consolidated extract_artist_title function** ✅ **COMPLETED**
- [x] **Duplicate file prevention and filename consistency** ✅ **COMPLETED**
- [ ] Unit tests for all modules
- [ ] Integration tests for end-to-end workflows
- [ ] Plugin system for custom file operations
- [ ] Advanced configuration UI
- [ ] Real-time download progress visualization
## 🔧 Recent Bug Fixes & Improvements (v3.4.4)
### **macOS Support with Automatic Platform Detection**
- **Cross-platform compatibility**: Added support for macOS alongside Windows
- **Automatic platform detection**: Detects operating system and selects appropriate yt-dlp binary
- **Flexible yt-dlp integration**: Supports both binary files (`yt-dlp_macos`) and pip installation (`python3 -m yt_dlp`)
- **Setup automation**: `setup_macos.py` script for easy macOS setup with FFmpeg and yt-dlp installation
- **Command parsing**: Intelligent parsing of yt-dlp commands (file paths vs. module commands)
- **Enhanced validation**: Platform-specific error messages and validation in CLI
- **Backward compatibility**: Maintains full compatibility with existing Windows installations
### **Benefits of macOS Support**
- **Native macOS experience**: No need for Windows compatibility layers or virtualization
- **Automatic setup**: Simple setup script handles all dependencies
- **Flexible installation**: Choose between binary download or pip installation
- **Consistent functionality**: All features work identically on both platforms
- **Easy maintenance**: Platform detection handles configuration automatically
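The detection described above could be sketched as follows, using the binary paths from the project layout; the function name and fallback order are illustrative assumptions:

```python
import platform
import sys
from pathlib import Path

def resolve_ytdlp_command() -> list:
    """Pick how to invoke yt-dlp: the bundled per-platform binary when
    present, otherwise the pip-installed module (python -m yt_dlp)."""
    if platform.system() == "Windows":
        binary = Path("downloader/yt-dlp.exe")
    else:
        binary = Path("downloader/yt-dlp_macos")
    if binary.exists():
        return [str(binary)]
    # Fall back to a pip installation of yt-dlp
    return [sys.executable, "-m", "yt_dlp"]
```

The returned list can be passed directly to `subprocess.run` with the download arguments appended.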
### **Setup Instructions**
```bash
# Automatic setup (recommended)
python3 setup_macos.py
# Test installation
python3 src/tests/test_macos.py
# Manual setup options
# 1. Install yt-dlp via pip: pip3 install yt-dlp
# 2. Download binary: curl -L -o downloader/yt-dlp_macos https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp_macos
# 3. Install FFmpeg: brew install ffmpeg
```
## 🔧 Recent Bug Fixes & Improvements (v3.4.7)
### **Configurable Data Directory Path**
- **Centralized Data Path Management**: New `data_path_manager.py` module provides unified data directory path management
- **Configurable Location**: Data directory path can be set in `config/config.json` under `folder_structure.data_dir`
- **Backward Compatibility**: Defaults to "data" directory if not configured
- **Cross-Project Integration**: Enables the karaoke downloader to be used as a component in other projects with different data directory structures
- **Updated All Modules**: All modules now use the data path manager instead of hardcoded "data/" paths
- **Utility Functions**: Provides `get_data_path()`, `get_data_dir()`, and `get_data_path_manager()` functions for easy access
- **Fixed Circular Dependency**: Moved `config.json` from `data/` to the `config/` directory to resolve the chicken-and-egg problem
### **Benefits of Configurable Data Directory**
- **Flexible Deployment**: Can be integrated into other projects with different directory structures
- **Centralized Configuration**: Single point of configuration for all data file paths
- **Maintainable Code**: Eliminates hardcoded paths throughout the codebase
- **Easy Testing**: Can use temporary directories for testing without affecting production data
- **Future-Proof**: Makes it easier to change data directory structure in the future
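The resolution logic can be sketched as below; signatures are simplified to take the parsed config dict rather than reading `config/config.json` themselves, so the names only approximate the real `data_path_manager` API:

```python
import json
from pathlib import Path

def data_dir_from_config(config: dict) -> Path:
    """Resolve the data directory from folder_structure.data_dir,
    defaulting to 'data' when not configured."""
    return Path(config.get("folder_structure", {}).get("data_dir", "data"))

def get_data_path(filename: str, config: dict) -> Path:
    """Build the full path to a data file under the configured directory."""
    return data_dir_from_config(config) / filename
```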
### **Circular Dependency Solution**
The original implementation had a circular dependency problem:
- **Problem**: `config.json` was located in the `data/` directory
- **Issue**: To read the config file, we needed to know where the data directory is
- **Conflict**: But the data directory location is specified in the config file
- **Solution**: Moved `config.json` to the `config/` directory as a fixed location
- **Result**: Config file is always accessible in a dedicated config directory, and data directory can be configured within it
- **Backward Compatibility**: System still works with config files in custom data directories when explicitly specified
## 🔧 Recent Bug Fixes & Improvements (v3.4.6)
### **Dry Run Mode**
- **New `--dry-run` parameter**: Build download plan and show what would be downloaded without actually downloading anything
- **Plan preview**: Shows total videos in plan and preview of first 5 videos
- **Safe testing**: Test download configurations without consuming bandwidth or disk space
- **All mode support**: Works with all download modes (--channel-focus, --all-videos, --songlist-only, --latest-per-channel)
- **Progress simulation**: Shows what the download process would look like without executing it
### **Benefits of Dry Run Mode**
- **Safe testing**: Test complex download configurations without downloading anything
- **Plan validation**: Verify that the download plan contains the expected videos
- **Configuration debugging**: Troubleshoot download settings before committing to downloads
- **Resource conservation**: Save bandwidth and disk space during testing
- **User education**: Help users understand what the tool will do before running it
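The plan preview described above amounts to formatting the first few entries; a sketch whose wording is illustrative, not the tool's exact console output:

```python
def preview_download_plan(download_plan: list, preview: int = 5) -> list:
    """Render the --dry-run summary: total count plus the first `preview`
    entries. Returns lines so the caller decides how to print them."""
    lines = [f"Dry run: {len(download_plan)} video(s) in plan"]
    for entry in download_plan[:preview]:
        lines.append(f"  would download: {entry['artist']} - {entry['title']}")
    if len(download_plan) > preview:
        lines.append(f"  ... and {len(download_plan) - preview} more")
    return lines
```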
### **Example Usage**
```bash
# Test songlist download plan
python download_karaoke.py --songlist-only --limit 5 --dry-run
# Test channel download plan
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --limit 10 --dry-run
# Test with fuzzy matching
python download_karaoke.py --songlist-only --fuzzy-match --limit 3 --dry-run
```
### **Future Development Guidelines**

README.md

@@ -1,6 +1,6 @@
# 🎤 Karaoke Video Downloader
A Python-based Windows CLI tool to download karaoke videos from YouTube channels/playlists using `yt-dlp.exe`, with advanced tracking, songlist prioritization, and flexible configuration.
A Python-based cross-platform CLI tool to download karaoke videos from YouTube channels/playlists using `yt-dlp`, with advanced tracking, songlist prioritization, and flexible configuration. Supports Windows and macOS with automatic platform detection.
## ✨ Features
- 🎵 **Channel & Playlist Downloads**: Download all videos from a YouTube channel or playlist
@@ -13,7 +13,7 @@ A Python-based Windows CLI tool to download karaoke videos from YouTube channels
- 📈 **Real-Time Progress**: Detailed console and log output
- 🧹 **Reset/Clear Channel**: Reset all tracking and files for a channel, or clear channel cache via CLI
- 🗂️ **Latest-per-channel download**: Download the latest N videos from each channel in a single batch, with server deduplication, fuzzy matching support, per-channel download plan, robust resume, and unique plan cache. Use --latest-per-channel and --limit N.
- 🧩 **Fuzzy Matching**: Optionally use fuzzy string matching for songlist-to-video matching (with --fuzzy-match, requires rapidfuzz for best results)
- 🧩 **Enhanced Fuzzy Matching**: Advanced fuzzy string matching for songlist-to-video matching with improved video title parsing (handles multiple title formats like "Title Karaoke | Artist Karaoke Version")
- ⚡ **Fast Mode with Early Exit**: When a limit is set, scans channels and songs in order, downloads immediately when a match is found, and stops as soon as the limit of successful downloads is reached
- 🔄 **Deduplication Across Channels**: Ensures the same song is not downloaded from multiple channels, even if it appears in more than one channel's video list
- 📋 **Default Channel File**: Automatically uses data/channels.txt as the default channel list for songlist modes (no need to specify --file every time)
@@ -21,10 +21,20 @@ A Python-based Windows CLI tool to download karaoke videos from YouTube channels
- ⚡ **Optimized Scanning**: High-performance channel scanning with O(n×m) complexity, pre-processed lookups, and early termination for faster matching
- 🏷️ **Server Duplicates Tracking**: Automatically checks against local songs.json file and marks duplicates for future skipping, preventing re-downloads of songs already on the server
- ⚡ **Parallel Downloads**: Enable concurrent downloads with `--parallel --workers N` for significantly faster batch downloads (3-5x speedup)
- 📊 **Unmatched Songs Reports**: Generate detailed reports of songs that couldn't be found in any channel with `--generate-unmatched-report`
- 🛡️ **Duplicate File Prevention**: Automatically detects and prevents duplicate files with `(2)`, `(3)` suffixes, with cleanup utility for existing duplicates
- 🏷️ **Consistent Metadata**: Filename and ID3 tag use identical artist/title format for clear file identification
- 🍎 **macOS Support**: Automatic platform detection and setup with native macOS binaries and FFmpeg integration
## 🏗️ Architecture
The codebase has been comprehensively refactored into a modular architecture with centralized utilities for improved maintainability, error handling, and code reuse:
### **Configurable Data Directory (v3.4.7)**
- **Centralized Data Path Management**: `data_path_manager.py` provides unified data directory path management
- **Configurable Location**: Data directory path can be set in `config/config.json` under `folder_structure.data_dir`
- **Backward Compatibility**: Defaults to "data" directory if not configured
- **Cross-Project Integration**: Enables the karaoke downloader to be used as a component in other projects with different data directory structures
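The data-directory resolution described above can be sketched as follows. This is a minimal illustration, assuming the config schema from the bullets (`folder_structure.data_dir` in `config/config.json`); the function name is hypothetical, not the actual `data_path_manager.py` API:

```python
import json
from pathlib import Path

def resolve_data_dir(config_path="config/config.json"):
    """Resolve the data directory from config, defaulting to 'data'."""
    try:
        with open(config_path, "r", encoding="utf-8") as f:
            config = json.load(f)
        return Path(config.get("folder_structure", {}).get("data_dir", "data"))
    except (FileNotFoundError, json.JSONDecodeError):
        # Backward compatibility: fall back to the historical default
        return Path("data")
```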
### Core Modules:
- **`downloader.py`**: Main orchestrator and CLI interface
- **`video_downloader.py`**: Core video download execution and orchestration
@@ -46,47 +56,192 @@ The codebase has been comprehensively refactored into a modular architecture wit
- **`tracking_cli.py`**: Tracking management CLI
### New Utility Modules (v3.3):
- **`parallel_downloader.py`**: Parallel download management with thread-safe operations
- `ParallelDownloader` class: Manages concurrent downloads with configurable workers
- `DownloadTask` and `DownloadResult` dataclasses: Structured task and result management
- Thread-safe progress tracking and error handling
- Automatic retry mechanism for failed downloads
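A minimal sketch of how such a parallel downloader can be structured with `concurrent.futures`. The class and dataclass names mirror the bullets above, but the fields and internals here are assumptions, not the real `parallel_downloader.py` implementation (which also handles retries and richer progress reporting):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass
from threading import Lock

@dataclass
class DownloadTask:
    video_id: str
    filename: str

@dataclass
class DownloadResult:
    task: DownloadTask
    success: bool
    error: str = ""

class ParallelDownloader:
    def __init__(self, download_fn, workers=3):
        self.download_fn = download_fn
        self.workers = max(1, min(workers, 10))  # clamp to the documented 1-10 range
        self._lock = Lock()
        self.completed = 0

    def run(self, tasks):
        results = []
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            futures = {pool.submit(self._safe_download, t): t for t in tasks}
            for fut in as_completed(futures):
                results.append(fut.result())
        return results

    def _safe_download(self, task):
        try:
            self.download_fn(task)
            with self._lock:  # thread-safe progress tracking
                self.completed += 1
            return DownloadResult(task, True)
        except Exception as exc:
            return DownloadResult(task, False, str(exc))
```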
- **`file_utils.py`**: Centralized file operations, filename sanitization, and file validation
- `sanitize_filename()`: Create safe filenames from artist/title
- `generate_possible_filenames()`: Generate filename patterns for different modes
- `check_file_exists_with_patterns()`: Check for existing files using multiple patterns
- `is_valid_mp4_file()`: Validate MP4 files with header checking
- `cleanup_temp_files()`: Remove temporary yt-dlp files
- `ensure_directory_exists()`: Safe directory creation
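Two of the helpers listed above can be sketched like this; the bodies are illustrative assumptions (the real `file_utils.py` covers more characters and patterns):

```python
import re
from pathlib import Path

def sanitize_filename(artist, title):
    """Build a filesystem-safe 'Artist - Title' name (sketch)."""
    raw = f"{artist} - {title}"
    return re.sub(r'[<>:"/\\|?*]', "", raw).strip()

def check_file_exists_with_patterns(folder, base_name):
    """True if base_name.mp4 or a '(2)'/'(3)'-suffixed duplicate already exists."""
    folder = Path(folder)
    if (folder / f"{base_name}.mp4").exists():
        return True
    return any(folder.glob(f"{base_name} ([0-9]).mp4"))
```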
- **`song_validator.py`**: Centralized song validation logic for checking if songs should be downloaded
- **`song_validator.py`**: Centralized song validation logic
- `SongValidator` class: Unified logic for checking if songs should be downloaded
- `should_skip_song()`: Comprehensive validation with multiple criteria
- `mark_song_failed()`: Consistent failure tracking
- `handle_download_failure()`: Standardized error handling
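The skip logic above amounts to checking a song against several tracking sets, with `--force` overriding everything. A sketch under assumed data structures (the real `SongValidator` works against the tracking files, not plain sets):

```python
def should_skip_song(song, downloaded_ids, server_duplicates, failed_ids, force=False):
    """Unified skip decision (sketch; inputs are hypothetical structures)."""
    if force:
        return False  # --force bypasses every existing-file and duplicate check
    vid = song["video_id"]
    if vid in downloaded_ids:
        return True
    if vid in server_duplicates:  # already on the server per songs.json
        return True
    return vid in failed_ids
```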
### New Utility Modules (v3.4.7):
- **`data_path_manager.py`**: Centralized data directory path management and file path resolution
- **Enhanced `config_manager.py`**: Robust configuration management with dataclasses
- `ConfigManager` class: Type-safe configuration loading and caching
- `DownloadSettings`, `FolderStructure`, `LoggingConfig` dataclasses
- Configuration validation and merging with defaults
- Dynamic resolution updates
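Dataclass-based loading with defaults merging can be sketched as below; the field names are assumptions for illustration, not the actual `DownloadSettings` schema:

```python
import json
from dataclasses import dataclass, fields

@dataclass
class DownloadSettings:
    resolution: str = "1080p"
    format: str = "mp4"

def load_settings(path):
    """Merge on-disk JSON over dataclass defaults, ignoring unknown keys."""
    try:
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
    except FileNotFoundError:
        data = {}
    known = {f.name for f in fields(DownloadSettings)}
    return DownloadSettings(**{k: v for k, v in data.items() if k in known})
```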
### **Unified Download Workflow (v3.4.5)**
- **`execute_unified_download_workflow()`**: Centralized download execution that all modes use
- **`_execute_sequential_downloads()`**: Sequential download execution using DownloadPipeline
- **`_execute_parallel_downloads()`**: Parallel download execution using ParallelDownloader
### Benefits:
### **Benefits of Enhanced Modular Architecture:**
- **Single Responsibility**: Each module has a focused purpose
- **Centralized Utilities**: Common operations (file operations, song validation, yt-dlp commands, error handling) are centralized
- **Reduced Duplication**: Eliminated ~150 lines of code duplication across modules
- **Testability**: Individual components can be tested separately
- **Maintainability**: Easier to find and fix issues
- **Reusability**: Components can be used independently
- **Robustness**: Better error handling and interruption recovery
- **Consistency**: Standardized error messages and processing pipelines
- **Maintainability**: Changes isolated to specific modules
- **Testability**: Modular components can be tested independently
- **Type Safety**: Comprehensive type hints across all new modules
- **Unified Execution**: All download modes use the same execution pipeline for consistency
## 🔧 Development Guidelines
### **Adding New Download Modes**
When adding new download modes, follow the unified workflow pattern to ensure consistency:
#### **1. Build Download Plan (Mode-Specific)**
```python
def download_new_mode(self, ...):
    # Build download plan with standard structure
    download_plan = []
    for video in videos_to_download:
        download_plan.append({
            "video_id": video["id"],
            "artist": artist,
            "title": title,
            "filename": filename,
            "channel_name": channel_name,
            "video_title": video["title"],
            "force_download": force_download
        })
    # Use unified execution workflow
    downloaded_count, success = self.execute_unified_download_workflow(
        download_plan=download_plan,
        cache_file=cache_file,
        limit=limit,
        show_progress=True,
    )
    return success
```
#### **2. Key Principles**
- **NEVER implement custom download execution logic** - always use `execute_unified_download_workflow()`
- **Focus on download plan building** - that's where mode-specific logic belongs
- **Use the standard download plan structure** for consistency
- **Implement cache file handling** for progress tracking and resume functionality
- **Test with both sequential and parallel modes** to ensure compatibility
#### **3. Benefits of Unified Architecture**
- **Consistency**: All modes behave identically for execution, progress tracking, and error handling
- **Automatic Features**: New modes automatically get parallel downloads, progress tracking, and cache management
- **Maintainability**: Changes to download execution only need to be made in one place
- **Reliability**: Eliminates broken pipelines and inconsistent behavior between modes
## 🔧 Recent Improvements (v3.4.1)
### **Enhanced Fuzzy Matching**
- **Improved title parsing**: Enhanced `extract_artist_title` function to handle multiple video title formats
- **Better matching accuracy**: Can now parse titles like "Hold On Loosely Karaoke | 38 Special Karaoke Version"
- **Consistent parsing**: All modules now use the same parsing logic from `fuzzy_matcher.py`
- **Reduced false negatives**: Songs that previously couldn't be matched due to title format differences now have a higher chance of being found
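The parsing improvements above can be illustrated with a simplified version. The regexes and the difflib fallback here are a sketch of the idea only; the actual logic lives in `fuzzy_matcher.py` and prefers rapidfuzz when installed:

```python
import re
from difflib import SequenceMatcher

def extract_artist_title(video_title):
    """Parse common karaoke title formats (sketch of the fuzzy_matcher.py idea)."""
    # Format: "Title Karaoke | Artist Karaoke Version"
    m = re.match(r"^(?P<title>.+?)\s+Karaoke\s*\|\s*(?P<artist>.+?)\s+Karaoke\s+Version\s*$",
                 video_title, re.IGNORECASE)
    if m:
        return m.group("artist").strip(), m.group("title").strip()
    # Format: "Artist - Title (Karaoke Version)"
    m = re.match(r"^(?P<artist>.+?)\s*-\s*(?P<title>.+?)\s*\(Karaoke.*\)\s*$", video_title)
    if m:
        return m.group("artist").strip(), m.group("title").strip()
    return None, None

def fuzzy_score(a, b):
    """difflib fallback when rapidfuzz is unavailable, scaled to 0-100."""
    return int(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)
```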
### **Fixed Import Conflicts**
- **Resolved import conflicts**: Updated modules to use the enhanced `extract_artist_title` from `fuzzy_matcher.py`
- **Consistent behavior**: All parts of the system use the same parsing logic
- **Cleaner codebase**: Eliminated duplicate code and import conflicts
### **Fixed --limit Parameter**
- **Correct limit application**: The `--limit` parameter now properly limits the scanning phase, not just downloads
- **Improved performance**: When using `--limit N`, only the first N songs are scanned, significantly reducing processing time
- **Accurate logging**: Logging messages now show the correct counts for songs that will actually be processed when using `--limit`
### **Code Quality Improvements**
- **Eliminated duplicate functions**: Removed duplicate `extract_artist_title` implementations
- **Fixed import conflicts**: Resolved inconsistencies between different parsing implementations
- **Single source of truth**: All title parsing logic is now centralized in `fuzzy_matcher.py`
## 🔧 Recent Improvements (v3.4.5)
### **Unified Download Workflow Architecture**
- **Unified execution pipeline**: All download modes now use the same execution workflow, eliminating inconsistencies and broken pipelines
- **Consistent behavior**: All modes (--channel-focus, --all-videos, --songlist-only, --latest-per-channel) use identical download execution, progress tracking, and error handling
- **Centralized download logic**: Single `execute_unified_download_workflow()` method handles all download execution
- **Automatic parallel support**: All download modes automatically support `--parallel --workers N` without additional implementation
- **Unified cache management**: Consistent progress tracking and resume functionality across all modes
### **What Was Fixed**
- **Broken Pipeline**: Previously, different modes used different execution paths, leading to inconsistencies
- **Missing Method**: Added missing `download_latest_per_channel()` method that was referenced in CLI but not implemented
- **Code Duplication**: Eliminated duplicate download execution logic across different modes
- **Inconsistent Behavior**: All modes now have identical progress tracking, error handling, and cache management
### **Benefits**
- ✅ **Consistency**: All modes behave identically for execution, progress tracking, and error handling
- ✅ **Maintainability**: Changes to download execution only need to be made in one place
- ✅ **Reliability**: Eliminates broken pipelines and inconsistent behavior between modes
- ✅ **Extensibility**: New modes automatically get all existing features (parallel downloads, progress tracking, etc.)
- ✅ **Testing**: Easier to test since all modes use the same execution logic
## 🛡️ Duplicate File Prevention & Filename Consistency (v3.4.2)
### **Duplicate File Prevention**
- **Enhanced file existence checking**: Now detects files with `(2)`, `(3)`, etc. suffixes that yt-dlp creates
- **Automatic duplicate prevention**: Skips downloads when files already exist (including duplicates)
- **Updated yt-dlp configuration**: Set `"nooverwrites": false` to prevent yt-dlp from creating duplicate files
- **Cleanup utility**: `data/cleanup_duplicate_files.py` helps identify and remove existing duplicate files
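The duplicate detection the cleanup utility performs can be sketched as follows: a file counts as a duplicate only if its `(n)`-suffixed name has a matching original. This is an assumed reconstruction, not the actual `cleanup_duplicate_files.py` code:

```python
import re
from pathlib import Path

DUP_RE = re.compile(r"^(?P<base>.+) \((?P<n>\d+)\)\.mp4$")

def find_duplicates(folder):
    """Return '(2)'/'(3)'-suffixed files whose original 'base.mp4' also exists."""
    folder = Path(folder)
    dups = []
    for path in folder.glob("*.mp4"):
        m = DUP_RE.match(path.name)
        if m and (folder / f"{m.group('base')}.mp4").exists():
            dups.append(path)
    return dups
```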
### **Filename vs ID3 Tag Consistency**
- **Consistent metadata**: Filename and ID3 tag now use identical artist/title format
- **Removed extra suffixes**: No more "(Karaoke Version)" in ID3 tags that don't match filenames
- **Unified parsing**: Both filename generation and ID3 tagging use the same artist/title extraction
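Keeping filename and ID3 metadata identical means deriving both from the same artist/title pair. A sketch, assuming the 'Artist - Title.mp4' naming convention; the dict keys are the iTunes atom names mutagen's `MP4` tags use (`\xa9ART` for artist, `\xa9nam` for title), so the result could be applied via `mutagen.mp4.MP4(path).update(...)`:

```python
def tags_from_filename(filename):
    """Derive artist/title metadata that exactly matches the filename (sketch)."""
    stem = filename.rsplit(".", 1)[0]
    artist, _, title = stem.partition(" - ")
    # Keys are the MP4 atom names mutagen expects for artist and title
    return {"\xa9ART": [artist], "\xa9nam": [title]}
```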
### **Benefits**
- ✅ **No more duplicate files** with `(2)`, `(3)` suffixes
- ✅ **Consistent metadata** between filename and ID3 tags
- ✅ **Efficient disk usage** by preventing unnecessary downloads
- ✅ **Clear file identification** with consistent naming
### **Clean Up Existing Duplicates**
```bash
# Run the cleanup utility to find and remove existing duplicates
python data/cleanup_duplicate_files.py
# Choose option 1 for dry run (recommended first)
# Choose option 2 to actually delete duplicates
```
## 📋 Requirements
- **Windows 10/11**
- **Windows 10/11 or macOS 10.14+**
- **Python 3.7+**
- **yt-dlp.exe** (in `downloader/`)
- **yt-dlp binary** (platform-specific, see setup instructions below)
- **mutagen** (for ID3 tagging, optional)
- **ffmpeg/ffprobe** (for video validation, optional but recommended)
- **rapidfuzz** (for fuzzy matching, optional, falls back to difflib)
## 🍎 macOS Setup
### Automatic Setup (Recommended)
Run the macOS setup script to automatically set up yt-dlp and FFmpeg:
```bash
python3 setup_macos.py
```
This script will:
- Detect your macOS version
- Offer installation options for yt-dlp (pip or binary download)
- Install FFmpeg via Homebrew
- Test the installation
### Manual Setup
If you prefer to set up manually:
#### Option 1: Install yt-dlp via pip
```bash
pip3 install yt-dlp
```
#### Option 2: Download yt-dlp binary
```bash
mkdir -p downloader
curl -L -o downloader/yt-dlp_macos https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp_macos
chmod +x downloader/yt-dlp_macos
```
#### Install FFmpeg
```bash
brew install ffmpeg
```
### Test Installation
```bash
python3 src/tests/test_macos.py
```
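The automatic platform detection described above can be sketched like this; the function name and fallback behavior are assumptions for illustration, not the tool's actual detection code:

```python
import platform
from pathlib import Path

def locate_ytdlp(downloader_dir="downloader"):
    """Pick the platform-appropriate yt-dlp binary (sketch)."""
    system = platform.system()
    if system == "Windows":
        candidate = Path(downloader_dir) / "yt-dlp.exe"
    elif system == "Darwin":
        candidate = Path(downloader_dir) / "yt-dlp_macos"
    else:
        candidate = None
    if candidate is not None and candidate.exists():
        return str(candidate)
    return "yt-dlp"  # fall back to a pip-installed binary on PATH
```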
## 🚀 Quick Start
> **💡 Pro Tip**: For a complete list of all available commands, see `commands.txt` - you can copy/paste any command directly into your terminal!
@@ -96,6 +251,21 @@ The codebase has been comprehensively refactored into a modular architecture wit
python download_karaoke.py https://www.youtube.com/@SingKingKaraoke/videos
```
### Download ALL Videos from a Channel (Not Just Songlist Matches)
```bash
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos
```
### Download ALL Videos with Parallel Processing
```bash
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --parallel --workers 10
```
### Download ALL Videos with Limit
```bash
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --limit 100
```
### Download Only Songlist Songs (Fast Mode)
```bash
python download_karaoke.py --songlist-only --limit 5
@@ -103,7 +273,7 @@ python download_karaoke.py --songlist-only --limit 5
### Download with Parallel Processing
```bash
python download_karaoke.py --parallel --workers 5 --songlist-only --limit 10
python download_karaoke.py --parallel --songlist-only --limit 10
```
### Focus on Specific Playlists by Title
@@ -111,11 +281,31 @@ python download_karaoke.py --parallel --workers 5 --songlist-only --limit 10
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100"
```
### Focus on Specific Playlists from Custom File
```bash
python download_karaoke.py --songlist-focus "CCKaraoke" --songlist-file "data/my_custom_songlist.json"
```
### Force Download from Channels (Bypass All Existing File Checks)
```bash
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --force
```
### Download with Fuzzy Matching
```bash
python download_karaoke.py --songlist-only --limit 10 --fuzzy-match --fuzzy-threshold 85
```
### Test Download Plan (Dry Run)
```bash
python download_karaoke.py --songlist-only --limit 5 --dry-run
```
### Test Channel Download Plan (Dry Run)
```bash
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --limit 10 --dry-run
```
### Download Latest N Videos Per Channel
```bash
python download_karaoke.py --latest-per-channel --limit 5
@@ -220,19 +410,33 @@ KaroakeVideoDownloader/
│ ├── check_resolution.py # Resolution checker utility
│ ├── resolution_cli.py # Resolution config CLI
│ └── tracking_cli.py # Tracking management CLI
├── data/ # All config, tracking, cache, and songlist files
│ ├── config.json
├── config/ # Configuration files
│ └── config.json # Main configuration file
├── data/ # All tracking, cache, and songlist files
│ ├── karaoke_tracking.json
│ ├── songlist_tracking.json
│ ├── channel_cache.json
│ ├── channels.txt
│ ├── channels.json # Channel configuration with parsing rules
│ └── songList.json
├── utilities/ # Utility scripts and tools
│ ├── add_manual_video.py # Manual video management
│ ├── build_cache_from_raw.py # Cache building utility
│ ├── cleanup_duplicate_files.py # File cleanup utilities
│ ├── cleanup_recent_tracking.py # Tracking cleanup utilities
│ ├── deduplicate_songlist_tracking.py # Data deduplication
│ ├── fix_artist_name_format.py # Data cleanup utilities
│ ├── fix_artist_name_format_simple.py
│ ├── fix_code_quality.py # Development tools
│ ├── reset_and_redownload.py # Maintenance utilities
│ └── songlist_report.py # Reporting utilities
├── downloads/ # All video output
│ └── [ChannelName]/ # Per-channel folders
├── logs/ # Download logs
├── downloader/yt-dlp.exe # yt-dlp binary
├── tests/ # Diagnostic and test scripts
│ └── test_installation.py
├── downloader/yt-dlp.exe # yt-dlp binary (Windows)
├── downloader/yt-dlp_macos # yt-dlp binary (macOS)
├── src/tests/ # Test scripts
│ ├── test_macos.py # macOS setup and functionality tests
│ └── test_platform.py # Platform detection tests
├── download_karaoke.py # Main entry point (thin wrapper)
├── README.md
├── PRD.md
@@ -249,6 +453,7 @@ KaroakeVideoDownloader/
- `--songlist-priority`: Prioritize songlist songs in download queue
- `--songlist-only`: Download only songs from the songlist
- `--songlist-focus <PLAYLIST_TITLE1> <PLAYLIST_TITLE2>...`: Focus on specific playlists by title (e.g., `--songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100"`)
- `--songlist-file <FILE_PATH>`: Custom songlist file path to use with --songlist-focus (default: data/songList.json)
- `--songlist-status`: Show songlist download progress
- `--limit <N>`: Limit number of downloads (enables fast mode with early exit)
- `--resolution <720p|1080p|...>`: Override resolution
@@ -260,8 +465,14 @@ KaroakeVideoDownloader/
- `--latest-per-channel`: **Download the latest N videos from each channel (use with --limit)**
- `--fuzzy-match`: Enable fuzzy matching for songlist-to-video matching (uses rapidfuzz if available)
- `--fuzzy-threshold <N>`: Fuzzy match threshold (0-100, default 85)
- `--parallel`: Enable parallel downloads for improved speed
- `--workers <N>`: Number of parallel download workers (1-10, default: 3)
- `--parallel`: Enable parallel downloads for improved speed (defaults to 3 workers)
- `--workers <N>`: Number of parallel download workers (1-10, default: 3, only used with --parallel)
- `--generate-songlist <DIR1> <DIR2>...`: **Generate song list from MP4 files with ID3 tags in specified directories**
- `--no-append-songlist`: **Create a new song list instead of appending when using --generate-songlist**
- `--force`: **Force download from channels, bypassing all existing file checks and re-downloading if necessary**
- `--channel-focus <CHANNEL_NAME>`: **Download from a specific channel by name (e.g., 'SingKingKaraoke')**
- `--all-videos`: **Download all videos from channel (not just songlist matches), skipping existing files**
- `--dry-run`: **Build download plan and show what would be downloaded without actually downloading anything**
## 📝 Example Usage
@@ -272,30 +483,61 @@ KaroakeVideoDownloader/
python download_karaoke.py --songlist-only --limit 10 --fuzzy-match --fuzzy-threshold 85
# Parallel downloads for faster processing
python download_karaoke.py --parallel --workers 5 --songlist-only --limit 10
python download_karaoke.py --parallel --songlist-only --limit 10
# Latest videos per channel with parallel downloads
python download_karaoke.py --parallel --workers 3 --latest-per-channel --limit 5
python download_karaoke.py --parallel --latest-per-channel --limit 5
# Traditional full scan (no limit)
python download_karaoke.py --songlist-only
# Focused fuzzy matching (target specific playlists with flexible matching)
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --fuzzy-match --fuzzy-threshold 80 --limit 10
# Focus on specific playlists from a custom file
python download_karaoke.py --songlist-focus "CCKaraoke" --songlist-file "data/my_custom_songlist.json" --limit 10
# Force download with fuzzy matching (bypass all existing file checks)
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --force --fuzzy-match --fuzzy-threshold 80 --limit 10
# Channel-specific operations
python download_karaoke.py --reset-channel SingKingKaraoke
python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist
python download_karaoke.py --clear-cache all
python download_karaoke.py --clear-server-duplicates
# Download ALL videos from a specific channel
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --parallel --workers 10
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --limit 100
# Song list generation from MP4 files
python download_karaoke.py --generate-songlist /path/to/mp4/directory
python download_karaoke.py --generate-songlist /path/to/dir1 /path/to/dir2 --no-append-songlist
# Generate report of songs that couldn't be found
python download_karaoke.py --generate-unmatched-report
python download_karaoke.py --generate-unmatched-report --fuzzy-match --fuzzy-threshold 85
```
## 🏷️ ID3 Tagging
- Adds artist/title/album/genre to MP4 files using mutagen (if installed)
## 📋 Song List Generation
- **Generate song lists from existing MP4 files**: Use `--generate-songlist` to create song lists from directories containing MP4 files with ID3 tags
- **Automatic ID3 extraction**: Extracts artist and title from MP4 files' ID3 tags
- **Directory-based organization**: Each directory becomes a playlist with the directory name as the title
- **Position tracking**: Songs are numbered starting from 1 based on file order
- **Append or replace**: Choose to append to existing song list or create a new one with `--no-append-songlist`
- **Multiple directories**: Process multiple directories in a single command
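The generation steps above can be sketched as one playlist per directory, numbered from 1 in file order. The tag reader is injected here so the sketch stays testable without real MP4 files; the actual feature reads ID3 tags via mutagen:

```python
from pathlib import Path

def build_playlist(directory, read_tags):
    """Build a playlist dict for one directory (sketch).

    read_tags(path) -> (artist, title) is a hypothetical callback standing in
    for the real ID3 extraction.
    """
    directory = Path(directory)
    songs = []
    for pos, path in enumerate(sorted(directory.glob("*.mp4")), start=1):
        artist, title = read_tags(path)
        songs.append({"position": pos, "artist": artist, "title": title})
    # The directory name becomes the playlist title
    return {"title": directory.name, "songs": songs}
```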
## 🧹 Cleanup
- Removes `.info.json` and `.meta` files after download
## 🛠️ Configuration
- All options are in `data/config.json` (format, resolution, metadata, etc.)
- All options are in `config/config.json` (format, resolution, metadata, etc.)
- You can edit this file or use CLI flags to override
- **Configurable Data Directory**: The data directory path can be configured in `config/config.json` under `folder_structure.data_dir` (default: "data")
## 📋 Command Reference File
@@ -311,6 +553,31 @@ python download_karaoke.py --clear-server-duplicates
> **🔄 Maintenance Note**: The `commands.txt` file should be kept up to date with any CLI changes. When adding new command-line options or modifying existing ones, update this file to reflect all available commands and their usage.
## 📚 Documentation Standards
### **Documentation Location**
- **All changes, refactoring, and improvements should be documented in the PRD.md and README.md files**
- **Do NOT create separate .md files for documenting changes, refactoring, or improvements**
- **Use the existing sections in PRD.md and README.md to track all project evolution**
### **Where to Document Changes**
- **PRD.md**: Technical details, architecture changes, bug fixes, and implementation specifics
- **README.md**: User-facing features, usage instructions, and high-level improvements
- **CHANGELOG.md**: Version-specific release notes and change summaries
### **Documentation Requirements**
- **All new features must be documented in both PRD.md and README.md**
- **All refactoring efforts must be documented in the appropriate sections**
- **All bug fixes must be documented with technical details**
- **Version numbers and dates should be clearly marked**
- **Benefits and improvements should be explicitly stated**
### **Maintenance Responsibility**
- **Keep PRD.md and README.md synchronized with code changes**
- **Update documentation immediately when implementing new features**
- **Remove outdated information and consolidate related changes**
- **Ensure all CLI options and features are documented in both files**
## 🔧 Refactoring Improvements (v3.3)
The codebase has been comprehensively refactored to improve maintainability and reduce code duplication. Recent improvements have enhanced reliability, performance, and code organization:
@@ -348,7 +615,7 @@ The codebase has been comprehensively refactored to improve maintainability and
### **New Parallel Download System (v3.4)**
- **Parallel downloader module:** `parallel_downloader.py` provides thread-safe concurrent download management
- **Configurable concurrency:** Use `--parallel --workers N` to enable parallel downloads with N workers (1-10)
- **Configurable concurrency:** Use `--parallel` to enable parallel downloads with 3 workers by default, or `--parallel --workers N` for custom worker count (1-10)
- **Thread-safe operations:** All tracking, caching, and progress operations are thread-safe
- **Real-time progress tracking:** Shows active downloads, completion status, and overall progress
- **Automatic retry mechanism:** Failed downloads are automatically retried with reduced concurrency
@@ -372,7 +639,8 @@ The codebase has been comprehensively refactored to improve maintainability and
- **Robust download plan execution:** Fixed index management in download plan execution to prevent errors during interrupted downloads.
## 🐞 Troubleshooting
- Ensure `yt-dlp.exe` is in the `downloader/` folder
- **Windows**: Ensure `yt-dlp.exe` is in the `downloader/` folder
- **macOS**: Run `python3 setup_macos.py` to set up yt-dlp and FFmpeg
- Check `logs/` for error details
- Use `python -m karaoke_downloader.check_resolution` to verify video quality
- If you see errors about ffmpeg/ffprobe, install [ffmpeg](https://ffmpeg.org/download.html) and ensure it is in your PATH


@@ -1,6 +1,6 @@
# 🎤 Karaoke Video Downloader - CLI Commands Reference
# Copy and paste these commands into your terminal
# Updated: v3.4 (includes parallel downloads and all refactoring improvements)
# Updated: v3.4.4 (includes macOS support, all videos download mode, manual video collection, channel parsing rules, and all previous improvements)
## 📥 BASIC DOWNLOADS
@@ -8,7 +8,7 @@
python download_karaoke.py https://www.youtube.com/@SingKingKaraoke/videos
# Download from a file containing multiple channel URLs
python download_karaoke.py --file data/channels.txt
python download_karaoke.py --file data/channels.json
# Download with custom resolution (480p, 720p, 1080p, 1440p, 2160p)
python download_karaoke.py --resolution 1080p https://www.youtube.com/@SingKingKaraoke/videos
@@ -19,9 +19,69 @@ python download_karaoke.py --limit 10 https://www.youtube.com/@SingKingKaraoke/v
# Enable parallel downloads for faster processing (3-5x speedup)
python download_karaoke.py --parallel --workers 5 --limit 10 https://www.youtube.com/@SingKingKaraoke/videos
## 🎤 MANUAL VIDEO COLLECTION (v3.4.3)
# Download from manual videos collection (data/manual_videos.json)
python download_karaoke.py --manual --limit 5
# Download manual videos with fuzzy matching
python download_karaoke.py --manual --fuzzy-match --fuzzy-threshold 85 --limit 10
# Download manual videos with parallel processing
python download_karaoke.py --parallel --workers 3 --manual --limit 5
# Download manual videos with songlist matching
python download_karaoke.py --manual --songlist-only --limit 10
# Force download from manual videos (bypass existing file checks)
python download_karaoke.py --manual --force --limit 5
# Add a video to manual collection
python utilities/add_manual_video.py add "Artist - Song Title (Karaoke Version)" "https://www.youtube.com/watch?v=VIDEO_ID"
# List all manual videos
python utilities/add_manual_video.py list
# Remove a video from manual collection
python utilities/add_manual_video.py remove "Artist - Song Title (Karaoke Version)"
## 🎬 ALL VIDEOS DOWNLOAD MODE (v3.4.4)
# Download ALL videos from a specific channel (not just songlist matches)
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos
# Download ALL videos with parallel processing for speed
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --parallel --workers 10
# Download ALL videos with limit (download first N videos)
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --limit 100
# Download ALL videos with parallel processing and limit
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --parallel --workers 5 --limit 50
# Download ALL videos from ZoomKaraokeOfficial channel
python download_karaoke.py --channel-focus ZoomKaraokeOfficial --all-videos
# Download ALL videos with custom resolution
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --resolution 1080p
## 📋 SONG LIST GENERATION
# Generate song list from MP4 files in a directory (append to existing song list)
python download_karaoke.py --generate-songlist /path/to/mp4/directory
# Generate song list from multiple directories
python download_karaoke.py --generate-songlist /path/to/dir1 /path/to/dir2 /path/to/dir3
# Generate song list and create a new song list file (don't append)
python download_karaoke.py --generate-songlist /path/to/mp4/directory --no-append-songlist
# Generate song list from multiple directories and create new file
python download_karaoke.py --generate-songlist /path/to/dir1 /path/to/dir2 --no-append-songlist
## 🎵 SONGLIST OPERATIONS
# Download only songs from your songlist (uses data/channels.txt by default)
# Download only songs from your songlist (uses data/channels.json by default)
python download_karaoke.py --songlist-only
# Download only songlist songs with limit
@@ -51,6 +111,18 @@ python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --limit 5
# Focus on specific playlists with parallel processing
python download_karaoke.py --parallel --workers 3 --songlist-focus "2025 - Apple Top 50" --limit 5
# Focus on specific playlists from a custom songlist file
python download_karaoke.py --songlist-focus "CCKaraoke" --songlist-file "data/my_custom_songlist.json"
# Focus on specific playlists from a custom file with force mode
python download_karaoke.py --songlist-focus "CCKaraoke" --songlist-file "data/my_custom_songlist.json" --force
# Force download from channels regardless of existing files or server duplicates
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --force
# Force download with parallel processing
python download_karaoke.py --parallel --workers 5 --songlist-focus "2025 - Apple Top 50" --force --limit 10
# Prioritize songlist songs in download queue (default behavior)
python download_karaoke.py --songlist-priority https://www.youtube.com/@SingKingKaraoke/videos
@@ -60,6 +132,35 @@ python download_karaoke.py --no-songlist-priority https://www.youtube.com/@SingK
# Show songlist download status and statistics
python download_karaoke.py --songlist-status
## 📊 UNMATCHED SONGS REPORTS
# Generate report of songs that couldn't be found in any channel (standalone)
python download_karaoke.py --generate-unmatched-report
# Generate report with fuzzy matching enabled (standalone)
python download_karaoke.py --generate-unmatched-report --fuzzy-match --fuzzy-threshold 85
# Generate report using a specific channel file (standalone)
python download_karaoke.py --generate-unmatched-report --file data/my_channels.txt
# Generate report from a custom songlist file (standalone)
python download_karaoke.py --generate-unmatched-report --songlist-file "data/my_custom_songlist.json"
# Generate report with focus on specific playlists from a custom file (standalone)
python download_karaoke.py --songlist-focus "CCKaraoke" --songlist-file "data/my_custom_songlist.json" --generate-unmatched-report
# Download songs AND generate unmatched report (additive feature)
python download_karaoke.py --songlist-only --limit 10 --generate-unmatched-report
# Download with fuzzy matching AND generate unmatched report
python download_karaoke.py --songlist-only --fuzzy-match --fuzzy-threshold 85 --limit 10 --generate-unmatched-report
# Download from specific playlists AND generate unmatched report
python download_karaoke.py --songlist-focus "CCKaraoke" --limit 10 --generate-unmatched-report
# Generate report with custom fuzzy threshold
python download_karaoke.py --generate-unmatched-report --fuzzy-match --fuzzy-threshold 80
## ⚡ PARALLEL DOWNLOADS (v3.4)
# Basic parallel downloads (3-5x faster than sequential)
@@ -94,7 +195,7 @@ python download_karaoke.py --parallel --workers 3 --latest-per-channel --limit 5
python download_karaoke.py --parallel --workers 3 --latest-per-channel --limit 5 --fuzzy-match --fuzzy-threshold 85
# Download latest videos from specific channels file
python download_karaoke.py --latest-per-channel --limit 5 --file data/channels.txt
python download_karaoke.py --latest-per-channel --limit 5 --file data/channels.json
## 🔄 CACHE & TRACKING MANAGEMENT
@ -153,7 +254,7 @@ python download_karaoke.py --version
python download_karaoke.py --songlist-only --limit 20 --fuzzy-match --fuzzy-threshold 85 --resolution 1080p
# Latest videos per channel with fuzzy matching
python download_karaoke.py --latest-per-channel --limit 3 --fuzzy-match --fuzzy-threshold 90 --file data/channels.txt
python download_karaoke.py --latest-per-channel --limit 3 --fuzzy-match --fuzzy-threshold 90 --file data/channels.json
# Force refresh everything and download songlist
python download_karaoke.py --songlist-only --force-download-plan --refresh --limit 10
@ -172,6 +273,9 @@ python download_karaoke.py --parallel --workers 5 --songlist-only --limit 10
# 1b. Focus on specific playlists (fast targeted download)
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --limit 5
# 1c. Force download from specific playlists (bypass all existing file checks)
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --force --limit 5
# 2. Latest videos from all channels
python download_karaoke.py --latest-per-channel --limit 5
@ -190,6 +294,9 @@ python download_karaoke.py --parallel --workers 5 --songlist-only --fuzzy-match
# 4b. Focused fuzzy matching (target specific playlists with flexible matching)
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --fuzzy-match --fuzzy-threshold 80 --limit 10
# 4c. Force download with fuzzy matching (bypass all existing file checks)
python download_karaoke.py --songlist-focus "2025 - Apple Top 50" --force --fuzzy-match --fuzzy-threshold 80 --limit 10
# 5. Reset and start fresh
python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist
@ -197,6 +304,33 @@ python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist
python download_karaoke.py --status
python download_karaoke.py --clear-cache all
# 7. Download from manual video collection
python download_karaoke.py --manual --limit 5
# 7b. Fast parallel manual video download
python download_karaoke.py --parallel --workers 3 --manual --limit 5
# 7c. Manual videos with fuzzy matching
python download_karaoke.py --manual --fuzzy-match --fuzzy-threshold 85 --limit 10
## 🍎 macOS SETUP COMMANDS
# Automatic macOS setup (detects OS and installs yt-dlp + FFmpeg)
python3 setup_macos.py
# Test macOS setup and functionality
python3 src/tests/test_macos.py
# Manual macOS setup options
# Install yt-dlp via pip
pip3 install yt-dlp
# Download yt-dlp binary for macOS
mkdir -p downloader && curl -L -o downloader/yt-dlp_macos https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp_macos && chmod +x downloader/yt-dlp_macos
# Install FFmpeg via Homebrew
brew install ffmpeg
## 🔧 TROUBLESHOOTING COMMANDS
# Check if everything is working
@ -212,7 +346,9 @@ python download_karaoke.py --clear-server-duplicates
## 📝 NOTES
# Default files used:
# - data/channels.txt (default channel list for songlist modes)
# - data/channels.json (channel configuration with parsing rules, preferred)
# - data/channels.json (channel configuration with parsing rules)
# - data/manual_videos.json (manual video collection)
# - data/songList.json (your prioritized song list)
# - data/config.json (download settings)
@ -221,11 +357,12 @@ python download_karaoke.py --clear-server-duplicates
# Fuzzy threshold: 0-100 (higher = more strict matching, default 90)
# The system automatically:
# - Uses data/channels.txt if no --file specified in songlist modes
# - Uses data/channels.json for channel configuration and parsing rules
# - Caches channel data for 24 hours (configurable)
# - Tracks all downloads in JSON files
# - Avoids re-downloading existing files
# - Checks for server duplicates
# - Supports manual video collection via --manual parameter
# For best performance:
# - Use --parallel --workers 5 for 3-5x faster downloads
@ -233,6 +370,7 @@ python download_karaoke.py --clear-server-duplicates
# - Use --fuzzy-match for better song discovery
# - Use --refresh sparingly (forces re-scan)
# - Clear cache if you encounter issues
# - macOS users: Run `python3 setup_macos.py` for automatic setup
# Parallel download tips:
# - Start with --workers 3 for conservative approach
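The 0-100 fuzzy-threshold scale described in the notes above can be illustrated with Python's standard library. This is a hedged sketch only — the project's actual matcher may use a different algorithm — but the contract is the same: higher scores mean closer titles, and the threshold is the minimum score accepted.

```python
from difflib import SequenceMatcher

def fuzzy_score(a: str, b: str) -> int:
    """Return a 0-100 similarity score (illustrative, not the project's exact matcher)."""
    return int(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)

# A threshold of 90 accepts only near-exact titles; 80 tolerates small
# wording differences such as a missing apostrophe.
score = fuzzy_score("Gracie Abrams - That's So True",
                    "Gracie Abrams - Thats So True")
print(score)  # high-90s: passes --fuzzy-threshold 90
```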


@ -19,13 +19,14 @@
"writethumbnail": false,
"embed_metadata": false,
"continuedl": true,
"nooverwrites": true,
"nooverwrites": false,
"ignoreerrors": true,
"no_warnings": false
},
"folder_structure": {
"downloads_dir": "downloads",
"logs_dir": "logs",
"data_dir": "data",
"tracking_file": "downloaded_videos.json"
},
"logging": {
@ -34,5 +35,12 @@
"include_console": true,
"include_file": true
},
"platform_settings": {
"auto_detect_platform": true,
"yt_dlp_paths": {
"windows": "downloader/yt-dlp.exe",
"macos": "downloader/yt-dlp_macos"
}
},
"yt_dlp_path": "downloader/yt-dlp.exe"
}
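The `platform_settings` block above drives platform-aware binary selection. A minimal sketch of how such a config could be resolved (the helper name `resolve_yt_dlp_path` is illustrative; the project's `config_manager` implements the real logic):

```python
import sys

def resolve_yt_dlp_path(config: dict) -> str:
    """Pick the yt-dlp path for the current OS, falling back to the legacy key."""
    settings = config.get("platform_settings", {})
    paths = settings.get("yt_dlp_paths", {})
    if settings.get("auto_detect_platform") and paths:
        key = "macos" if sys.platform == "darwin" else "windows"
        if key in paths:
            return paths[key]
    # Legacy single-path fallback, as in the config above
    return config.get("yt_dlp_path", "yt-dlp")

config = {
    "platform_settings": {
        "auto_detect_platform": True,
        "yt_dlp_paths": {
            "windows": "downloader/yt-dlp.exe",
            "macos": "downloader/yt-dlp_macos",
        },
    },
    "yt_dlp_path": "downloader/yt-dlp.exe",
}
print(resolve_yt_dlp_path(config))
```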

File diff suppressed because it is too large Load Diff


@ -0,0 +1,19 @@
{
"channel_id": "@LetsSingKaraoke",
"videos": [
{
"title": "Sub Urban - Cradles | Karaoke (instrumental)",
"id": "8uj7IzhdiO4"
},
{
"title": "Sia - Snowman | Karaoke (instrumental)",
"id": "ZbWHuncTgsM"
},
{
"title": "Trevor Daniel - Falling | Karaoke (Instrumental)",
"id": "nU7n2aq7f98"
}
],
"last_updated": "2025-08-05T15:59:09.280488",
"video_count": 3
}


@ -0,0 +1,10 @@
# Raw yt-dlp output for @LetsSingKaraoke
# Channel URL: https://www.youtube.com/@LetsSingKaraoke/videos
# Command: downloader/yt-dlp_macos --flat-playlist --print %(title)s|%(id)s|%(url)s --verbose https://www.youtube.com/@LetsSingKaraoke/videos
# Timestamp: 2025-08-05T15:59:09.280155
# Total lines: 3
################################################################################
1: Sub Urban - Cradles | Karaoke (instrumental)|8uj7IzhdiO4|https://www.youtube.com/watch?v=8uj7IzhdiO4
2: Sia - Snowman | Karaoke (instrumental)|ZbWHuncTgsM|https://www.youtube.com/watch?v=ZbWHuncTgsM
3: Trevor Daniel - Falling | Karaoke (Instrumental)|nU7n2aq7f98|https://www.youtube.com/watch?v=nU7n2aq7f98
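Note that the pipe-delimited dump above uses `|` as a field separator while karaoke titles often contain `|` themselves, so splitting from the right is the safe way to recover the fields. A small sketch (assuming the `N: title|id|url` line shape shown above):

```python
line = ("1: Sub Urban - Cradles | Karaoke (instrumental)"
        "|8uj7IzhdiO4|https://www.youtube.com/watch?v=8uj7IzhdiO4")

body = line.split(": ", 1)[1]          # drop the "N: " line counter
# rsplit keeps any "|" inside the title intact; only the last two
# separators delimit the id and url fields.
title, video_id, url = body.rsplit("|", 2)
print(title, video_id, url)
```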


data/channels.json Normal file

@ -0,0 +1,191 @@
{
"channels": [
{
"name": "@SingKingKaraoke",
"url": "https://www.youtube.com/@SingKingKaraoke/videos",
"parsing_rules": {
"format": "artist_title_separator",
"separator": " - ",
"artist_first": true,
"title_cleanup": {
"remove_suffix": {
"suffixes": ["(Karaoke)", "(Karaoke Version)", "Karaoke Version"]
}
},
"examples": [
"Artist - Title (Karaoke)",
"Artist - Title (Karaoke Version)"
]
},
"description": "Standard artist - title format with karaoke suffix"
},
{
"name": "@KaraokeOnVEVO",
"url": "https://www.youtube.com/@KaraokeOnVEVO/videos",
"parsing_rules": {
"format": "artist_title_separator",
"separator": " - ",
"artist_first": true,
"title_cleanup": {
"remove_suffix": {
"suffixes": ["(Karaoke)"]
}
},
"examples": [
"George Jones - A Picture Of Me (Without You) (Karaoke)",
"Iggy Pop, Kate Pierson - Candy (Karaoke)"
]
},
"description": "Standard artist - title format with (Karaoke) suffix"
},
{
"name": "@StingrayKaraoke",
"url": "https://www.youtube.com/@StingrayKaraoke/videos",
"parsing_rules": {
"format": "artist_title_separator",
"separator": " - ",
"artist_first": true,
"title_cleanup": {
"remove_suffix": {
"suffixes": ["(Karaoke Version)"]
}
},
"playlist_indicators": [
"TOP SONGS OF",
"THE BEST",
"BEST",
"NON-STOP",
"MASHUP",
"FEAT.",
"WITH LYRICS"
],
"examples": [
"Gracie Abrams - That's So True (Karaoke Version)",
"TOP SONGS OF 2024 KARAOKE WITH LYRICS BY BILLIE EILISH, GRACIE ABRAMS & MORE"
]
},
"description": "Standard artist - title format with (Karaoke Version) suffix, also has playlist titles"
},
{
"name": "@sing2karaoke",
"url": "https://www.youtube.com/@sing2karaoke/videos",
"parsing_rules": {
"format": "artist_title_spaces",
"separator": " ",
"artist_first": true,
"title_cleanup": {
"remove_suffix": {
"suffixes": ["(Karaoke Version) Lyrics", "(Karaoke Version)", "Karaoke Version Lyrics"]
}
},
"multi_artist_separator": ", ",
"examples": [
"Lauren Spencer Smith Fingers Crossed",
"Calvin Harris, Clementine Douglas Blessings (Karaoke Version) Lyrics"
]
},
"description": "Artist and title separated by multiple spaces, supports multiple artists"
},
{
"name": "@ZoomKaraokeOfficial",
"url": "https://www.youtube.com/@ZoomKaraokeOfficial/videos",
"parsing_rules": {
"format": "artist_title_separator",
"separator": " - ",
"artist_first": true,
"title_cleanup": {
"remove_suffix": {
"suffixes": [
"(Karaoke)",
"(Karaoke Version)",
"Karaoke Version",
"- Karaoke Version from Zoom Karaoke",
"- Karaoke Version from Zoom",
"- Karaoke Version from Zoom Karaoke (Radiohead Cover)",
"- Karaoke Version from Zoom (Radiohead Cover)"
]
}
},
"examples": [
"The Mavericks - Here Comes My Baby - Karaoke Version from Zoom Karaoke"
]
},
"description": "Standard artist - title format with '- Karaoke Version from Zoom Karaoke' suffix"
},
{
"name": "@VocalStarKaraoke",
"url": "https://www.youtube.com/@VocalStarKaraoke/videos",
"parsing_rules": {
"format": "artist_title_separator",
"separator": " - ",
"artist_first": false,
"title_cleanup": {
"remove_suffix": {
"suffixes": ["KARAOKE Without Backing Vocals", "KARAOKE With Vocal Guide", "KARAOKE"]
}
},
"examples": [
"Don't Say You Love Me - Jin KARAOKE Without Backing Vocals",
"Don't Say You Love Me - Jin KARAOKE With Vocal Guide"
]
},
"description": "Title first, then dash separator, then artist with KARAOKE suffix"
},
{
"name": "@ManualVideos",
"url": "manual://static",
"manual_videos_file": "data/manual_videos.json",
"parsing_rules": {
"format": "artist_title_separator",
"separator": " - ",
"artist_first": true,
"title_cleanup": {
"remove_suffix": {
"suffixes": ["(Karaoke)", "(Karaoke Version)", "(Karaoke Version) Lyrics"]
}
}
},
"description": "Manual collection of individual karaoke videos (static, never expires)"
},
{
"name": "Let's Sing Karaoke",
"url": "https://www.youtube.com/@LetsSingKaraoke/videos",
"parsing_rules": {
"format": "artist_title_separator",
"separator": " - ",
"artist_first": true,
"title_cleanup": {
"remove_suffix": {
"suffixes": ["(Karaoke)", "(Karaoke Version)", "Karaoke Version", "(In the style of)"]
}
},
"examples": [
"Artist - Title (Karaoke)",
"Artist - Title (In the style of Other Artist)"
]
},
"artist_name_processing": true,
"description": "Let's Sing Karaoke with enhanced artist name processing"
}
],
"global_parsing_settings": {
"fallback_format": "artist_title_separator",
"fallback_separator": " - ",
"common_suffixes": [
"(Karaoke)",
"(Karaoke Version)",
"Karaoke Version",
"(Karaoke Version) Lyrics",
"Karaoke Version Lyrics"
],
"playlist_indicators": [
"TOP",
"BEST",
"MASHUP",
"FEAT.",
"WITH LYRICS",
"NON-STOP",
"PLAYLIST"
]
}
}
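A `parsing_rules` entry like the ones above is applied roughly as follows — split on the separator, strip any configured suffix, then order the parts by `artist_first`. This is a simplified sketch (the project's `ChannelParser` implements the full version); the rules dict is copied from the `@VocalStarKaraoke` entry, where the artist comes second:

```python
def clean(text: str, suffixes: list) -> str:
    """Strip the first matching suffix; list longest suffixes first."""
    text = text.strip()
    for s in suffixes:
        if text.endswith(s):
            return text[: -len(s)].strip()
    return text

def parse(video_title: str, rules: dict) -> tuple:
    sep = rules.get("separator", " - ")
    if sep not in video_title:
        return "", video_title.strip()
    left, right = video_title.split(sep, 1)
    suffixes = (rules.get("title_cleanup", {})
                     .get("remove_suffix", {})
                     .get("suffixes", []))
    left, right = clean(left, suffixes), clean(right, suffixes)
    return (left, right) if rules.get("artist_first", True) else (right, left)

vocal_star_rules = {
    "separator": " - ",
    "artist_first": False,
    "title_cleanup": {"remove_suffix": {"suffixes": [
        "KARAOKE Without Backing Vocals", "KARAOKE With Vocal Guide", "KARAOKE",
    ]}},
}
print(parse("Don't Say You Love Me - Jin KARAOKE Without Backing Vocals",
            vocal_star_rules))
```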


@ -1,7 +0,0 @@
https://www.youtube.com/@SingKingKaraoke/videos
https://www.youtube.com/@karafun/videos
https://www.youtube.com/@KaraokeOnVEVO/videos
https://www.youtube.com/@StingrayKaraoke/videos
https://www.youtube.com/@CCKaraoke/videos
https://www.youtube.com/@AtomicKaraoke/videos
https://www.youtube.com/@sing2karaoke/videos

data/karaoke_tracking.json Normal file

File diff suppressed because it is too large Load Diff

data/manual_videos.json Normal file

@ -0,0 +1,85 @@
{
"channel_name": "@ManualVideos",
"channel_url": "manual://static",
"description": "Manual collection of individual karaoke videos",
"videos": [
{
"title": "Nickelback - Photograph",
"url": "https://www.youtube.com/watch?v=qZXwpceqt9s",
"id": "qZXwpceqt9s",
"upload_date": "2024-01-01",
"duration": 180,
"view_count": 1000
},
{
"title": "Ed Sheeran & Beyoncé - Perfect Duet",
"url": "https://www.youtube.com/watch?v=qegLWI99Wg0",
"id": "qegLWI99Wg0",
"upload_date": "2024-01-01",
"duration": 180,
"view_count": 1000
},
{
"title": "10,000 Maniacs - More Than This",
"url": "https://www.youtube.com/watch?v=wxnuF-APJ5M",
"id": "wxnuF-APJ5M",
"upload_date": "2024-01-01",
"duration": 180,
"view_count": 1000
},
{
"title": "AC/DC - Big Balls",
"url": "https://www.youtube.com/watch?v=kiSDpVmu4Bk",
"id": "kiSDpVmu4Bk",
"upload_date": "2024-01-01",
"duration": 180,
"view_count": 1000
},
{
"title": "Jon Bon Jovi - Blaze of Glory",
"url": "https://www.youtube.com/watch?v=SzRAoDMlQY",
"id": "SzRAoDMlQY",
"upload_date": "2024-01-01",
"duration": 180,
"view_count": 1000
},
{
"title": "ZZ Top - Sharp Dressed Man",
"url": "https://www.youtube.com/watch?v=prRalwto9iY",
"id": "prRalwto9iY",
"upload_date": "2024-01-01",
"duration": 180,
"view_count": 1000
},
{
"title": "Nickelback - Photograph",
"url": "https://www.youtube.com/watch?v=qTphCTAUhUg",
"id": "qTphCTAUhUg",
"upload_date": "2024-01-01",
"duration": 180,
"view_count": 1000
},
{
"title": "Billy Joel - Shes Got A Way",
"url": "https://www.youtube.com/watch?v=DeeTFIgKuC8",
"id": "DeeTFIgKuC8",
"upload_date": "2024-01-01",
"duration": 180,
"view_count": 1000
}
],
"parsing_rules": {
"format": "artist_title_separator",
"separator": " - ",
"artist_first": true,
"title_cleanup": {
"remove_suffix": {
"suffixes": [
"(Karaoke)",
"(Karaoke Version)",
"(Karaoke Version) Lyrics"
]
}
}
}
}



@ -23902,7 +23902,7 @@
"title": "Superman (It's Not Easy)"
},
{
"artist": "'N Sync",
"artist": "'NSync",
"position": 16,
"title": "Gone"
},
@ -24122,7 +24122,7 @@
"title": "Turn Off The Light"
},
{
"artist": "'N Sync",
"artist": "'NSync",
"position": 13,
"title": "Gone"
},
@ -24617,7 +24617,7 @@
"title": "Most Girls"
},
{
"artist": "'N Sync",
"artist": "'NSync",
"position": 11,
"title": "This I Promise You"
},
@ -24857,7 +24857,7 @@
"title": "I Just Wanna Love U (Give It 2 Me)"
},
{
"artist": "'N Sync",
"artist": "'NSync",
"position": 12,
"title": "This I Promise You"
},
@ -25857,7 +25857,7 @@
"title": "Tha Block Is Hot"
},
{
"artist": "'N Sync & Gloria Estefan",
"artist": "'NSync & Gloria Estefan",
"position": 85,
"title": "Music Of My Heart"
},
@ -26237,7 +26237,7 @@
"title": "Touch It"
},
{
"artist": "N Sync",
"artist": "NSync",
"position": 34,
"title": "(God Must Have Spent) A Little More Time On You"
},

data/songlist_tracking.json Normal file

File diff suppressed because it is too large Load Diff

downloader/yt-dlp_macos Executable file

Binary file not shown.


@ -9,6 +9,8 @@ import json
from datetime import datetime, timedelta
from pathlib import Path
from karaoke_downloader.data_path_manager import get_data_path_manager
# Constants
DEFAULT_CACHE_EXPIRATION_DAYS = 1
DEFAULT_CACHE_FILENAME_LENGTH_LIMIT = 200 # Increased from 60
@ -37,7 +39,7 @@ def get_download_plan_cache_file(mode, **kwargs):
+ hashlib.md5(base.encode()).hexdigest()[:8]
)
return Path(f"data/{base}.json")
return get_data_path_manager().get_path(f"{base}.json")
def load_cached_plan(cache_file, max_age_days=DEFAULT_CACHE_EXPIRATION_DAYS):


@ -0,0 +1,260 @@
"""
Channel-specific parsing utilities for extracting artist and title from video titles.
This module handles the different title formats used by various karaoke channels,
providing channel-specific parsing rules to extract artist and title information
correctly for ID3 tagging and filename generation.
"""
import json
import re
from typing import Dict, List, Optional, Tuple, Any
from pathlib import Path
from karaoke_downloader.data_path_manager import get_data_path_manager
class ChannelParser:
"""Handles channel-specific parsing of video titles to extract artist and title."""
def __init__(self, channels_file: str = None):
"""Initialize the parser with channel configuration."""
if channels_file is None:
channels_file = str(get_data_path_manager().get_channels_json_path())
self.channels_file = Path(channels_file)
self.channels_config = self._load_channels_config()
def _load_channels_config(self) -> Dict[str, Any]:
"""Load the channels configuration from JSON file."""
if not self.channels_file.exists():
raise FileNotFoundError(f"Channels configuration file not found: {self.channels_file}")
with open(self.channels_file, 'r', encoding='utf-8') as f:
return json.load(f)
def get_channel_config(self, channel_name: str) -> Optional[Dict[str, Any]]:
"""Get the configuration for a specific channel."""
for channel in self.channels_config.get("channels", []):
if channel["name"] == channel_name:
return channel
return None
def extract_artist_title(self, video_title: str, channel_name: str) -> Tuple[str, str]:
"""
Extract artist and title from a video title using channel-specific parsing rules.
Args:
video_title: The full video title from YouTube
channel_name: The name of the channel (must match config)
Returns:
Tuple of (artist, title) - both may be empty strings if parsing fails
"""
channel_config = self.get_channel_config(channel_name)
if not channel_config:
# Fallback to global settings
return self._fallback_parse(video_title)
parsing_rules = channel_config.get("parsing_rules", {})
format_type = parsing_rules.get("format", "artist_title_separator")
if format_type == "artist_title_separator":
return self._parse_artist_title_separator(video_title, parsing_rules)
elif format_type == "artist_title_spaces":
return self._parse_artist_title_spaces(video_title, parsing_rules)
elif format_type == "title_artist_pipe":
return self._parse_title_artist_pipe(video_title, parsing_rules)
else:
return self._fallback_parse(video_title)
def _parse_artist_title_separator(self, video_title: str, rules: Dict[str, Any]) -> Tuple[str, str]:
"""Parse format: 'Artist - Title' or 'Title - Artist'."""
separator = rules.get("separator", " - ")
artist_first = rules.get("artist_first", True)
if separator not in video_title:
return "", video_title.strip()
parts = video_title.split(separator, 1)
if len(parts) != 2:
return "", video_title.strip()
part1, part2 = parts[0].strip(), parts[1].strip()
# Apply cleanup to both parts
part1_clean = self._cleanup_title(part1, rules.get("title_cleanup", {}))
part2_clean = self._cleanup_title(part2, rules.get("title_cleanup", {}))
if artist_first:
return part1_clean, part2_clean
else:
return part2_clean, part1_clean
def _parse_artist_title_spaces(self, video_title: str, rules: Dict[str, Any]) -> Tuple[str, str]:
"""Parse format: 'Artist Title' (multiple spaces)."""
separator = rules.get("separator", " ")
multi_artist_sep = rules.get("multi_artist_separator", ", ")
# Try multiple space patterns to handle inconsistent spacing
# Look for the LAST occurrence of multiple spaces to handle cases with commas
space_patterns = ["   ", "  ", "    "]  # 3, 2, 4 spaces
for pattern in space_patterns:
if pattern in video_title:
# Split on the LAST occurrence of the pattern
last_index = video_title.rfind(pattern)
if last_index != -1:
artist_part = video_title[:last_index].strip()
title_part = video_title[last_index + len(pattern):].strip()
# Handle multiple artists (e.g., "Artist1, Artist2")
if multi_artist_sep in artist_part:
# Keep the full artist string as is
artist = artist_part
else:
artist = artist_part
title = self._cleanup_title(title_part, rules.get("title_cleanup", {}))
return artist, title
# Try dash patterns as fallback for inconsistent formatting
dash_patterns = [" - ", " – ", " -"]  # Regular dash, en dash, dash without trailing space
for pattern in dash_patterns:
if pattern in video_title:
# Split on the LAST occurrence of the pattern
last_index = video_title.rfind(pattern)
if last_index != -1:
artist_part = video_title[:last_index].strip()
title_part = video_title[last_index + len(pattern):].strip()
# Handle multiple artists (e.g., "Artist1, Artist2")
if multi_artist_sep in artist_part:
# Keep the full artist string as is
artist = artist_part
else:
artist = artist_part
title = self._cleanup_title(title_part, rules.get("title_cleanup", {}))
return artist, title
# If no pattern matches, return empty artist and full title
return "", video_title.strip()
def _parse_title_artist_pipe(self, video_title: str, rules: Dict[str, Any]) -> Tuple[str, str]:
"""Parse format: 'Title | Artist'."""
separator = rules.get("separator", " | ")
if separator not in video_title:
return "", video_title.strip()
parts = video_title.split(separator, 1)
if len(parts) != 2:
return "", video_title.strip()
title_part, artist_part = parts[0].strip(), parts[1].strip()
title = self._cleanup_title(title_part, rules.get("title_cleanup", {}))
artist = self._cleanup_title(artist_part, rules.get("artist_cleanup", {}))
return artist, title
def _cleanup_title(self, text: str, cleanup_rules: Dict[str, Any]) -> str:
"""Apply cleanup rules to remove suffixes and normalize text."""
if not cleanup_rules:
return text.strip()
cleaned = text.strip()
# Handle remove_suffix rule
if "remove_suffix" in cleanup_rules:
suffixes = cleanup_rules["remove_suffix"].get("suffixes", [])
for suffix in suffixes:
if cleaned.endswith(suffix):
cleaned = cleaned[:-len(suffix)].strip()
break
return cleaned
def _fallback_parse(self, video_title: str) -> Tuple[str, str]:
"""Fallback parsing using global settings."""
global_settings = self.channels_config.get("global_parsing_settings", {})
fallback_format = global_settings.get("fallback_format", "artist_title_separator")
fallback_separator = global_settings.get("fallback_separator", " - ")
if fallback_format == "artist_title_separator":
if fallback_separator in video_title:
parts = video_title.split(fallback_separator, 1)
if len(parts) == 2:
artist = parts[0].strip()
title = parts[1].strip()
# Apply global suffix cleanup
for suffix in global_settings.get("common_suffixes", []):
if title.endswith(suffix):
title = title[:-len(suffix)].strip()
break
return artist, title
# If all else fails, return empty artist and full title
return "", video_title.strip()
def is_playlist_title(self, video_title: str, channel_name: str) -> bool:
"""Check if a video title appears to be a playlist rather than a single song."""
channel_config = self.get_channel_config(channel_name)
if not channel_config:
return self._is_playlist_by_global_rules(video_title)
parsing_rules = channel_config.get("parsing_rules", {})
playlist_indicators = parsing_rules.get("playlist_indicators", [])
if not playlist_indicators:
return self._is_playlist_by_global_rules(video_title)
title_upper = video_title.upper()
for indicator in playlist_indicators:
if indicator.upper() in title_upper:
return True
return False
def _is_playlist_by_global_rules(self, video_title: str) -> bool:
"""Check if title is a playlist using global rules."""
global_settings = self.channels_config.get("global_parsing_settings", {})
playlist_indicators = global_settings.get("playlist_indicators", [])
title_upper = video_title.upper()
for indicator in playlist_indicators:
if indicator.upper() in title_upper:
return True
return False
def get_all_channel_names(self) -> List[str]:
"""Get a list of all configured channel names."""
return [channel["name"] for channel in self.channels_config.get("channels", [])]
def get_channel_url(self, channel_name: str) -> Optional[str]:
"""Get the URL for a specific channel."""
channel_config = self.get_channel_config(channel_name)
return channel_config.get("url") if channel_config else None
# Convenience function for backward compatibility
def extract_artist_title(video_title: str, channel_name: str, channels_file: str = None) -> Tuple[str, str]:
"""
Convenience function to extract artist and title from a video title.
Args:
video_title: The full video title from YouTube
channel_name: The name of the channel
channels_file: Path to the channels configuration file
Returns:
Tuple of (artist, title)
"""
if channels_file is None:
channels_file = str(get_data_path_manager().get_channels_json_path())
parser = ChannelParser(channels_file)
return parser.extract_artist_title(video_title, channel_name)
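The playlist-title heuristic used by `is_playlist_title` above boils down to a case-insensitive indicator search. A standalone sketch with a hand-picked indicator list (the real list lives in `channels.json`):

```python
# Illustrative subset of the indicators configured in channels.json
PLAYLIST_INDICATORS = ["TOP SONGS OF", "THE BEST", "NON-STOP", "MASHUP", "WITH LYRICS"]

def is_playlist_title(title: str) -> bool:
    """True if the title looks like a compilation/playlist rather than one song."""
    upper = title.upper()
    return any(indicator in upper for indicator in PLAYLIST_INDICATORS)

print(is_playlist_title("TOP SONGS OF 2024 KARAOKE WITH LYRICS BY BILLIE EILISH"))
print(is_playlist_title("Gracie Abrams - That's So True (Karaoke Version)"))
```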


@ -1,27 +1,117 @@
#!/usr/bin/env python3
"""
Karaoke Video Downloader CLI
Command-line interface for the karaoke video downloader.
"""
import argparse
import os
import sys
from pathlib import Path
from typing import List
from karaoke_downloader.channel_parser import ChannelParser
from karaoke_downloader.config_manager import AppConfig
from karaoke_downloader.data_path_manager import get_data_path_manager
from karaoke_downloader.downloader import KaraokeDownloader
# Constants
DEFAULT_LATEST_PER_CHANNEL_LIMIT = 10
DEFAULT_FUZZY_THRESHOLD = 85
DEFAULT_LATEST_PER_CHANNEL_LIMIT = 5
DEFAULT_DISPLAY_LIMIT = 10
DEFAULT_CACHE_DURATION_HOURS = 24
def load_channels_from_json(channels_file: str = None) -> List[str]:
"""
Load channel URLs from the new JSON format.
Args:
channels_file: Path to the channels.json file (if None, uses default from config)
Returns:
List of channel URLs
"""
if channels_file is None:
channels_file = str(get_data_path_manager().get_channels_json_path())
try:
parser = ChannelParser(channels_file)
channels = parser.channels_config.get("channels", [])
return [channel["url"] for channel in channels]
except Exception as e:
print(f"❌ Error loading channels from {channels_file}: {e}")
return []
def load_channels_from_text(channels_file: str = None) -> List[str]:
"""
Load channel URLs from the old text format (for backward compatibility).
Args:
channels_file: Path to the channels.txt file (if None, uses default from config)
Returns:
List of channel URLs
"""
if channels_file is None:
channels_file = str(get_data_path_manager().get_channels_txt_path())
try:
with open(channels_file, "r", encoding="utf-8") as f:
return [
line.strip()
for line in f
if line.strip() and not line.strip().startswith("#")
]
except Exception as e:
print(f"❌ Error loading channels from {channels_file}: {e}")
return []
def load_channels(channel_file: str = None) -> List[str]:
"""Load channel URLs from file."""
if channel_file is None:
# Use JSON configuration
data_path_manager = get_data_path_manager()
if data_path_manager.file_exists("channels.json"):
return load_channels_from_json()
else:
return []
else:
if channel_file.endswith(".json"):
return load_channels_from_json(channel_file)
else:
return load_channels_from_text(channel_file)
def get_channel_url_by_name(channel_name: str) -> str:
"""Look up a channel URL by its name from the channels configuration."""
channel_urls = load_channels()
# Normalize the channel name for comparison
normalized_name = channel_name.lower().replace("@", "").replace("karaoke", "").strip()
for url in channel_urls:
# Extract channel name from URL
if "/@" in url:
url_channel_name = url.split("/@")[1].split("/")[0].lower()
if url_channel_name == normalized_name or url_channel_name.replace("karaoke", "").strip() == normalized_name:
return url
return None
def main():
parser = argparse.ArgumentParser(
description="Karaoke Video Downloader - Download YouTube playlists and channel videos for karaoke",
description="Karaoke Video Downloader - Download YouTube playlists and channel videos for karaoke (default: downloads latest videos from all channels)",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
python download_karaoke.py https://www.youtube.com/playlist?list=XYZ
python download_karaoke.py https://www.youtube.com/@SingKingKaraoke/videos
python download_karaoke.py --file data/channels.txt
python download_karaoke.py --limit 10 # Download latest 10 videos from all channels
python download_karaoke.py --songlist-only --limit 10 # Download only songlist songs across channels
python download_karaoke.py --channel-focus SingKingKaraoke --limit 5 # Download from specific channel
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos # Download ALL videos from channel
python download_karaoke.py https://www.youtube.com/@SingKingKaraoke/videos # Download from specific channel URL
python download_karaoke.py --file data/channels.txt # Download from custom channel list
python download_karaoke.py --reset-channel SingKingKaraoke --delete-files
""",
)
@ -92,13 +182,34 @@ Examples:
parser.add_argument(
"--songlist-priority",
action="store_true",
help="Prioritize downloads based on data/songList.json (default: enabled)",
help="Prioritize downloads based on songList.json in the data directory (default: enabled)",
)
parser.add_argument(
"--no-songlist-priority",
action="store_true",
help="Disable songlist prioritization",
)
parser.add_argument(
"--generate-unmatched-report",
action="store_true",
help="Generate a report of songs that couldn't be found in any channel (runs after downloads)",
)
parser.add_argument(
"--show-pagination",
action="store_true",
help="Show page-by-page progress when downloading channel video lists (slower but more detailed)",
)
parser.add_argument(
"--parallel-channels",
action="store_true",
help="Enable parallel channel scanning for faster channel processing (scans multiple channels simultaneously)",
)
parser.add_argument(
"--channel-workers",
type=int,
default=3,
help="Number of parallel channel scanning workers (default: 3, max: 10)",
)
parser.add_argument(
"--songlist-only",
action="store_true",
@ -110,6 +221,16 @@ Examples:
metavar="PLAYLIST_TITLE",
help='Focus on specific playlists by title (e.g., --songlist-focus "2025 - Apple Top 50" "2024 - Billboard Hot 100")',
)
parser.add_argument(
"--songlist-file",
metavar="FILE_PATH",
help="Custom songlist file path to use with --songlist-focus (default: songList.json in the data directory)",
)
parser.add_argument(
"--force",
action="store_true",
help="Force download from channels regardless of whether songs are already downloaded, on server, or marked as duplicates",
)
parser.add_argument(
"--songlist-status",
action="store_true",
@ -146,7 +267,7 @@ Examples:
parser.add_argument(
"--latest-per-channel",
action="store_true",
help="Download the latest N videos from each channel (use with --limit)",
help="Download the latest N videos from each channel (use with --limit) [DEPRECATED: This is now the default behavior]",
)
parser.add_argument(
"--fuzzy-match",
@ -156,19 +277,50 @@ Examples:
parser.add_argument(
"--fuzzy-threshold",
type=int,
default=90,
help="Fuzzy match threshold (0-100, default 90)",
default=DEFAULT_FUZZY_THRESHOLD,
help=f"Fuzzy match threshold (0-100, default {DEFAULT_FUZZY_THRESHOLD})",
)
parser.add_argument(
"--parallel",
action="store_true",
help="Enable parallel downloads for improved speed",
help="Enable parallel downloads for improved speed (3-5x faster for large batches, defaults to 3 workers)",
)
parser.add_argument(
"--workers",
type=int,
default=3,
help="Number of parallel download workers (default: 3, max: 10)",
help="Number of parallel download workers (default: 3, max: 10, only used with --parallel)",
)
parser.add_argument(
"--generate-songlist",
nargs="+",
metavar="DIRECTORY",
help="Generate song list from MP4 files with ID3 tags in specified directories",
)
parser.add_argument(
"--no-append-songlist",
action="store_true",
help="Create a new song list instead of appending when using --generate-songlist",
)
parser.add_argument(
"--manual",
action="store_true",
help="Download from manual videos collection (manual_videos.json in the data directory)",
)
parser.add_argument(
"--channel-focus",
type=str,
help="Download from a specific channel by name (e.g., 'SingKingKaraoke')",
)
parser.add_argument(
"--all-videos",
action="store_true",
help="Download all videos from channel (not just songlist matches), skipping existing files",
)
parser.add_argument(
"--dry-run",
action="store_true",
help="Build download plan and show what would be downloaded without actually downloading anything",
)
args = parser.parse_args()
@@ -177,12 +329,42 @@ Examples:
print("❌ Error: --workers must be between 1 and 10")
sys.exit(1)
yt_dlp_path = Path("downloader/yt-dlp.exe")
if not yt_dlp_path.exists():
print("❌ Error: yt-dlp.exe not found in downloader/ directory")
print("Please ensure yt-dlp.exe is present in the downloader/ folder")
# Validate channel workers argument
if args.channel_workers < 1 or args.channel_workers > 10:
print("❌ Error: --channel-workers must be between 1 and 10")
sys.exit(1)
# Load configuration to get platform-aware yt-dlp path
from karaoke_downloader.config_manager import load_config
config = load_config()
yt_dlp_path = config.yt_dlp_path
# Check if it's a command string (like "python3 -m yt_dlp") or a file path
if yt_dlp_path.startswith("python"):  # "python3" is already covered by the "python" prefix
# It's a command string, test if it works
try:
import subprocess
cmd = yt_dlp_path.split() + ["--version"]
result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
if result.returncode != 0:
raise RuntimeError(f"Command failed: {result.stderr}")
except Exception as e:
platform_name = "macOS" if sys.platform == "darwin" else "Windows"
print(f"❌ Error: yt-dlp command failed: {yt_dlp_path}")
print(f"Please ensure yt-dlp is properly installed for {platform_name}")
print(f"Error: {e}")
sys.exit(1)
else:
# It's a file path, check if it exists
yt_dlp_file = Path(yt_dlp_path)
if not yt_dlp_file.exists():
platform_name = "macOS" if sys.platform == "darwin" else "Windows"
binary_name = yt_dlp_file.name
print(f"❌ Error: {binary_name} not found in downloader/ directory")
print(f"Please ensure {binary_name} is present in the downloader/ folder for {platform_name}")
print(f"Expected path: {yt_dlp_file}")
sys.exit(1)
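The startup check above branches on whether the configured yt-dlp value is a command string (e.g. `python3 -m yt_dlp`) or a binary path. A minimal self-contained sketch of that decision; `resolve_yt_dlp` is a hypothetical helper standing in for the inline CLI code:

```python
import subprocess
from pathlib import Path

def resolve_yt_dlp(yt_dlp_path: str) -> str:
    """Sketch of the CLI's yt-dlp validation: values starting with
    'python' are treated as a command string and probed with --version;
    anything else is treated as a file path and checked for existence.
    Returns 'command', 'file', or 'missing'."""
    if yt_dlp_path.startswith("python"):
        try:
            result = subprocess.run(
                yt_dlp_path.split() + ["--version"],
                capture_output=True, text=True, timeout=10,
            )
            return "command" if result.returncode == 0 else "missing"
        except (OSError, subprocess.TimeoutExpired):
            return "missing"
    return "file" if Path(yt_dlp_path).exists() else "missing"
```

Like the original, this only checks existence, not executability, and `Path.exists()` is also true for directories.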
downloader = KaraokeDownloader()
# Set parallel download options
@@ -210,9 +392,19 @@ Examples:
if args.songlist_focus:
downloader.songlist_focus_titles = args.songlist_focus
downloader.songlist_only = True # Enable songlist-only mode when focusing
args.songlist_only = True # Also set the args flag to ensure CLI logic works
print(
f"🎯 Songlist focus mode enabled for playlists: {', '.join(args.songlist_focus)}"
)
if args.songlist_file:
downloader.songlist_file_path = args.songlist_file
print(f"📁 Using custom songlist file: {args.songlist_file}")
if args.force:
downloader.force_download = True
print("💪 Force mode enabled - will download regardless of existing files or server duplicates")
if args.dry_run:
downloader.dry_run = True
print("🔍 Dry run mode enabled - will show download plan without downloading")
if args.resolution != "720p":
downloader.config_manager.update_resolution(args.resolution)
@@ -226,17 +418,16 @@ Examples:
sys.exit(0)
# --- END NEW ---
# --- NEW: If no URL or file is provided, but --songlist-only is set, use all channels in data/channels.txt ---
if args.songlist_only and not args.url and not args.file:
channels_file = Path("data/channels.txt")
if channels_file.exists():
args.file = str(channels_file)
# --- NEW: If no URL or file is provided, but --songlist-only is set, use all channels ---
if (args.songlist_only or args.songlist_focus) and not args.url and not args.file:
channel_urls = load_channels()
if channel_urls:
print(
"📋 No URL or --file provided, defaulting to all channels in data/channels.txt for songlist-only mode."
"📋 No URL or --file provided, defaulting to all configured channels for songlist mode."
)
else:
print(
"❌ No URL, --file, or data/channels.txt found. Please provide a channel URL or a file with channel URLs."
"❌ No URL, --file, or channel configuration found. Please provide a channel URL or create channels.json in the data directory."
)
sys.exit(1)
# --- END NEW ---
@@ -256,6 +447,22 @@ Examples:
print(" Songs will be re-checked against the server on next run.")
sys.exit(0)
if args.generate_songlist:
from karaoke_downloader.songlist_generator import SongListGenerator
print("🎵 Generating song list from MP4 files with ID3 tags...")
generator = SongListGenerator()
try:
generator.generate_songlist_from_multiple_directories(
args.generate_songlist,
append=not args.no_append_songlist
)
print("✅ Song list generation completed successfully!")
except Exception as e:
print(f"❌ Error generating song list: {e}")
sys.exit(1)
sys.exit(0)
if args.status:
stats = downloader.tracker.get_statistics()
print("🎤 Karaoke Downloader Status")
@@ -273,9 +480,10 @@ Examples:
print("💾 Channel Cache Information")
print("=" * 40)
print(f"Total Channels: {cache_info['total_channels']}")
print(f"Total Cached Videos: {cache_info['total_cached_videos']}")
print(f"Cache Duration: {cache_info['cache_duration_hours']} hours")
print(f"Last Updated: {cache_info['last_updated']}")
print(f"Total Cached Videos: {cache_info['total_videos']}")
print("\n📋 Channel Details:")
for channel in cache_info['channels']:
print(f"{channel['channel']}: {channel['videos']} videos (updated: {channel['last_updated']})")
sys.exit(0)
elif args.clear_cache:
if args.clear_cache == "all":
@@ -315,47 +523,77 @@ Examples:
if len(tracking) > 10:
print(f" ... and {len(tracking) - 10} more")
sys.exit(0)
elif args.songlist_only or args.songlist_focus:
# Use provided file or default to data/channels.txt
channel_file = args.file if args.file else "data/channels.txt"
if not os.path.exists(channel_file):
print(f"❌ Channel file not found: {channel_file}")
elif args.manual:
# Download from manual videos collection
print("🎤 Downloading from manual videos collection...")
success = downloader.download_channel_videos(
"manual://static",
force_refresh=args.refresh,
fuzzy_match=args.fuzzy_match,
fuzzy_threshold=args.fuzzy_threshold,
force_download=args.force,
)
elif args.channel_focus:
# Download from a specific channel by name
print(f"🎤 Looking up channel: {args.channel_focus}")
channel_url = get_channel_url_by_name(args.channel_focus)
if not channel_url:
print(f"❌ Channel '{args.channel_focus}' not found in configuration")
print("Available channels:")
channel_urls = load_channels()
for url in channel_urls:
if "/@" in url:
channel_name = url.split("/@")[1].split("/")[0]
print(channel_name)
sys.exit(1)
if args.all_videos:
# Download ALL videos from the channel (not just songlist matches)
print(f"🎤 Downloading ALL videos from channel: {args.channel_focus} ({channel_url})")
success = downloader.download_all_channel_videos(
channel_url,
force_refresh=args.refresh,
force_download=args.force,
limit=args.limit,
dry_run=args.dry_run,
)
else:
# Download only songlist matches from the channel
print(f"🎤 Downloading from channel: {args.channel_focus} ({channel_url})")
success = downloader.download_channel_videos(
channel_url,
force_refresh=args.refresh,
fuzzy_match=args.fuzzy_match,
fuzzy_threshold=args.fuzzy_threshold,
force_download=args.force,
dry_run=args.dry_run,
)
elif args.songlist_only or args.songlist_focus:
# Use provided file or default to channels configuration
channel_urls = load_channels(args.file)
if not channel_urls:
print("❌ No channels found in configuration")
sys.exit(1)
with open(channel_file, "r", encoding="utf-8") as f:
channel_urls = [
line.strip()
for line in f
if line.strip() and not line.strip().startswith("#")
]
limit = args.limit if args.limit else None
force_refresh_download_plan = (
args.force_download_plan if hasattr(args, "force_download_plan") else False
)
fuzzy_match = args.fuzzy_match if hasattr(args, "fuzzy_match") else False
fuzzy_threshold = (
args.fuzzy_threshold
if hasattr(args, "fuzzy_threshold")
else DEFAULT_FUZZY_THRESHOLD
)
success = downloader.download_songlist_across_channels(
channel_urls,
limit=limit,
force_refresh_download_plan=force_refresh_download_plan,
fuzzy_match=fuzzy_match,
fuzzy_threshold=fuzzy_threshold,
limit=args.limit,
force_refresh_download_plan=args.force_download_plan if hasattr(args, "force_download_plan") else False,
fuzzy_match=args.fuzzy_match,
fuzzy_threshold=args.fuzzy_threshold,
force_download=args.force,
show_pagination=args.show_pagination,
parallel_channels=args.parallel_channels,
max_channel_workers=args.channel_workers,
dry_run=args.dry_run,
)
elif args.latest_per_channel:
# Use provided file or default to data/channels.txt
channel_file = args.file if args.file else "data/channels.txt"
if not os.path.exists(channel_file):
print(f"❌ Channel file not found: {channel_file}")
# Use provided file or default to channels configuration
channel_urls = load_channels(args.file)
if not channel_urls:
print("❌ No channels found in configuration")
sys.exit(1)
with open(channel_file, "r", encoding="utf-8") as f:
channel_urls = [
line.strip()
for line in f
if line.strip() and not line.strip().startswith("#")
]
limit = args.limit if args.limit else DEFAULT_LATEST_PER_CHANNEL_LIMIT
force_refresh_download_plan = (
args.force_download_plan if hasattr(args, "force_download_plan") else False
@@ -372,14 +610,156 @@ Examples:
force_refresh_download_plan=force_refresh_download_plan,
fuzzy_match=fuzzy_match,
fuzzy_threshold=fuzzy_threshold,
force_download=args.force,
dry_run=args.dry_run,
)
elif args.url:
success = downloader.download_channel_videos(
args.url, force_refresh=args.refresh
args.url, force_refresh=args.refresh, dry_run=args.dry_run
)
else:
parser.print_help()
sys.exit(1)
# Default behavior: download from channels (equivalent to --latest-per-channel)
print("🎯 No specific mode specified, defaulting to download from channels")
channel_urls = load_channels(args.file)
if not channel_urls:
print("❌ No channels found in configuration")
print("Please provide a channel URL or create channels.json in the data directory")
sys.exit(1)
limit = args.limit if args.limit else DEFAULT_LATEST_PER_CHANNEL_LIMIT
force_refresh_download_plan = (
args.force_download_plan if hasattr(args, "force_download_plan") else False
)
fuzzy_match = args.fuzzy_match if hasattr(args, "fuzzy_match") else False
fuzzy_threshold = (
args.fuzzy_threshold
if hasattr(args, "fuzzy_threshold")
else DEFAULT_FUZZY_THRESHOLD
)
success = downloader.download_latest_per_channel(
channel_urls,
limit=limit,
force_refresh_download_plan=force_refresh_download_plan,
fuzzy_match=fuzzy_match,
fuzzy_threshold=fuzzy_threshold,
force_download=args.force,
dry_run=args.dry_run,
)
# Generate unmatched report if requested (additive feature)
if args.generate_unmatched_report:
from karaoke_downloader.download_planner import generate_unmatched_report, build_download_plan
from karaoke_downloader.songlist_manager import load_songlist
print("\n🔍 Generating unmatched songs report...")
# Load songlist based on focus mode
if args.songlist_focus:
# Load focused playlists
songlist_file_path = args.songlist_file if args.songlist_file else str(get_data_path_manager().get_songlist_path())
songlist_file = Path(songlist_file_path)
if not songlist_file.exists():
print(f"⚠️ Songlist file not found: {songlist_file_path}")
else:
try:
with open(songlist_file, "r", encoding="utf-8") as f:
raw_data = json.load(f)
# Filter playlists by title
focused_playlists = []
for playlist in raw_data:
playlist_title = playlist.get("title", "")
if playlist_title in args.songlist_focus:
focused_playlists.append(playlist)
if focused_playlists:
# Flatten the focused playlists into songs
focused_songs = []
seen = set()
for playlist in focused_playlists:
if "songs" in playlist:
for song in playlist["songs"]:
if "artist" in song and "title" in song:
artist = song["artist"].strip()
title = song["title"].strip()
key = f"{artist.lower()}_{title.lower()}"
if key in seen:
continue
seen.add(key)
focused_songs.append(
{
"artist": artist,
"title": title,
"position": song.get("position", 0),
}
)
songlist = focused_songs
else:
print(f"⚠️ No playlists found matching: {', '.join(args.songlist_focus)}")
songlist = []
except (json.JSONDecodeError, FileNotFoundError) as e:
print(f"⚠️ Could not load songlist for report: {e}")
songlist = []
else:
# Load all songs from songlist
songlist_path = args.songlist_file if args.songlist_file else str(get_data_path_manager().get_songlist_path())
songlist = load_songlist(songlist_path)
if songlist:
# Load channel URLs
channel_file = args.file if args.file else str(get_data_path_manager().get_channels_txt_path())
if os.path.exists(channel_file):
with open(channel_file, "r", encoding='utf-8') as f:
channel_urls = [
line.strip()
for line in f
if line.strip() and not line.strip().startswith("#")
]
print(f"📋 Analyzing {len(songlist)} songs against {len(channel_urls)} channels...")
# Build download plan to get unmatched songs
fuzzy_match = args.fuzzy_match if hasattr(args, "fuzzy_match") else False
fuzzy_threshold = (
args.fuzzy_threshold
if hasattr(args, "fuzzy_threshold")
else DEFAULT_FUZZY_THRESHOLD
)
try:
download_plan, unmatched = build_download_plan(
channel_urls,
songlist,
downloader.tracker,
downloader.yt_dlp_path,
fuzzy_match=fuzzy_match,
fuzzy_threshold=fuzzy_threshold,
)
if unmatched:
report_file = generate_unmatched_report(unmatched)
print("\n📋 Unmatched songs report generated successfully!")
print(f"📁 Report saved to: {report_file}")
print(f"📊 Summary: {len(download_plan)} songs found, {len(unmatched)} songs not found")
print("\n🔍 First 10 unmatched songs:")
for i, song in enumerate(unmatched[:10], 1):
print(f" {i:2d}. {song['artist']} - {song['title']}")
if len(unmatched) > 10:
print(f" ... and {len(unmatched) - 10} more songs")
else:
print(f"\n✅ All {len(songlist)} songs were found in the channels!")
except Exception as e:
print(f"❌ Error generating report: {e}")
else:
print(f"❌ Channel file not found: {channel_file}")
else:
print("❌ No songlist available for report generation")
# Initialize success variable
success = False
downloader.tracker.force_save()
if success:
print("\n🎤 All downloads completed successfully!")

View File

@@ -4,6 +4,8 @@ Provides centralized configuration loading, validation, and management.
"""
import json
import platform
import sys
from dataclasses import dataclass, field
from datetime import datetime
from pathlib import Path
@@ -34,6 +36,7 @@ DEFAULT_CONFIG = {
"folder_structure": {
"downloads_dir": "downloads",
"logs_dir": "logs",
"data_dir": "data",
"tracking_file": "data/karaoke_tracking.json",
},
"logging": {
@@ -42,6 +45,13 @@ DEFAULT_CONFIG = {
"include_console": True,
"include_file": True,
},
"platform_settings": {
"auto_detect_platform": True,
"yt_dlp_paths": {
"windows": "downloader/yt-dlp.exe",
"macos": "downloader/yt-dlp_macos"
}
},
"yt_dlp_path": "downloader/yt-dlp.exe",
}
@@ -55,6 +65,23 @@ RESOLUTION_MAP = {
}
def detect_platform() -> str:
"""Detect the current platform and return platform name."""
system = platform.system().lower()
if system == "windows":
return "windows"
elif system == "darwin":
return "macos"
else:
return "windows" # Default to Windows for other platforms
def get_platform_yt_dlp_path(platform_paths: Dict[str, str]) -> str:
"""Get the appropriate yt-dlp path for the current platform."""
platform_name = detect_platform()
return platform_paths.get(platform_name, platform_paths.get("windows", "downloader/yt-dlp.exe"))
@dataclass
class DownloadSettings:
"""Configuration for download settings."""
@@ -109,6 +136,7 @@ class FolderStructure:
downloads_dir: str = "downloads"
logs_dir: str = "logs"
data_dir: str = "data"
tracking_file: str = "data/karaoke_tracking.json"
@@ -139,14 +167,21 @@ class ConfigManager:
Manages application configuration with loading, validation, and caching.
"""
def __init__(self, config_file: Union[str, Path] = "data/config.json"):
def __init__(self, config_file: Union[str, Path] = "config/config.json", data_dir: Optional[str] = None):
"""
Initialize the configuration manager.
Args:
config_file: Path to the configuration file
data_dir: Optional custom data directory path
"""
self.config_file = Path(config_file)
# If config_file is relative and data_dir is provided, make it relative to data_dir
if data_dir and not Path(config_file).is_absolute():
self.config_file = Path(data_dir) / config_file
else:
self.config_file = Path(config_file)
self._data_dir = data_dir
self._config: Optional[AppConfig] = None
self._last_modified: Optional[datetime] = None
@@ -234,11 +269,21 @@
folder_structure = FolderStructure(**config_data.get("folder_structure", {}))
logging_config = LoggingConfig(**config_data.get("logging", {}))
# Handle platform-specific yt-dlp path
yt_dlp_path = config_data.get("yt_dlp_path", "downloader/yt-dlp.exe")
# Check if platform auto-detection is enabled
platform_settings = config_data.get("platform_settings", {})
if platform_settings.get("auto_detect_platform", True):
platform_paths = platform_settings.get("yt_dlp_paths", {})
if platform_paths:
yt_dlp_path = get_platform_yt_dlp_path(platform_paths)
return AppConfig(
download_settings=download_settings,
folder_structure=folder_structure,
logging=logging_config,
yt_dlp_path=config_data.get("yt_dlp_path", "downloader/yt-dlp.exe"),
yt_dlp_path=yt_dlp_path,
_config_file=self.config_file,
)
@@ -297,27 +342,35 @@
_config_manager: Optional[ConfigManager] = None
def get_config_manager() -> ConfigManager:
def get_config_manager(config_file: Optional[Union[str, Path]] = None, data_dir: Optional[str] = None) -> ConfigManager:
"""
Get the global configuration manager instance.
Args:
config_file: Optional path to config file (default: "config/config.json")
data_dir: Optional custom data directory path
Returns:
ConfigManager instance
"""
global _config_manager
if _config_manager is None:
_config_manager = ConfigManager()
if _config_manager is None or config_file is not None or data_dir is not None:
if config_file is None:
config_file = "config/config.json"
_config_manager = ConfigManager(config_file, data_dir)
return _config_manager
def load_config(force_reload: bool = False) -> AppConfig:
def load_config(force_reload: bool = False, config_file: Optional[Union[str, Path]] = None, data_dir: Optional[str] = None) -> AppConfig:
"""
Load configuration using the global manager.
Args:
force_reload: Force reload even if file hasn't changed
config_file: Optional path to config file (default: "config/config.json")
data_dir: Optional custom data directory path
Returns:
AppConfig instance
"""
return get_config_manager().load_config(force_reload)
return get_config_manager(config_file, data_dir).load_config(force_reload)
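`ConfigManager.__init__` above roots a relative `config_file` under `data_dir` when one is given. The precedence rule, sketched as a hypothetical standalone helper:

```python
from pathlib import Path
from typing import Optional

def resolve_config_file(config_file: str, data_dir: Optional[str]) -> Path:
    # Mirrors ConfigManager.__init__: a relative config_file is rooted
    # at data_dir when one is supplied; absolute paths are used as-is.
    if data_dir and not Path(config_file).is_absolute():
        return Path(data_dir) / config_file
    return Path(config_file)
```

So an absolute `config_file` always wins, and `data_dir` only changes where a relative one is looked up.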

View File

@@ -0,0 +1,184 @@
"""
Data path management utilities for the karaoke downloader.
Provides centralized data directory path management and file path resolution.
"""
import os
from pathlib import Path
from typing import Optional
from .config_manager import get_config_manager
class DataPathManager:
"""
Manages data directory paths and provides utilities for resolving file paths
relative to the configured data directory.
"""
def __init__(self, data_dir: Optional[str] = None):
"""
Initialize the data path manager.
Args:
data_dir: Optional custom data directory path. If None, uses config.
"""
self._data_dir = data_dir
# If a custom data directory is provided, look for config.json in that directory
if data_dir:
config_file = Path(data_dir) / "config.json"
self._config_manager = get_config_manager(str(config_file))
else:
# Otherwise, use the default config path ("config/config.json")
self._config_manager = get_config_manager()
@property
def data_dir(self) -> Path:
"""
Get the configured data directory path.
Returns:
Path to the data directory
"""
if self._data_dir:
return Path(self._data_dir)
# Get from config
config = self._config_manager.get_config()
data_dir = getattr(config.folder_structure, 'data_dir', 'data')
return Path(data_dir)
def get_path(self, filename: str) -> Path:
"""
Get the full path to a file in the data directory.
Args:
filename: Name of the file (e.g., 'config.json', 'channels.json')
Returns:
Full path to the file
"""
return self.data_dir / filename
def get_channels_json_path(self) -> Path:
"""Get path to channels.json file."""
return self.get_path('channels.json')
def get_channels_txt_path(self) -> Path:
"""Get path to channels.txt file."""
return self.get_path('channels.txt')
def get_songlist_path(self) -> Path:
"""Get path to songList.json file."""
return self.get_path('songList.json')
def get_songlist_tracking_path(self) -> Path:
"""Get path to songlist_tracking.json file."""
return self.get_path('songlist_tracking.json')
def get_karaoke_tracking_path(self) -> Path:
"""Get path to karaoke_tracking.json file."""
return self.get_path('karaoke_tracking.json')
def get_server_duplicates_tracking_path(self) -> Path:
"""Get path to server_duplicates_tracking.json file."""
return self.get_path('server_duplicates_tracking.json')
def get_manual_videos_path(self) -> Path:
"""Get path to manual_videos.json file."""
return self.get_path('manual_videos.json')
def get_songs_path(self) -> Path:
"""Get path to songs.json file."""
return self.get_path('songs.json')
def get_channel_cache_dir(self) -> Path:
"""Get path to channel_cache directory."""
return self.get_path('channel_cache')
def get_channel_cache_path(self, channel_id: str) -> Path:
"""Get path to a specific channel cache file."""
return self.get_channel_cache_dir() / f"{channel_id}.json"
def get_download_plan_cache_path(self, plan_name: str, **kwargs) -> Path:
"""Get path to download plan cache file."""
# Create a hash from kwargs for unique cache files
import hashlib
if kwargs:
kwargs_str = str(sorted(kwargs.items()))
hash_suffix = hashlib.md5(kwargs_str.encode()).hexdigest()[:8]
plan_name = f"{plan_name}_{hash_suffix}"
return self.get_path(f"plan_latest_per_channel_{plan_name}.json")
def get_unmatched_report_path(self, timestamp: Optional[str] = None) -> Path:
"""Get path to unmatched songs report file."""
if timestamp:
return self.get_path(f"unmatched_songs_report_{timestamp}.json")
return self.get_path("unmatched_songs_report.json")
def ensure_data_dir_exists(self) -> None:
"""Ensure the data directory exists."""
self.data_dir.mkdir(parents=True, exist_ok=True)
def list_data_files(self) -> list:
"""List all files in the data directory."""
if not self.data_dir.exists():
return []
files = []
for file_path in self.data_dir.iterdir():
if file_path.is_file():
files.append(file_path.name)
return sorted(files)
def file_exists(self, filename: str) -> bool:
"""Check if a file exists in the data directory."""
return self.get_path(filename).exists()
# Global data path manager instance
_data_path_manager: Optional[DataPathManager] = None
def get_data_path_manager(data_dir: Optional[str] = None) -> DataPathManager:
"""
Get the global data path manager instance.
Args:
data_dir: Optional custom data directory path
Returns:
DataPathManager instance
"""
global _data_path_manager
if _data_path_manager is None or data_dir is not None:
_data_path_manager = DataPathManager(data_dir)
return _data_path_manager
def get_data_path(filename: str, data_dir: Optional[str] = None) -> Path:
"""
Get the full path to a file in the data directory.
Args:
filename: Name of the file
data_dir: Optional custom data directory path
Returns:
Full path to the file
"""
return get_data_path_manager(data_dir).get_path(filename)
def get_data_dir(data_dir: Optional[str] = None) -> Path:
"""
Get the configured data directory path.
Args:
data_dir: Optional custom data directory path
Returns:
Path to the data directory
"""
return get_data_path_manager(data_dir).data_dir
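`get_download_plan_cache_path` above derives cache file names from sorted keyword arguments, so the same option set always maps to the same file. A self-contained sketch of that naming scheme (the helper name is illustrative):

```python
import hashlib
from pathlib import Path

def plan_cache_name(data_dir: str, plan_name: str, **kwargs) -> Path:
    # Same scheme as get_download_plan_cache_path: kwargs are sorted and
    # stringified before hashing, so argument order does not matter.
    if kwargs:
        suffix = hashlib.md5(str(sorted(kwargs.items())).encode()).hexdigest()[:8]
        plan_name = f"{plan_name}_{suffix}"
    return Path(data_dir) / f"plan_latest_per_channel_{plan_name}.json"
```

Because kwargs are sorted, `fuzzy=True, limit=5` and `limit=5, fuzzy=True` resolve to the same cache file.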

View File

@@ -20,6 +20,12 @@ from karaoke_downloader.youtube_utils import (
execute_yt_dlp_command,
show_available_formats,
)
from karaoke_downloader.file_utils import (
cleanup_temp_files,
get_unique_filename,
is_valid_mp4_file,
sanitize_filename,
)
class DownloadPipeline:
@@ -63,9 +69,15 @@
True if successful, False otherwise
"""
try:
# Step 1: Prepare file path
filename = sanitize_filename(artist, title)
output_path = self.downloads_dir / channel_name / filename
# Step 1: Prepare file path and check for existing files
output_path, file_exists = get_unique_filename(self.downloads_dir, channel_name, artist, title)
if file_exists:
print(f"⏭️ Skipping download - file already exists: {output_path.name}")
# Still add tags and track the existing file
if self._add_tags(output_path, artist, title, channel_name):
self._track_download(output_path, artist, title, video_id, channel_name)
return True
# Step 2: Download video
if not self._download_video(video_id, output_path, artist, title, channel_name):
@@ -214,8 +226,10 @@
) -> bool:
"""Step 3: Add ID3 tags to the downloaded file."""
try:
# Use the same artist/title as the filename for consistency
# Don't add "(Karaoke Version)" to the ID3 tag title
add_id3_tags(
output_path, f"{artist} - {title} (Karaoke Version)", channel_name
output_path, f"{artist} - {title}", channel_name
)
print(f"🏷️ Added ID3 tags: {artist} - {title}")
return True
@@ -283,9 +297,10 @@
video_title = video.get("title", "")
# Extract artist and title from video title
from karaoke_downloader.id3_utils import extract_artist_title
from karaoke_downloader.channel_parser import ChannelParser
artist, title = extract_artist_title(video_title)
channel_parser = ChannelParser()
artist, title = channel_parser.extract_artist_title(video_title, channel_name)
print(f" ({i}/{total}) Processing: {artist} - {title}")

View File

@@ -3,19 +3,31 @@ Download plan building utilities.
Handles pre-scanning channels and building download plans.
"""
import concurrent.futures
import hashlib
import json
import sys
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple
from karaoke_downloader.cache_manager import (
delete_plan_cache,
get_download_plan_cache_file,
load_cached_plan,
save_plan_cache,
)
# Import all fuzzy matching functions
from karaoke_downloader.fuzzy_matcher import (
create_song_key,
extract_artist_title,
create_video_key,
get_similarity_function,
is_exact_match,
is_fuzzy_match,
normalize_title,
)
from karaoke_downloader.channel_parser import ChannelParser
from karaoke_downloader.data_path_manager import get_data_path_manager
from karaoke_downloader.youtube_utils import get_channel_info
# Constants
@@ -23,6 +35,156 @@ DEFAULT_FILENAME_LENGTH_LIMIT = 100
DEFAULT_ARTIST_LENGTH_LIMIT = 30
DEFAULT_TITLE_LENGTH_LIMIT = 60
DEFAULT_FUZZY_THRESHOLD = 85
DEFAULT_DISPLAY_LIMIT = 10
def generate_unmatched_report(unmatched: List[Dict[str, Any]], report_path: Optional[str] = None) -> str:
"""
Generate a detailed report of unmatched songs and save it to a file.
Args:
unmatched: List of unmatched songs from build_download_plan
report_path: Optional path to save the report (default: a timestamped unmatched_songs_report_&lt;timestamp&gt;.json in the data directory)
Returns:
Path to the saved report file
"""
if report_path is None:
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
report_path = str(get_data_path_manager().get_unmatched_report_path(timestamp))
report_data = {
"generated_at": datetime.now().isoformat(),
"total_unmatched": len(unmatched),
"unmatched_songs": []
}
for song in unmatched:
report_data["unmatched_songs"].append({
"artist": song["artist"],
"title": song["title"],
"position": song.get("position", 0),
"search_key": create_song_key(song["artist"], song["title"])
})
# Sort by artist, then by title for easier reading
report_data["unmatched_songs"].sort(key=lambda x: (x["artist"].lower(), x["title"].lower()))
# Ensure the data directory exists
report_file = Path(report_path)
report_file.parent.mkdir(parents=True, exist_ok=True)
# Save the report
with open(report_file, 'w', encoding='utf-8') as f:
json.dump(report_data, f, indent=2, ensure_ascii=False)
return str(report_file)
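The report above orders entries case-insensitively by artist, then title. The ordering rule in isolation, with illustrative sample data:

```python
songs = [
    {"artist": "ZZ Top", "title": "Legs"},
    {"artist": "abba", "title": "Waterloo"},
    {"artist": "ABBA", "title": "SOS"},
]
# Same key as generate_unmatched_report: case-insensitive artist, then
# title, so "abba" and "ABBA" sort together.
songs.sort(key=lambda s: (s["artist"].lower(), s["title"].lower()))
print([f"{s['artist']} - {s['title']}" for s in songs])
# → ['ABBA - SOS', 'abba - Waterloo', 'ZZ Top - Legs']
```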
def _scan_channel_for_matches(
channel_url,
channel_name,
channel_id,
song_keys,
song_lookup,
fuzzy_match,
fuzzy_threshold,
show_pagination,
yt_dlp_path,
tracker,
):
"""
Scan a single channel for matches (used in parallel processing).
Args:
channel_url: URL of the channel to scan
channel_name: Name of the channel
channel_id: ID of the channel
song_keys: Set of song keys to match against
song_lookup: Dictionary mapping song keys to song data
fuzzy_match: Whether to use fuzzy matching
fuzzy_threshold: Threshold for fuzzy matching
show_pagination: Whether to show pagination progress
yt_dlp_path: Path to yt-dlp executable
tracker: Tracking manager instance
Returns:
List of video matches found in this channel
"""
print(f"\n🚦 Scanning channel: {channel_name} ({channel_url})")
# Get channel info if not provided
if not channel_name or not channel_id:
channel_name, channel_id = get_channel_info(channel_url)
# Fetch video list from channel
available_videos = tracker.get_channel_video_list(
channel_url, yt_dlp_path=str(yt_dlp_path), force_refresh=False, show_pagination=show_pagination
)
print(f" 📊 Channel has {len(available_videos)} videos to scan")
video_matches = []
# Pre-process video titles for efficient matching
channel_parser = ChannelParser()
if fuzzy_match:
# For fuzzy matching, create normalized video keys
for video in available_videos:
v_artist, v_title = channel_parser.extract_artist_title(video["title"], channel_name)
video_key = create_song_key(v_artist, v_title)
# Find best match among remaining songs
best_match = None
best_score = 0
for song_key in song_keys:
if song_key in song_lookup: # Only check unmatched songs
score = get_similarity_function()(song_key, video_key)
if score >= fuzzy_threshold and score > best_score:
best_score = score
best_match = song_key
if best_match:
song = song_lookup[best_match]
video_matches.append(
{
"artist": song["artist"],
"title": song["title"],
"channel_name": channel_name,
"channel_url": channel_url,
"video_id": video["id"],
"video_title": video["title"],
"match_score": best_score,
}
)
# Remove matched song from future consideration
del song_lookup[best_match]
song_keys.remove(best_match)
else:
# For exact matching, use direct key comparison
for video in available_videos:
v_artist, v_title = channel_parser.extract_artist_title(video["title"], channel_name)
video_key = create_song_key(v_artist, v_title)
if video_key in song_keys:
song = song_lookup[video_key]
video_matches.append(
{
"artist": song["artist"],
"title": song["title"],
"channel_name": channel_name,
"channel_url": channel_url,
"video_id": video["id"],
"video_title": video["title"],
"match_score": 100,
}
)
# Remove matched song from future consideration
del song_lookup[video_key]
song_keys.remove(video_key)
print(f" ✅ Found {len(video_matches)} matches in {channel_name}")
return video_matches
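In the exact-match branch above, each songlist entry can be claimed by at most one video: on a hit the key is removed from `song_lookup`/`song_keys`. A self-contained sketch of that claim-once pass; `song_key` is a simplified stand-in for `fuzzy_matcher.create_song_key`, and the videos here carry pre-split artist/title fields (the real code extracts them from the video title via `ChannelParser`):

```python
def song_key(artist: str, title: str) -> str:
    # Simplified stand-in for create_song_key; the real normalization
    # lives in karaoke_downloader.fuzzy_matcher.
    return f"{artist.strip().lower()}|{title.strip().lower()}"

def match_videos(videos, songs):
    lookup = {song_key(s["artist"], s["title"]): s for s in songs}
    matches = []
    for video in videos:
        key = song_key(video["artist"], video["title"])
        if key in lookup:
            # Claim the song and drop it so later videos cannot re-match it.
            matches.append({**lookup.pop(key), "video_id": video["id"]})
    return matches
```

The fuzzy branch replaces the dict hit with a best-score scan over the remaining keys, but keeps the same claim-once removal.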
def build_download_plan(
@@ -32,6 +194,9 @@
yt_dlp_path,
fuzzy_match=False,
fuzzy_threshold=DEFAULT_FUZZY_THRESHOLD,
show_pagination=False,
parallel_channels=False,
max_channel_workers=3,
):
"""
For each song in undownloaded, scan all channels for a match.
@@ -52,85 +217,200 @@
song_keys.add(key)
song_lookup[key] = song
for i, channel_url in enumerate(channel_urls, 1):
print(f"\n🚦 Starting channel {i}/{len(channel_urls)}: {channel_url}")
print(f" 🔍 Getting channel info...")
channel_name, channel_id = get_channel_info(channel_url)
print(f" ✅ Channel info: {channel_name} (ID: {channel_id})")
print(f" 🔍 Fetching video list from channel...")
available_videos = tracker.get_channel_video_list(
channel_url, yt_dlp_path=str(yt_dlp_path), force_refresh=False
)
print(
f" 📊 Channel has {len(available_videos)} videos to scan against {len(undownloaded)} songlist songs"
)
matches_this_channel = 0
video_matches = [] # Initialize video_matches for this channel
if parallel_channels:
print(f"🚀 Running parallel channel scanning with {max_channel_workers} workers.")
# Create a thread-safe copy of song data for parallel processing
import threading
song_keys_lock = threading.Lock()
song_lookup_lock = threading.Lock()
def scan_channel_safe(channel_url):
"""Thread-safe channel scanning function."""
print(f"\n🚦 Scanning channel: {channel_url}")
# Get channel info
channel_name, channel_id = get_channel_info(channel_url)
print(f" ✅ Channel info: {channel_name} (ID: {channel_id})")
# Fetch video list from channel
available_videos = tracker.get_channel_video_list(
channel_url, yt_dlp_path=str(yt_dlp_path), force_refresh=False, show_pagination=show_pagination
)
print(f" 📊 Channel has {len(available_videos)} videos to scan")
video_matches = []
# Pre-process video titles for efficient matching
channel_parser = ChannelParser()
if fuzzy_match:
# For fuzzy matching, create normalized video keys
for video in available_videos:
v_artist, v_title = channel_parser.extract_artist_title(video["title"], channel_name)
video_key = create_song_key(v_artist, v_title)
# Pre-process video titles for efficient matching
if fuzzy_match:
# For fuzzy matching, create normalized video keys
for video in available_videos:
v_artist, v_title = extract_artist_title(video["title"])
video_key = create_song_key(v_artist, v_title)
# Find best match among remaining songs (thread-safe)
best_match = None
best_score = 0
with song_keys_lock:
available_song_keys = list(song_keys) # Copy for iteration
for song_key in available_song_keys:
with song_lookup_lock:
if song_key in song_lookup: # Only check unmatched songs
score = get_similarity_function()(song_key, video_key)
if score >= fuzzy_threshold and score > best_score:
best_score = score
best_match = song_key
# Find best match among remaining songs
best_match = None
best_score = 0
for song_key in song_keys:
if song_key in song_lookup: # Only check unmatched songs
score = get_similarity_function()(song_key, video_key)
if score >= fuzzy_threshold and score > best_score:
best_score = score
best_match = song_key
if best_match:
with song_lookup_lock:
if best_match in song_lookup: # Double-check it's still available
song = song_lookup[best_match]
video_matches.append(
{
"artist": song["artist"],
"title": song["title"],
"channel_name": channel_name,
"channel_url": channel_url,
"video_id": video["id"],
"video_title": video["title"],
"match_score": best_score,
}
)
# Remove matched song from future consideration
del song_lookup[best_match]
with song_keys_lock:
song_keys.discard(best_match)
else:
# For exact matching, use direct key comparison
for video in available_videos:
v_artist, v_title = channel_parser.extract_artist_title(video["title"], channel_name)
video_key = create_song_key(v_artist, v_title)
if best_match:
song = song_lookup[best_match]
video_matches.append(
{
"artist": song["artist"],
"title": song["title"],
"channel_name": channel_name,
"channel_url": channel_url,
"video_id": video["id"],
"video_title": video["title"],
"match_score": best_score,
}
)
# Remove matched song from future consideration
del song_lookup[best_match]
song_keys.remove(best_match)
matches_this_channel += 1
else:
# For exact matching, use direct key comparison
for video in available_videos:
v_artist, v_title = extract_artist_title(video["title"])
video_key = create_song_key(v_artist, v_title)
with song_lookup_lock:
if video_key in song_keys and video_key in song_lookup:
song = song_lookup[video_key]
video_matches.append(
{
"artist": song["artist"],
"title": song["title"],
"channel_name": channel_name,
"channel_url": channel_url,
"video_id": video["id"],
"video_title": video["title"],
"match_score": 100,
}
)
# Remove matched song from future consideration
del song_lookup[video_key]
with song_keys_lock:
song_keys.discard(video_key)
print(f" ✅ Found {len(video_matches)} matches in {channel_name}")
return video_matches
# Execute parallel channel scanning
with concurrent.futures.ThreadPoolExecutor(max_workers=max_channel_workers) as executor:
# Submit all channel scanning tasks
future_to_channel = {
executor.submit(scan_channel_safe, channel_url): channel_url
for channel_url in channel_urls
}
# Process results as they complete
for future in concurrent.futures.as_completed(future_to_channel):
channel_url = future_to_channel[future]
try:
video_matches = future.result()
plan.extend(video_matches)
channel_name, _ = get_channel_info(channel_url)
channel_match_counts[channel_name] = len(video_matches)
except Exception as e:
print(f"⚠️ Error processing channel {channel_url}: {e}")
channel_name, _ = get_channel_info(channel_url)
channel_match_counts[channel_name] = 0
else:
for i, channel_url in enumerate(channel_urls, 1):
print(f"\n🚦 Starting channel {i}/{len(channel_urls)}: {channel_url}")
print(f" 🔍 Getting channel info...")
channel_name, channel_id = get_channel_info(channel_url)
print(f" ✅ Channel info: {channel_name} (ID: {channel_id})")
print(f" 🔍 Fetching video list from channel...")
available_videos = tracker.get_channel_video_list(
channel_url, yt_dlp_path=str(yt_dlp_path), force_refresh=False, show_pagination=show_pagination
)
print(
f" 📊 Channel has {len(available_videos)} videos to scan against {len(undownloaded)} songlist songs"
)
matches_this_channel = 0
video_matches = [] # Initialize video_matches for this channel
if video_key in song_keys:
song = song_lookup[video_key]
video_matches.append(
{
"artist": song["artist"],
"title": song["title"],
"channel_name": channel_name,
"channel_url": channel_url,
"video_id": video["id"],
"video_title": video["title"],
"match_score": 100,
}
)
# Remove matched song from future consideration
del song_lookup[video_key]
song_keys.remove(video_key)
matches_this_channel += 1
# Pre-process video titles for efficient matching
channel_parser = ChannelParser()
if fuzzy_match:
# For fuzzy matching, create normalized video keys
for video in available_videos:
v_artist, v_title = channel_parser.extract_artist_title(video["title"], channel_name)
video_key = create_song_key(v_artist, v_title)
# Add matches to plan
plan.extend(video_matches)
# Find best match among remaining songs
best_match = None
best_score = 0
for song_key in song_keys:
if song_key in song_lookup: # Only check unmatched songs
score = get_similarity_function()(song_key, video_key)
if score >= fuzzy_threshold and score > best_score:
best_score = score
best_match = song_key
# Print match count once per channel
channel_match_counts[channel_name] = matches_this_channel
print(f" → Found {matches_this_channel} songlist matches in this channel.")
if best_match:
song = song_lookup[best_match]
video_matches.append(
{
"artist": song["artist"],
"title": song["title"],
"channel_name": channel_name,
"channel_url": channel_url,
"video_id": video["id"],
"video_title": video["title"],
"match_score": best_score,
}
)
# Remove matched song from future consideration
del song_lookup[best_match]
song_keys.remove(best_match)
matches_this_channel += 1
else:
# For exact matching, use direct key comparison
for video in available_videos:
v_artist, v_title = channel_parser.extract_artist_title(video["title"], channel_name)
video_key = create_song_key(v_artist, v_title)
if video_key in song_keys:
song = song_lookup[video_key]
video_matches.append(
{
"artist": song["artist"],
"title": song["title"],
"channel_name": channel_name,
"channel_url": channel_url,
"video_id": video["id"],
"video_title": video["title"],
"match_score": 100,
}
)
# Remove matched song from future consideration
del song_lookup[video_key]
song_keys.remove(video_key)
matches_this_channel += 1
# Add matches to plan
plan.extend(video_matches)
# Print match count once per channel
channel_match_counts[channel_name] = matches_this_channel
print(f" → Found {matches_this_channel} songlist matches in this channel.")
# Remaining unmatched songs
unmatched = list(song_lookup.values())
@@ -143,4 +423,13 @@ def build_download_plan(
f" TOTAL: {sum(channel_match_counts.values())} matches across {len(channel_match_counts)} channels."
)
# Generate unmatched songs report if there are any
if unmatched:
try:
report_file = generate_unmatched_report(unmatched)
print(f"\n📋 Unmatched songs report saved to: {report_file}")
print(f"📋 Total unmatched songs: {len(unmatched)}")
except Exception as e:
print(f"⚠️ Could not generate unmatched songs report: {e}")
return plan, unmatched
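The fuzzy branch of `build_download_plan` keeps the highest-scoring still-unmatched song at or above `fuzzy_threshold`. The diff does not show `get_similarity_function()`, so this sketch substitutes `difflib.SequenceMatcher` scaled to the 0–100 range the surrounding code assumes:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Stand-in for get_similarity_function(); returns a 0-100 score.
    return SequenceMatcher(None, a, b).ratio() * 100

def best_fuzzy_match(video_key, song_keys, threshold=85):
    """Return (best_key, best_score), or (None, 0) if nothing clears the threshold."""
    best_key, best_score = None, 0
    for song_key in song_keys:
        score = similarity(song_key, video_key)
        if score >= threshold and score > best_score:
            best_key, best_score = song_key, score
    return best_key, best_score
```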

File diff suppressed because it is too large.


@@ -34,7 +34,6 @@ def sanitize_filename(
# Clean up title
safe_title = (
title.replace("(From ", "")
.replace(")", "")
.replace(" - ", " ")
.replace(":", "")
)
@@ -54,12 +53,19 @@
)
safe_artist = safe_artist.strip()
# Create filename
filename = f"{safe_artist} - {safe_title}.mp4"
# Create filename - handle empty artist case
if not safe_artist or safe_artist.strip() == "":
# If no artist, just use the title
filename = f"{safe_title}.mp4"
else:
filename = f"{safe_artist} - {safe_title}.mp4"
# Limit filename length if needed
if len(filename) > max_length:
filename = f"{safe_artist[:DEFAULT_ARTIST_LENGTH_LIMIT]} - {safe_title[:DEFAULT_TITLE_LENGTH_LIMIT]}.mp4"
if not safe_artist or safe_artist.strip() == "":
filename = f"{safe_title[:DEFAULT_TITLE_LENGTH_LIMIT]}.mp4"
else:
filename = f"{safe_artist[:DEFAULT_ARTIST_LENGTH_LIMIT]} - {safe_title[:DEFAULT_TITLE_LENGTH_LIMIT]}.mp4"
return filename
@@ -81,11 +87,19 @@ def generate_possible_filenames(
safe_title = sanitize_title_for_filenames(title)
safe_artist = artist.replace("'", "").replace('"', "").strip()
return [
f"{safe_artist} - {safe_title}.mp4", # Songlist mode
f"{channel_name} - {safe_title}.mp4", # Latest-per-channel mode
f"{safe_artist} - {safe_title} (Karaoke Version).mp4", # Channel videos mode
]
# Handle empty artist case
if not safe_artist or safe_artist.strip() == "":
return [
f"{safe_title}.mp4", # Songlist mode (no artist)
f"{channel_name} - {safe_title}.mp4", # Latest-per-channel mode
f"{safe_title} (Karaoke Version).mp4", # Channel videos mode (no artist)
]
else:
return [
f"{safe_artist} - {safe_title}.mp4", # Songlist mode
f"{channel_name} - {safe_title}.mp4", # Latest-per-channel mode
f"{safe_artist} - {safe_title} (Karaoke Version).mp4", # Channel videos mode
]
def sanitize_title_for_filenames(title: str) -> str:
@@ -112,6 +126,7 @@ def check_file_exists_with_patterns(
) -> Tuple[bool, Optional[Path]]:
"""
Check if a file exists using multiple possible filename patterns.
Also checks for files with (2), (3), etc. suffixes that yt-dlp might create.
Args:
downloads_dir: Base downloads directory
@@ -130,15 +145,56 @@
# Apply length limits if needed
safe_artist = artist.replace("'", "").replace('"', "").strip()
safe_title = sanitize_title_for_filenames(title)
filename = f"{safe_artist[:DEFAULT_ARTIST_LENGTH_LIMIT]} - {safe_title[:DEFAULT_TITLE_LENGTH_LIMIT]}.mp4"
if not safe_artist or safe_artist.strip() == "":
filename = f"{safe_title[:DEFAULT_TITLE_LENGTH_LIMIT]}.mp4"
else:
filename = f"{safe_artist[:DEFAULT_ARTIST_LENGTH_LIMIT]} - {safe_title[:DEFAULT_TITLE_LENGTH_LIMIT]}.mp4"
# Check for exact filename match
file_path = channel_dir / filename
if file_path.exists() and file_path.stat().st_size > 0:
return True, file_path
# Check for files with (2), (3), etc. suffixes
base_name = filename.replace(".mp4", "")
for suffix in range(2, 10): # Check up to (9)
suffixed_filename = f"{base_name} ({suffix}).mp4"
suffixed_path = channel_dir / suffixed_filename
if suffixed_path.exists() and suffixed_path.stat().st_size > 0:
return True, suffixed_path
return False, None
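`check_file_exists_with_patterns` probes the exact filename first and then the ` (2)` through ` (9)` variants that yt-dlp may create. The candidate names it checks can be sketched as a small generator (hypothetical helper name):

```python
def candidate_filenames(base_filename: str, max_suffix: int = 9):
    """Yield the exact name first, then the ' (2)' .. ' (9)' duplicate variants."""
    stem = base_filename[:-4] if base_filename.endswith(".mp4") else base_filename
    yield f"{stem}.mp4"
    for n in range(2, max_suffix + 1):
        yield f"{stem} ({n}).mp4"
```

Each candidate would then be joined to the channel directory and checked for existence and non-zero size, as in the function above.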
def get_unique_filename(
downloads_dir: Path, channel_name: str, artist: str, title: str
) -> Tuple[Path, bool]:
"""
Get a unique filename for download, checking for existing files including duplicates.
Args:
downloads_dir: Base downloads directory
channel_name: Channel name
artist: Song artist
title: Song title
Returns:
Tuple of (file_path, is_existing) where is_existing indicates if a file already exists
"""
filename = sanitize_filename(artist, title)
channel_dir = downloads_dir / channel_name
file_path = channel_dir / filename
# Check if file already exists
exists, existing_path = check_file_exists_with_patterns(downloads_dir, channel_name, artist, title)
if exists and existing_path:
print(f"📁 File already exists: {existing_path.name}")
return existing_path, True
return file_path, False
def ensure_directory_exists(directory: Path) -> None:
"""
Ensure a directory exists, creating it if necessary.


@@ -32,10 +32,72 @@ def normalize_title(title):
def extract_artist_title(video_title):
"""Extract artist and title from video title."""
"""
Extract artist and title from video title.
This function handles multiple common video title formats found on YouTube karaoke channels:
1. "Artist - Title" format: "38 Special - Hold On Loosely"
2. "Title Karaoke | Artist Karaoke Version" format: "Hold On Loosely Karaoke | 38 Special Karaoke Version"
3. "Title Artist KARAOKE" format: "Hold On Loosely 38 Special KARAOKE"
Args:
video_title (str): The YouTube video title to parse
Returns:
tuple: (artist, title) where artist and title are strings. If parsing fails,
artist will be empty string and title will be the full video title.
Examples:
>>> extract_artist_title("38 Special - Hold On Loosely")
("38 Special", "Hold On Loosely")
>>> extract_artist_title("Hold On Loosely Karaoke | 38 Special Karaoke Version")
("38 Special", "Hold On Loosely")
>>> extract_artist_title("Unknown Format Video Title")
("", "Unknown Format Video Title")
"""
# Handle "Artist - Title" format
if " - " in video_title:
parts = video_title.split(" - ", 1)
return parts[0].strip(), parts[1].strip()
# Handle "Title Karaoke | Artist Karaoke Version" format
if " | " in video_title and "karaoke" in video_title.lower():
parts = video_title.split(" | ", 1)
title_part = parts[0].strip()
artist_part = parts[1].strip()
# Clean up the parts
title = title_part.replace("Karaoke", "").strip()
artist = artist_part.replace("Karaoke Version", "").strip()
return artist, title
# Handle "Title Artist KARAOKE" format
if "karaoke" in video_title.lower():
# Look for patterns like "Title Artist KARAOKE"
# Simplified heuristic: treat the word immediately before "KARAOKE"
# as the artist and everything before that word as the title
words = video_title.split()
if len(words) >= 3:
for i, word in enumerate(words):
if "karaoke" in word.lower():
if i >= 2:
title = " ".join(words[:i-1])
artist = words[i-1]  # exclude the "KARAOKE" token itself
return artist, title
# If we can't parse it, return empty artist and full title
return "", video_title
# Default: return empty artist and full title
return "", video_title


@@ -7,17 +7,33 @@ except ImportError:
MUTAGEN_AVAILABLE = False
def extract_artist_title(video_title):
title = (
video_title.replace("(Karaoke Version)", "").replace("(Karaoke)", "").strip()
)
if " - " in title:
parts = title.split(" - ", 1)
if len(parts) == 2:
artist = parts[0].strip()
song_title = parts[1].strip()
return artist, song_title
return "Unknown Artist", title
def clean_channel_name(channel_name: str) -> str:
"""
Clean channel name for ID3 tagging by removing @ symbol and ensuring it's alpha-only.
Args:
channel_name: Raw channel name (may contain @ symbol)
Returns:
Cleaned channel name suitable for ID3 tags
"""
# Remove @ symbol if present
if channel_name.startswith('@'):
channel_name = channel_name[1:]
# Remove any non-alphanumeric characters and convert to single word
# Keep only letters, numbers, and spaces, then take the first word
cleaned = re.sub(r'[^a-zA-Z0-9\s]', '', channel_name)
words = cleaned.split()
if words:
return words[0] # Return only the first word
return "Unknown"
# Import the enhanced extract_artist_title function from fuzzy_matcher.py
# This ensures consistent parsing across all modules and supports multiple video title formats
from karaoke_downloader.fuzzy_matcher import extract_artist_title
def add_id3_tags(file_path, video_title, channel_name):
@@ -26,12 +42,13 @@ def add_id3_tags(file_path, video_title, channel_name):
return
try:
artist, title = extract_artist_title(video_title)
clean_channel = clean_channel_name(channel_name)
mp4 = MP4(str(file_path))
mp4["\xa9nam"] = title
mp4["\xa9ART"] = artist
mp4["\xa9alb"] = f"{channel_name} Karaoke"
mp4["\xa9alb"] = clean_channel # Use clean channel name only, no suffix
mp4["\xa9gen"] = "Karaoke"
mp4.save()
print(f"📝 Added ID3 tags: Artist='{artist}', Title='{title}'")
print(f"📝 Added ID3 tags: Artist='{artist}', Title='{title}', Album='{clean_channel}'")
except Exception as e:
print(f"⚠️ Could not add ID3 tags: {e}")
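`clean_channel_name` above reduces a channel handle to a single alphanumeric word for the album tag. A standalone sketch of the same steps:

```python
import re

def clean_channel(channel_name: str) -> str:
    """Strip a leading '@', drop punctuation, and keep only the first word."""
    if channel_name.startswith("@"):
        channel_name = channel_name[1:]
    # Keep only letters, digits, and spaces, then take the first word
    cleaned = re.sub(r"[^a-zA-Z0-9\s]", "", channel_name)
    words = cleaned.split()
    return words[0] if words else "Unknown"
```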


@@ -0,0 +1,83 @@
"""
Manual video manager for handling static video collections.
"""
import json
from pathlib import Path
from typing import Dict, List, Optional, Any
from karaoke_downloader.data_path_manager import get_data_path_manager
def load_manual_videos(manual_file: Optional[str] = None) -> List[Dict[str, Any]]:
"""
Load manual videos from the JSON file.
Args:
manual_file: Path to manual videos JSON file
Returns:
List of video dictionaries
"""
if manual_file is None:
manual_file = str(get_data_path_manager().get_manual_videos_path())
manual_path = Path(manual_file)
if not manual_path.exists():
print(f"⚠️ Manual videos file not found: {manual_file}")
return []
try:
with open(manual_path, 'r', encoding='utf-8') as f:
data = json.load(f)
videos = data.get("videos", [])
print(f"📋 Loaded {len(videos)} manual videos from {manual_file}")
return videos
except Exception as e:
print(f"❌ Error loading manual videos: {e}")
return []
def get_manual_videos_for_channel(channel_name: str, manual_file: Optional[str] = None) -> List[Dict[str, Any]]:
"""
Get manual videos for a specific channel.
Args:
channel_name: Channel name (should be "@ManualVideos")
manual_file: Path to manual videos JSON file
Returns:
List of video dictionaries
"""
if manual_file is None:
manual_file = str(get_data_path_manager().get_manual_videos_path())
if channel_name != "@ManualVideos":
return []
return load_manual_videos(manual_file)
def is_manual_channel(channel_url: str) -> bool:
"""
Check if a channel URL is a manual channel.
Args:
channel_url: Channel URL
Returns:
True if it's a manual channel
"""
return channel_url == "manual://static"
def get_manual_channel_info(channel_url: str) -> tuple[str, str]:
"""
Get channel info for manual channels.
Args:
channel_url: Channel URL
Returns:
Tuple of (channel_name, channel_id)
"""
if channel_url == "manual://static":
return "@ManualVideos", "manual"
return None, None


@@ -7,28 +7,40 @@ import json
from datetime import datetime
from pathlib import Path
from karaoke_downloader.data_path_manager import get_data_path_manager
def load_server_songs(songs_path="data/songs.json"):
"""Load the list of songs already available on the server."""
def load_server_songs(songs_path=None):
"""Load the list of songs already available on the server with format information."""
if songs_path is None:
songs_path = str(get_data_path_manager().get_songs_path())
songs_file = Path(songs_path)
if not songs_file.exists():
print(f"⚠️ Server songs file not found: {songs_path}")
return set()
return {}
try:
with open(songs_file, "r", encoding="utf-8") as f:
data = json.load(f)
server_songs = set()
server_songs = {}
for song in data:
if "artist" in song and "title" in song:
if "artist" in song and "title" in song and "path" in song:
artist = song["artist"].strip()
title = song["title"].strip()
path = song["path"].strip()
key = f"{artist.lower()}_{normalize_title(title)}"
server_songs.add(key)
server_songs[key] = {
"artist": artist,
"title": title,
"path": path,
"is_mp3": path.lower().endswith('.mp3'),
"is_cdg": 'cdg' in path.lower(),
"is_mp4": path.lower().endswith('.mp4')
}
print(f"📋 Loaded {len(server_songs)} songs from server (songs.json)")
return server_songs
except (json.JSONDecodeError, FileNotFoundError) as e:
print(f"⚠️ Could not load server songs: {e}")
return set()
return {}
def is_song_on_server(server_songs, artist, title):
@@ -37,9 +49,24 @@ def is_song_on_server(server_songs, artist, title):
return key in server_songs
def should_skip_server_song(server_songs, artist, title):
"""Check if a song should be skipped because it's already available as MP4 on server.
Returns True if the song should be skipped (MP4 format), False if it should be downloaded (MP3/CDG format)."""
key = f"{artist.lower()}_{normalize_title(title)}"
if key not in server_songs:
return False # Not on server, so don't skip
song_info = server_songs[key]
# Skip if it's an MP4 file (video format)
# Don't skip if it's MP3 or in CDG folder (different format)
return song_info.get("is_mp4", False) and not song_info.get("is_cdg", False)
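`should_skip_server_song` only skips songs whose server copy is already an MP4 video; MP3 or CDG copies are still worth downloading as video. A self-contained sketch, with `normalize` standing in for the real `normalize_title` and `classify` mirroring the path-based flags built in `load_server_songs`:

```python
def normalize(title: str) -> str:
    # Stand-in for normalize_title(): collapse whitespace, lowercase
    return " ".join(title.split()).lower()

def classify(path: str) -> dict:
    # Same format flags load_server_songs derives from the "path" field
    p = path.lower()
    return {"is_mp3": p.endswith(".mp3"), "is_cdg": "cdg" in p, "is_mp4": p.endswith(".mp4")}

def should_skip(server_songs: dict, artist: str, title: str) -> bool:
    """Skip only if the server copy is an MP4 video (not an MP3/CDG audio version)."""
    info = server_songs.get(f"{artist.lower()}_{normalize(title)}")
    if info is None:
        return False  # not on the server at all
    return info["is_mp4"] and not info["is_cdg"]
```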
def load_server_duplicates_tracking(
tracking_path="data/server_duplicates_tracking.json",
tracking_path=None,
):
"""Load the tracking of songs found to be duplicates on the server."""
if tracking_path is None:
tracking_path = str(get_data_path_manager().get_server_duplicates_tracking_path())
tracking_file = Path(tracking_path)
if not tracking_file.exists():
@@ -53,8 +80,10 @@
def save_server_duplicates_tracking(
tracking, tracking_path="data/server_duplicates_tracking.json"
tracking, tracking_path=None
):
"""Save the tracking of songs found to be duplicates on the server."""
if tracking_path is None:
tracking_path = str(get_data_path_manager().get_server_duplicates_tracking_path())
try:
with open(tracking_path, "w", encoding="utf-8") as f:
@@ -86,8 +115,9 @@ def mark_song_as_server_duplicate(tracking, artist, title, video_title, channel_
def check_and_mark_server_duplicate(
server_songs, server_duplicates_tracking, artist, title, video_title, channel_name
):
"""Check if a song is on server and mark it as duplicate if so. Returns True if it's a duplicate."""
if is_song_on_server(server_songs, artist, title):
"""Check if a song should be skipped because it's already available as MP4 on server and mark it as duplicate if so.
Returns True if it should be skipped (MP4 format), False if it should be downloaded (MP3/CDG format)."""
if should_skip_server_song(server_songs, artist, title):
if not is_song_marked_as_server_duplicate(
server_duplicates_tracking, artist, title
):


@@ -35,6 +35,7 @@ class SongValidator:
video_title: Optional[str] = None,
server_songs: Optional[Dict[str, Any]] = None,
server_duplicates_tracking: Optional[Dict[str, Any]] = None,
force_download: bool = False,
) -> Tuple[bool, Optional[str], int]:
"""
Check if a song should be skipped based on multiple criteria.
@@ -53,10 +54,15 @@
video_title: YouTube video title (optional)
server_songs: Server songs data (optional)
server_duplicates_tracking: Server duplicates tracking (optional)
force_download: If True, bypass all validation checks and force download
Returns:
Tuple of (should_skip, reason, total_filtered)
"""
# If force download is enabled, skip all validation checks
if force_download:
return False, None, 0
total_filtered = 0
# Check 1: Already downloaded by this system


@@ -0,0 +1,265 @@
import json
import os
from pathlib import Path
from typing import List, Dict, Any, Optional
from mutagen.mp4 import MP4
from karaoke_downloader.data_path_manager import get_data_path_manager
class SongListGenerator:
"""Utility class for generating song lists from MP4 files with ID3 tags."""
def __init__(self, songlist_path: Optional[str] = None):
if songlist_path is None:
songlist_path = str(get_data_path_manager().get_songlist_path())
self.songlist_path = Path(songlist_path)
self.songlist_path.parent.mkdir(parents=True, exist_ok=True)
def read_existing_songlist(self) -> List[Dict[str, Any]]:
"""Read existing song list from JSON file."""
if self.songlist_path.exists():
try:
with open(self.songlist_path, 'r', encoding='utf-8') as f:
return json.load(f)
except (json.JSONDecodeError, IOError) as e:
print(f"⚠️ Warning: Could not read existing songlist: {e}")
return []
return []
def save_songlist(self, songlist: List[Dict[str, Any]]) -> None:
"""Save song list to JSON file."""
try:
with open(self.songlist_path, 'w', encoding='utf-8') as f:
json.dump(songlist, f, indent=2, ensure_ascii=False)
print(f"✅ Song list saved to {self.songlist_path}")
except IOError as e:
print(f"❌ Error saving song list: {e}")
raise
def extract_id3_tags(self, mp4_path: Path) -> Optional[Dict[str, str]]:
"""Extract ID3 tags from MP4 file."""
try:
mp4 = MP4(str(mp4_path))
# Extract artist and title from ID3 tags
artist = mp4.get("\xa9ART", ["Unknown Artist"])[0]
title = mp4.get("\xa9nam", ["Unknown Title"])[0]
return {
"artist": artist,
"title": title
}
except Exception as e:
print(f"⚠️ Warning: Could not extract ID3 tags from {mp4_path.name}: {e}")
return None
def scan_directory_for_mp4_files(self, directory_path: str) -> List[Path]:
"""Scan directory for MP4 files."""
directory = Path(directory_path)
if not directory.exists():
raise FileNotFoundError(f"Directory not found: {directory_path}")
if not directory.is_dir():
raise ValueError(f"Path is not a directory: {directory_path}")
mp4_files = list(directory.glob("*.mp4"))
if not mp4_files:
print(f"⚠️ No MP4 files found in {directory_path}")
return []
print(f"📁 Found {len(mp4_files)} MP4 files in {directory.name}")
return sorted(mp4_files)
def generate_songlist_from_directory(self, directory_path: str, append: bool = True) -> Dict[str, Any]:
"""Generate a song list from MP4 files in a directory."""
directory = Path(directory_path)
directory_name = directory.name
# Scan for MP4 files
mp4_files = self.scan_directory_for_mp4_files(directory_path)
if not mp4_files:
return {}
# Extract ID3 tags and create songs list
songs = []
for index, mp4_file in enumerate(mp4_files, start=1):
id3_tags = self.extract_id3_tags(mp4_file)
if id3_tags:
song = {
"position": index,
"title": id3_tags["title"],
"artist": id3_tags["artist"]
}
songs.append(song)
print(f" {index:3d}. {id3_tags['artist']} - {id3_tags['title']}")
if not songs:
print("❌ No valid ID3 tags found in any MP4 files")
return {}
# Create the song list entry
songlist_entry = {
"title": directory_name,
"songs": songs
}
# Handle appending to existing song list
if append:
existing_songlist = self.read_existing_songlist()
# Check if a playlist with this title already exists
existing_index = None
for i, entry in enumerate(existing_songlist):
if entry.get("title") == directory_name:
existing_index = i
break
if existing_index is not None:
# Replace existing entry
print(f"🔄 Replacing existing playlist: {directory_name}")
existing_songlist[existing_index] = songlist_entry
else:
# Add new entry to the beginning of the list
print(f" Adding new playlist: {directory_name}")
existing_songlist.insert(0, songlist_entry)
self.save_songlist(existing_songlist)
else:
# Create new song list with just this entry
print(f"📝 Creating new song list with playlist: {directory_name}")
self.save_songlist([songlist_entry])
return songlist_entry
def generate_songlist_from_multiple_directories(self, directory_paths: List[str], append: bool = True) -> List[Dict[str, Any]]:
"""Generate song lists from multiple directories."""
results = []
errors = []
# Read existing song list once at the beginning
existing_songlist = self.read_existing_songlist() if append else []
for directory_path in directory_paths:
try:
print(f"\n📂 Processing directory: {directory_path}")
directory = Path(directory_path)
directory_name = directory.name
# Scan for MP4 files
mp4_files = self.scan_directory_for_mp4_files(directory_path)
if not mp4_files:
continue
# Extract ID3 tags and create songs list
songs = []
for index, mp4_file in enumerate(mp4_files, start=1):
id3_tags = self.extract_id3_tags(mp4_file)
if id3_tags:
song = {
"position": index,
"title": id3_tags["title"],
"artist": id3_tags["artist"]
}
songs.append(song)
print(f" {index:3d}. {id3_tags['artist']} - {id3_tags['title']}")
if not songs:
print("❌ No valid ID3 tags found in any MP4 files")
continue
# Create the song list entry
songlist_entry = {
"title": directory_name,
"songs": songs
}
# Check if a playlist with this title already exists
existing_index = None
for i, entry in enumerate(existing_songlist):
if entry.get("title") == directory_name:
existing_index = i
break
if existing_index is not None:
# Replace existing entry
print(f"🔄 Replacing existing playlist: {directory_name}")
existing_songlist[existing_index] = songlist_entry
else:
# Add new entry to the beginning of the list
print(f" Adding new playlist: {directory_name}")
existing_songlist.insert(0, songlist_entry)
results.append(songlist_entry)
except Exception as e:
error_msg = f"Error processing {directory_path}: {e}"
print(f"{error_msg}")
errors.append(error_msg)
# Save the final song list
if results:
if append:
# Save the updated existing song list
self.save_songlist(existing_songlist)
else:
# Create new song list with just the results
self.save_songlist(results)
# If there were any errors, raise an exception
if errors:
raise Exception(f"Failed to process {len(errors)} directories: {'; '.join(errors)}")
return results
def main():
"""CLI entry point for song list generation."""
import argparse
import sys
parser = argparse.ArgumentParser(
description="Generate song lists from MP4 files with ID3 tags",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
python -m karaoke_downloader.songlist_generator /path/to/mp4/directory
python -m karaoke_downloader.songlist_generator /path/to/dir1 /path/to/dir2 --no-append
python -m karaoke_downloader.songlist_generator /path/to/dir --songlist-path custom_songlist.json
"""
)
parser.add_argument(
"directories",
nargs="+",
help="Directory paths containing MP4 files with ID3 tags"
)
parser.add_argument(
"--no-append",
action="store_true",
help="Create a new song list instead of appending to existing one"
)
parser.add_argument(
"--songlist-path",
default=None,
help="Path to the song list JSON file (default: songList.json in the data directory)"
)
args = parser.parse_args()
try:
generator = SongListGenerator(args.songlist_path)
generator.generate_songlist_from_multiple_directories(
args.directories,
append=not args.no_append
)
print("\n✅ Song list generation completed successfully!")
except Exception as e:
print(f"\n❌ Error: {e}")
sys.exit(1)
if __name__ == "__main__":
main()
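The replace-or-prepend logic that `SongListGenerator` applies per playlist title can be isolated as a small helper (hypothetical name `upsert_playlist`):

```python
def upsert_playlist(songlist: list, entry: dict) -> list:
    """Replace the playlist with the same title in place, else prepend the new one."""
    for i, existing in enumerate(songlist):
        if existing.get("title") == entry["title"]:
            songlist[i] = entry  # replace existing playlist
            return songlist
    songlist.insert(0, entry)  # new playlists go to the front
    return songlist
```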


@@ -7,6 +7,7 @@ import json
from datetime import datetime
from pathlib import Path
from karaoke_downloader.data_path_manager import get_data_path_manager
from karaoke_downloader.server_manager import (
check_and_mark_server_duplicate,
is_song_marked_as_server_duplicate,
@@ -16,7 +17,9 @@ from karaoke_downloader.server_manager import (
)
def load_songlist(songlist_path="data/songList.json"):
def load_songlist(songlist_path=None):
if songlist_path is None:
songlist_path = str(get_data_path_manager().get_songlist_path())
songlist_file = Path(songlist_path)
if not songlist_file.exists():
print(f"⚠️ Songlist file not found: {songlist_path}")
@@ -55,7 +58,9 @@ def normalize_title(title):
return " ".join(normalized.split()).lower()
def load_songlist_tracking(tracking_path="data/songlist_tracking.json"):
def load_songlist_tracking(tracking_path=None):
if tracking_path is None:
tracking_path = str(get_data_path_manager().get_songlist_tracking_path())
tracking_file = Path(tracking_path)
if not tracking_file.exists():
return {}
@@ -67,7 +72,9 @@ def load_songlist_tracking(tracking_path="data/songlist_tracking.json"):
return {}
def save_songlist_tracking(tracking, tracking_path="data/songlist_tracking.json"):
def save_songlist_tracking(tracking, tracking_path=None):
if tracking_path is None:
tracking_path = str(get_data_path_manager().get_songlist_tracking_path())
try:
with open(tracking_path, "w", encoding="utf-8") as f:
json.dump(tracking, f, indent=2, ensure_ascii=False)


@@ -1,10 +1,12 @@ import threading
import threading
from enum import Enum
import json
from datetime import datetime
import os
import re
from datetime import datetime, timedelta
from enum import Enum
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple
from karaoke_downloader.data_path_manager import get_data_path_manager
class SongStatus(str, Enum):
NOT_DOWNLOADED = "NOT_DOWNLOADED"
@@ -25,46 +27,133 @@ class FormatType(str, Enum):
class TrackingManager:
def __init__(
self,
tracking_file="data/karaoke_tracking.json",
cache_file="data/channel_cache.json",
tracking_file=None,
cache_dir=None,
):
if tracking_file is None:
tracking_file = str(get_data_path_manager().get_karaoke_tracking_path())
if cache_dir is None:
cache_dir = str(get_data_path_manager().get_channel_cache_dir())
self.tracking_file = Path(tracking_file)
self.cache_file = Path(cache_file)
self.data = {"playlists": {}, "songs": {}}
self.cache = {}
self._lock = threading.Lock()
self._load()
self._load_cache()
self.cache_dir = Path(cache_dir)
# Ensure cache directory exists
self.cache_dir.mkdir(parents=True, exist_ok=True)
self.data = self._load()
print(f"📊 Tracking manager initialized with {len(self.data.get('songs', {}))} tracked songs")
def _load(self):
"""Load tracking data from JSON file."""
if self.tracking_file.exists():
try:
with open(self.tracking_file, "r", encoding="utf-8") as f:
self.data = json.load(f)
except Exception:
self.data = {"playlists": {}, "songs": {}}
return json.load(f)
except json.JSONDecodeError:
print("⚠️ Corrupted tracking file, creating new one")
return {"songs": {}, "playlists": {}, "last_updated": datetime.now().isoformat()}
def _save(self):
with self._lock:
with open(self.tracking_file, "w", encoding="utf-8") as f:
json.dump(self.data, f, indent=2, ensure_ascii=False)
"""Save tracking data to JSON file."""
self.data["last_updated"] = datetime.now().isoformat()
self.tracking_file.parent.mkdir(parents=True, exist_ok=True)
with open(self.tracking_file, "w", encoding="utf-8") as f:
json.dump(self.data, f, indent=2, ensure_ascii=False)
def force_save(self):
"""Force save the tracking data."""
self._save()
def _load_cache(self):
if self.cache_file.exists():
try:
with open(self.cache_file, "r", encoding="utf-8") as f:
self.cache = json.load(f)
except Exception:
self.cache = {}
def _get_channel_cache_file(self, channel_id: str) -> Path:
"""Get the cache file path for a specific channel."""
# Sanitize channel ID for filename
safe_channel_id = re.sub(r'[<>:"/\\|?*]', '_', channel_id)
return self.cache_dir / f"{safe_channel_id}.json"
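The sanitization rule above can be checked in isolation; a minimal sketch (the helper name `safe_cache_name` is hypothetical, the regex is the one used in `_get_channel_cache_file`):

```python
import re

def safe_cache_name(channel_id: str) -> str:
    # Replace characters that are invalid in Windows/macOS filenames with '_'.
    return re.sub(r'[<>:"/\\|?*]', '_', channel_id)

print(safe_cache_name('UC/karaoke|channel?'))  # UC_karaoke_channel_
```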
def save_cache(self):
with open(self.cache_file, "w", encoding="utf-8") as f:
json.dump(self.cache, f, indent=2, ensure_ascii=False)
def _load_channel_cache(self, channel_id: str) -> List[Dict[str, str]]:
"""Load cache for a specific channel."""
cache_file = self._get_channel_cache_file(channel_id)
if cache_file.exists():
try:
with open(cache_file, 'r', encoding='utf-8') as f:
data = json.load(f)
return data.get('videos', [])
except (json.JSONDecodeError, KeyError):
print(f" ⚠️ Corrupted cache file for {channel_id}, will recreate")
return []
return []
def _save_channel_cache(self, channel_id: str, videos: List[Dict[str, str]]):
"""Save cache for a specific channel."""
cache_file = self._get_channel_cache_file(channel_id)
data = {
'channel_id': channel_id,
'videos': videos,
'last_updated': datetime.now().isoformat(),
'video_count': len(videos)
}
with open(cache_file, 'w', encoding='utf-8') as f:
json.dump(data, f, indent=2, ensure_ascii=False)
def _clear_channel_cache(self, channel_id: str):
"""Clear cache for a specific channel."""
cache_file = self._get_channel_cache_file(channel_id)
if cache_file.exists():
cache_file.unlink()
print(f" 🗑️ Cleared cache file: {cache_file.name}")
def get_cache_info(self):
"""Get information about all channel cache files."""
cache_files = list(self.cache_dir.glob("*.json"))
total_videos = 0
cache_info = []
for cache_file in cache_files:
try:
with open(cache_file, 'r', encoding='utf-8') as f:
data = json.load(f)
video_count = len(data.get('videos', []))
total_videos += video_count
last_updated = data.get('last_updated', 'Unknown')
cache_info.append({
'channel': data.get('channel_id', cache_file.stem),
'videos': video_count,
'last_updated': last_updated,
'file': cache_file.name
})
except Exception as e:
print(f"⚠️ Error reading cache file {cache_file.name}: {e}")
return {
'total_channels': len(cache_files),
'total_videos': total_videos,
'channels': cache_info
}
def clear_channel_cache(self, channel_id=None):
"""Clear cache for a specific channel or all channels."""
if channel_id:
self._clear_channel_cache(channel_id)
print(f"🗑️ Cleared cache for channel: {channel_id}")
else:
# Clear all cache files
cache_files = list(self.cache_dir.glob("*.json"))
for cache_file in cache_files:
cache_file.unlink()
print(f"🗑️ Cleared all {len(cache_files)} channel cache files")
def set_cache_duration(self, hours):
"""Placeholder for cache duration logic"""
pass
def export_playlist_report(self, playlist_id):
"""Export a report for a specific playlist."""
pass
def get_statistics(self):
"""Get statistics about tracked songs."""
total_songs = len(self.data["songs"])
downloaded_songs = sum(
1
@ -102,11 +191,13 @@ class TrackingManager:
}
def get_playlist_songs(self, playlist_id):
"""Get songs for a specific playlist."""
return [
s for s in self.data["songs"].values() if s["playlist_id"] == playlist_id
]
def get_failed_songs(self, playlist_id=None):
"""Get failed songs, optionally filtered by playlist."""
if playlist_id:
return [
s
@ -118,6 +209,7 @@ class TrackingManager:
]
def get_partial_downloads(self, playlist_id=None):
"""Get partial downloads, optionally filtered by playlist."""
if playlist_id:
return [
s
@ -129,7 +221,7 @@ class TrackingManager:
]
def cleanup_orphaned_files(self, downloads_dir):
# Remove tracking entries for files that no longer exist
"""Remove tracking entries for files that no longer exist."""
orphaned = []
for song_id, song in list(self.data["songs"].items()):
file_path = song.get("file_path")
@ -139,51 +231,17 @@ class TrackingManager:
self.force_save()
return orphaned
def get_cache_info(self):
total_channels = len(self.cache)
total_cached_videos = sum(len(v) for v in self.cache.values())
cache_duration_hours = 24 # default
last_updated = None
return {
"total_channels": total_channels,
"total_cached_videos": total_cached_videos,
"cache_duration_hours": cache_duration_hours,
"last_updated": last_updated,
}
def clear_channel_cache(self, channel_id=None):
if channel_id is None or channel_id == "all":
self.cache = {}
else:
self.cache.pop(channel_id, None)
self.save_cache()
def set_cache_duration(self, hours):
# Placeholder for cache duration logic
pass
def export_playlist_report(self, playlist_id):
playlist = self.data["playlists"].get(playlist_id)
if not playlist:
return f"Playlist '{playlist_id}' not found."
songs = self.get_playlist_songs(playlist_id)
report = {"playlist": playlist, "songs": songs}
return json.dumps(report, indent=2, ensure_ascii=False)
def is_song_downloaded(self, artist, title, channel_name=None, video_id=None):
"""
Check if a song has already been downloaded by this system.
Returns True if the song exists in tracking with DOWNLOADED or CONVERTED status.
Check if a song has already been downloaded.
Returns True if the song exists in tracking with DOWNLOADED status.
"""
# If we have video_id and channel_name, try direct key lookup first (most efficient)
if video_id and channel_name:
song_key = f"{video_id}@{channel_name}"
if song_key in self.data["songs"]:
song_data = self.data["songs"][song_key]
if song_data.get("status") in [
SongStatus.DOWNLOADED,
SongStatus.CONVERTED,
]:
if song_data.get("status") == SongStatus.DOWNLOADED:
return True
# Fallback to content search (for cases where we don't have video_id)
@ -191,19 +249,14 @@ class TrackingManager:
# Check if this song matches the artist and title
if song_data.get("artist") == artist and song_data.get("title") == title:
# Check if it's marked as downloaded
if song_data.get("status") in [
SongStatus.DOWNLOADED,
SongStatus.CONVERTED,
]:
if song_data.get("status") == SongStatus.DOWNLOADED:
return True
# Also check the video title field which might contain the song info
video_title = song_data.get("video_title", "")
if video_title and artist in video_title and title in video_title:
if song_data.get("status") in [
SongStatus.DOWNLOADED,
SongStatus.CONVERTED,
]:
if song_data.get("status") == SongStatus.DOWNLOADED:
return True
return False
def is_file_exists(self, file_path):
@ -283,65 +336,359 @@ class TrackingManager:
self._save()
def get_channel_video_list(
self, channel_url, yt_dlp_path="downloader/yt-dlp.exe", force_refresh=False
self, channel_url, yt_dlp_path="downloader/yt-dlp.exe", force_refresh=False, show_pagination=False
):
"""
Return a list of videos (dicts with 'title' and 'id') for the channel, using cache if available unless force_refresh is True.
Args:
channel_url: YouTube channel URL
yt_dlp_path: Path to yt-dlp executable
force_refresh: Force refresh cache even if available
show_pagination: Show page-by-page progress (slower but more detailed)
"""
channel_name, channel_id = None, None
# Check if this is a manual channel
from karaoke_downloader.manual_video_manager import is_manual_channel, get_manual_channel_info, get_manual_videos_for_channel
if is_manual_channel(channel_url):
channel_name, channel_id = get_manual_channel_info(channel_url)
if channel_name and channel_id:
print(f" 📋 Loading manual videos for {channel_name}")
manual_videos = get_manual_videos_for_channel(channel_name)
# Convert to the expected format
videos = []
for video in manual_videos:
videos.append({
"title": video.get("title", ""),
"id": video.get("id", ""),
"url": video.get("url", "")
})
print(f" ✅ Loaded {len(videos)} manual videos")
return videos
else:
print(f" ❌ Could not get manual channel info for: {channel_url}")
return []
# Regular YouTube channel processing
from karaoke_downloader.youtube_utils import get_channel_info
channel_name, channel_id = get_channel_info(channel_url)
if not channel_id:
print(f" ❌ Could not extract channel ID from URL: {channel_url}")
return []
# Try multiple possible cache keys
possible_keys = [
channel_id, # The extracted channel ID
channel_url, # The full URL
channel_name, # The extracted channel name
]
print(f" 🔍 Channel: {channel_name} (ID: {channel_id})")
cache_key = None
for key in possible_keys:
if key and key in self.cache:
cache_key = key
break
# Check if we have cached data for this channel
if not force_refresh:
cached_videos = self._load_channel_cache(channel_id)
if cached_videos:
# Validate that the cached data has proper video IDs
corrupted = False
# Check if any video IDs look like titles instead of proper YouTube IDs
for video in cached_videos[:20]: # Check first 20 videos
video_id = video.get("id", "")
# More comprehensive validation - YouTube IDs should be 11 characters and contain only alphanumeric, hyphens, and underscores
if video_id and (
len(video_id) != 11 or
not video_id.replace('-', '').replace('_', '').isalnum() or
" " in video_id or
"Lyrics" in video_id or
"KARAOKE" in video_id.upper() or
"Vocal" in video_id or
"Guide" in video_id
):
print(f" ⚠️ Detected corrupted video ID in cache: '{video_id}'")
corrupted = True
break
if corrupted:
print(f" 🧹 Clearing corrupted cache for {channel_id}")
self._clear_channel_cache(channel_id)
force_refresh = True
else:
print(f" 📋 Using cached video list ({len(cached_videos)} videos)")
return cached_videos
if not cache_key:
cache_key = channel_id or channel_url # Use as fallback for new entries
print(f" 🔍 Trying cache keys: {possible_keys}")
print(f" 🔍 Selected cache key: '{cache_key}'")
if not force_refresh and cache_key in self.cache:
print(
f" 📋 Using cached video list ({len(self.cache[cache_key])} videos)"
)
return self.cache[cache_key]
# Choose fetch method based on show_pagination flag
if show_pagination:
return self._fetch_videos_with_pagination(channel_url, channel_id, yt_dlp_path)
else:
print(f" ❌ Cache miss for all keys")
return self._fetch_videos_flat_playlist(channel_url, channel_id, yt_dlp_path)
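The page windows the pagination path hands to `--playlist-start`/`--playlist-end` follow directly from the page number; a quick sketch of that arithmetic (the helper name `page_window` is hypothetical):

```python
from typing import Tuple

def page_window(page: int, per_page: int = 200) -> Tuple[int, int]:
    # 1-based inclusive range passed as --playlist-start / --playlist-end.
    return (page - 1) * per_page + 1, page * per_page

print(page_window(1))  # (1, 200)
print(page_window(3))  # (401, 600)
```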
def _fetch_videos_with_pagination(self, channel_url, channel_id, yt_dlp_path):
"""Fetch videos showing page-by-page progress."""
print(f" 🌐 Fetching video list from YouTube (page-by-page mode)...")
print(f" 📡 Channel URL: {channel_url}")
import subprocess
all_videos = []
page = 1
videos_per_page = 200 # YouTube/yt-dlp supports up to 200 videos per page, reducing API calls and errors
while True:
print(f" 📄 Fetching page {page}...")
# Fetch one page at a time
cmd = [
yt_dlp_path,
"--flat-playlist",
"--print",
"%(title)s|%(id)s|%(url)s",
"--playlist-start",
str((page - 1) * videos_per_page + 1),
"--playlist-end",
str(page * videos_per_page),
channel_url,
]
try:
# Increased timeout to 180 seconds for larger pages (200 videos)
result = subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=180)
lines = result.stdout.strip().splitlines()
# Save raw output for debugging (for each page)
raw_output_file = self._get_channel_cache_file(channel_id).parent / f"{channel_id}_raw_output_page{page}.txt"
try:
with open(raw_output_file, 'w', encoding='utf-8') as f:
f.write(f"# Raw yt-dlp output for {channel_id} - Page {page}\n")
f.write(f"# Channel URL: {channel_url}\n")
f.write(f"# Command: {' '.join(cmd)}\n")
f.write(f"# Timestamp: {datetime.now().isoformat()}\n")
f.write(f"# Total lines: {len(lines)}\n")
f.write("#" * 80 + "\n\n")
for i, line in enumerate(lines, 1):
f.write(f"{i:6d}: {line}\n")
print(f" 💾 Saved raw output to: {raw_output_file.name}")
except Exception as e:
print(f" ⚠️ Could not save raw output: {e}")
if not lines:
print(f" ✅ No more videos found on page {page}")
break
print(f" 📊 Page {page}: Found {len(lines)} videos")
page_videos = []
invalid_count = 0
for line in lines:
if not line.strip():
continue
# More robust parsing that handles titles with | characters
# Extract video ID directly from the URL that yt-dlp provides
# Find the URL and extract video ID from it
url_match = re.search(r'https://www\.youtube\.com/watch\?v=([a-zA-Z0-9_-]{11})', line)
if not url_match:
continue
# Extract video ID directly from the URL
video_id = url_match.group(1)
# Extract title (everything before the video ID in the line)
title = line[:line.find(video_id)].rstrip('|').strip()
# Validate video ID
if video_id and (
len(video_id) == 11 and
video_id.replace('-', '').replace('_', '').isalnum() and
" " not in video_id and
"Lyrics" not in video_id and
"KARAOKE" not in video_id.upper() and
"Vocal" not in video_id and
"Guide" not in video_id
):
page_videos.append({"title": title, "id": video_id})
else:
invalid_count += 1
if invalid_count <= 3: # Show first 3 invalid IDs per page
print(f" ⚠️ Invalid ID: '{video_id}' for '{title[:50]}...'")
if invalid_count > 3:
print(f" ⚠️ ... and {invalid_count - 3} more invalid IDs on this page")
all_videos.extend(page_videos)
print(f" ✅ Page {page}: Added {len(page_videos)} valid videos (total: {len(all_videos)})")
# If we got fewer videos than expected, we're probably at the end
if len(lines) < videos_per_page:
print(f" 🏁 Reached end of channel (last page had {len(lines)} videos)")
break
page += 1
# Safety check to prevent infinite loops
if page > 50: # Max 50 pages (10,000 videos with 200 per page)
print(f" ⚠️ Reached maximum page limit (50 pages), stopping")
break
except subprocess.TimeoutExpired:
print(f" ⚠️ Page {page} timed out, stopping")
break
except subprocess.CalledProcessError as e:
print(f" ❌ Error fetching page {page}: {e}")
break
except KeyboardInterrupt:
print(f" ⏹️ User interrupted, stopping at page {page}")
break
if not all_videos:
print(f" ❌ No valid videos found")
return []
print(f" 🎉 Channel download complete!")
print(f" 📊 Total videos fetched: {len(all_videos)}")
# Save to individual channel cache file
self._save_channel_cache(channel_id, all_videos)
print(f" 💾 Saved cache to: {self._get_channel_cache_file(channel_id).name}")
return all_videos
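The URL-anchored parsing used above (extract the 11-character ID from the watch URL, then take everything before its first occurrence as the title) survives titles that themselves contain `|`; a standalone sketch with a hypothetical `parse_line` helper:

```python
import re

WATCH_RE = re.compile(r'https://www\.youtube\.com/watch\?v=([a-zA-Z0-9_-]{11})')

def parse_line(line: str):
    m = WATCH_RE.search(line)
    if not m:
        return None
    video_id = m.group(1)
    # Title is everything before the first occurrence of the ID (the id field).
    title = line[:line.find(video_id)].rstrip('|').strip()
    return {"title": title, "id": video_id}

line = "Song | With Pipes (Karaoke)|dQw4w9WgXcQ|https://www.youtube.com/watch?v=dQw4w9WgXcQ"
print(parse_line(line))  # {'title': 'Song | With Pipes (Karaoke)', 'id': 'dQw4w9WgXcQ'}
```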
def _fetch_videos_flat_playlist(self, channel_url, channel_id, yt_dlp_path):
"""Fetch all videos using flat playlist (faster but less detailed progress)."""
# Fetch with yt-dlp
print(f" 🌐 Fetching video list from YouTube (this may take a while)...")
print(f" 📡 Channel URL: {channel_url}")
import subprocess
from karaoke_downloader.youtube_utils import _parse_yt_dlp_command
cmd = [
yt_dlp_path,
# First, let's get the total count to show progress
count_cmd = _parse_yt_dlp_command(yt_dlp_path) + [
"--flat-playlist",
"--print",
"%(title)s",
"--playlist-end",
"1", # Just get first video to test
channel_url,
]
try:
print(f" 🔍 Testing channel access...")
test_result = subprocess.run(count_cmd, capture_output=True, text=True, timeout=30)
if test_result.returncode == 0:
print(f" ✅ Channel is accessible")
else:
print(f" ⚠️ Channel test failed: {test_result.stderr}")
except subprocess.TimeoutExpired:
print(f" ⚠️ Channel test timed out")
except Exception as e:
print(f" ⚠️ Channel test error: {e}")
# Now fetch all videos with progress indicators
cmd = _parse_yt_dlp_command(yt_dlp_path) + [
"--flat-playlist",
"--print",
"%(title)s|%(id)s|%(url)s",
"--verbose", # Add verbose output to see what's happening
channel_url,
]
try:
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
print(f" 🔧 Running yt-dlp command: {' '.join(cmd)}")
print(f" 📥 Starting video list download...")
# Use a timeout and show progress
result = subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=300)
lines = result.stdout.strip().splitlines()
# Save raw output for debugging
raw_output_file = self._get_channel_cache_file(channel_id).parent / f"{channel_id}_raw_output.txt"
try:
with open(raw_output_file, 'w', encoding='utf-8') as f:
f.write(f"# Raw yt-dlp output for {channel_id}\n")
f.write(f"# Channel URL: {channel_url}\n")
f.write(f"# Command: {' '.join(cmd)}\n")
f.write(f"# Timestamp: {datetime.now().isoformat()}\n")
f.write(f"# Total lines: {len(lines)}\n")
f.write("#" * 80 + "\n\n")
for i, line in enumerate(lines, 1):
f.write(f"{i:6d}: {line}\n")
print(f" 💾 Saved raw output to: {raw_output_file.name}")
except Exception as e:
print(f" ⚠️ Could not save raw output: {e}")
print(f" 📄 Raw output lines: {len(lines)}")
print(f" 📊 Download completed successfully!")
# Show some sample lines to understand the format
if lines:
print(f" 📋 Sample output format:")
for i, line in enumerate(lines[:3]):
print(f" Line {i+1}: {line[:100]}...")
if len(lines) > 3:
print(f" ... and {len(lines) - 3} more lines")
videos = []
for line in lines:
parts = line.split("|")
if len(parts) >= 2:
title, video_id = parts[0].strip(), parts[1].strip()
invalid_count = 0
print(f" 🔍 Processing {len(lines)} video entries...")
for i, line in enumerate(lines):
if i % 1000 == 0 and i > 0: # Progress indicator every 1000 lines
print(f" 📊 Processing line {i}/{len(lines)}... ({i/len(lines)*100:.1f}%)")
# More robust parsing that handles titles with | characters
# Extract video ID directly from the URL that yt-dlp provides
# Find the URL and extract video ID from it
url_match = re.search(r'https://www\.youtube\.com/watch\?v=([a-zA-Z0-9_-]{11})', line)
if not url_match:
invalid_count += 1
if invalid_count <= 5:
print(f" ⚠️ Skipping line with no URL: '{line[:100]}...'")
elif invalid_count == 6:
print(f" ⚠️ ... and {len(lines) - i - 1} more invalid lines")
continue
# Extract video ID directly from the URL
video_id = url_match.group(1)
# Extract title (everything before the video ID in the line)
title = line[:line.find(video_id)].rstrip('|').strip()
# Validate video ID
if video_id and (
len(video_id) == 11 and
video_id.replace('-', '').replace('_', '').isalnum() and
" " not in video_id and
"Lyrics" not in video_id and
"KARAOKE" not in video_id.upper() and
"Vocal" not in video_id and
"Guide" not in video_id
):
videos.append({"title": title, "id": video_id})
self.cache[cache_key] = videos
self.save_cache()
else:
invalid_count += 1
if invalid_count <= 5: # Only show first 5 invalid IDs
print(f" ⚠️ Skipping invalid video ID: '{video_id}' for title: '{title[:50]}...'")
elif invalid_count == 6:
print(f" ⚠️ ... and {len(lines) - i - 1} more invalid IDs")
if not videos:
print(f" ❌ No valid videos found after parsing")
return []
print(f" ✅ Parsed {len(videos)} valid videos from YouTube")
print(f" ⚠️ Skipped {invalid_count} invalid video IDs")
# Save to individual channel cache file
self._save_channel_cache(channel_id, videos)
print(f" 💾 Saved cache to: {self._get_channel_cache_file(channel_id).name}")
return videos
except subprocess.TimeoutExpired:
print(f"❌ yt-dlp timed out after 5 minutes - channel may be too large")
return []
except subprocess.CalledProcessError as e:
print(f"❌ yt-dlp failed to fetch playlist for cache: {e}")
print(f" 📄 stderr: {e.stderr}")
return []


@ -106,6 +106,10 @@ def download_single_video(
print(f"⬇️ Downloading: {artist} - {title} -> {output_path}")
video_url = f"https://www.youtube.com/watch?v={video_id}"
# Debug: Show the video_id and URL being used
print(f"🔍 DEBUG: video_id = '{video_id}'")
print(f"🔍 DEBUG: video_url = '{video_url}'")
# Build command using centralized utility
cmd = build_yt_dlp_command(yt_dlp_path, video_url, output_path, config)
@ -255,7 +259,7 @@ def execute_download_plan(
video_id = item["video_id"]
video_title = item["video_title"]
print(f"\n⬇️ Downloading {len(download_plan) - idx} of {total_to_download}:")
print(f"\n⬇️ Downloading {downloaded_count + 1} of {total_to_download}:")
print(f" 📋 Songlist: {artist} - {title}")
print(f" 🎬 Video: {video_title} ({channel_name})")
if "match_score" in item:


@ -9,6 +9,19 @@ from typing import Any, Dict, List, Optional, Union
from karaoke_downloader.config_manager import AppConfig
def _parse_yt_dlp_command(yt_dlp_path: str) -> List[str]:
"""
Parse yt-dlp path/command into a list of command arguments.
Handles both file paths and command strings like 'python3 -m yt_dlp'.
"""
if yt_dlp_path.startswith(('python', 'python3')):
# It's a Python module command
return yt_dlp_path.split()
else:
# It's a file path
return [yt_dlp_path]
def get_channel_info(
channel_url: str, yt_dlp_path: str = "downloader/yt-dlp.exe"
) -> tuple[str, str]:
@ -43,7 +56,7 @@ def get_playlist_info(
) -> List[Dict[str, Any]]:
"""Get playlist information using yt-dlp."""
try:
cmd = [yt_dlp_path, "--dump-json", "--flat-playlist", playlist_url]
cmd = _parse_yt_dlp_command(yt_dlp_path) + ["--dump-json", "--flat-playlist", playlist_url]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
videos = []
for line in result.stdout.strip().split("\n"):
@ -75,8 +88,7 @@ def build_yt_dlp_command(
Returns:
List of command arguments for subprocess.run
"""
cmd = [
str(yt_dlp_path),
cmd = _parse_yt_dlp_command(yt_dlp_path) + [
"--no-check-certificates",
"--ignore-errors",
"--no-warnings",
@ -128,7 +140,7 @@ def show_available_formats(
timeout: Timeout in seconds
"""
print(f"🔍 Checking available formats for: {video_url}")
format_cmd = [str(yt_dlp_path), "--list-formats", video_url]
format_cmd = _parse_yt_dlp_command(yt_dlp_path) + ["--list-formats", video_url]
try:
format_result = subprocess.run(
format_cmd, capture_output=True, text=True, timeout=timeout

setup_macos.py Normal file

@ -0,0 +1,220 @@
#!/usr/bin/env python3
"""
macOS setup script for Karaoke Video Downloader.
This script helps users set up yt-dlp and FFmpeg on macOS.
"""
import os
import sys
import subprocess
from pathlib import Path
def check_ffmpeg():
"""Check if FFmpeg is installed."""
try:
result = subprocess.run(["ffmpeg", "-version"], capture_output=True, text=True, timeout=10)
return result.returncode == 0
except (subprocess.TimeoutExpired, FileNotFoundError):
return False
def check_yt_dlp():
"""Check if yt-dlp is installed via pip or binary."""
# Check pip installation
try:
result = subprocess.run([sys.executable, "-m", "yt_dlp", "--version"],
capture_output=True, text=True, timeout=10)
if result.returncode == 0:
return True
except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
pass
# Check binary file
binary_path = Path("downloader/yt-dlp_macos")
if binary_path.exists():
try:
result = subprocess.run([str(binary_path), "--version"],
capture_output=True, text=True, timeout=10)
return result.returncode == 0
except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
pass
return False
def install_ffmpeg():
"""Install FFmpeg via Homebrew."""
print("🎬 Installing FFmpeg...")
# Check if Homebrew is installed
try:
subprocess.run(["brew", "--version"], capture_output=True, check=True)
except (subprocess.CalledProcessError, FileNotFoundError):
print("❌ Homebrew is not installed. Please install Homebrew first:")
print(" /bin/bash -c \"$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)\"")
return False
try:
print("🍺 Installing FFmpeg via Homebrew...")
result = subprocess.run(["brew", "install", "ffmpeg"],
capture_output=True, text=True, check=True)
print("✅ FFmpeg installed successfully!")
return True
except subprocess.CalledProcessError as e:
print(f"❌ Failed to install FFmpeg: {e}")
return False
def download_yt_dlp_binary():
"""Download yt-dlp binary for macOS."""
print("📥 Downloading yt-dlp binary for macOS...")
# Create downloader directory if it doesn't exist
downloader_dir = Path("downloader")
downloader_dir.mkdir(exist_ok=True)
# Download yt-dlp binary
binary_path = downloader_dir / "yt-dlp_macos"
url = "https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp_macos"
try:
print(f"📡 Downloading from: {url}")
result = subprocess.run(["curl", "-L", "-o", str(binary_path), url],
capture_output=True, text=True, check=True)
# Make it executable
binary_path.chmod(0o755)
print(f"✅ yt-dlp binary downloaded to: {binary_path}")
# Test the binary
test_result = subprocess.run([str(binary_path), "--version"],
capture_output=True, text=True, timeout=10)
if test_result.returncode == 0:
version = test_result.stdout.strip()
print(f"✅ Binary test successful! Version: {version}")
return True
else:
print(f"❌ Binary test failed: {test_result.stderr}")
return False
except subprocess.CalledProcessError as e:
print(f"❌ Failed to download yt-dlp binary: {e}")
return False
except Exception as e:
print(f"❌ Error downloading binary: {e}")
return False
def install_yt_dlp():
"""Install yt-dlp via pip."""
print("📦 Installing yt-dlp...")
try:
result = subprocess.run([sys.executable, "-m", "pip", "install", "yt-dlp"],
capture_output=True, text=True, check=True)
print("✅ yt-dlp installed successfully!")
return True
except subprocess.CalledProcessError as e:
print(f"❌ Failed to install yt-dlp: {e}")
return False
def test_installation():
"""Test the installation."""
print("\n🧪 Testing installation...")
# Test FFmpeg
if check_ffmpeg():
print("✅ FFmpeg is working!")
else:
print("❌ FFmpeg is not working")
return False
# Test yt-dlp
if check_yt_dlp():
print("✅ yt-dlp is working!")
else:
print("❌ yt-dlp is not working")
return False
return True
def main():
print("🍎 macOS Setup for Karaoke Video Downloader")
print("=" * 50)
# Check current status
print("🔍 Checking current installation...")
ffmpeg_installed = check_ffmpeg()
yt_dlp_installed = check_yt_dlp()
print(f"FFmpeg: {'✅ Installed' if ffmpeg_installed else '❌ Not installed'}")
print(f"yt-dlp: {'✅ Installed' if yt_dlp_installed else '❌ Not installed'}")
if ffmpeg_installed and yt_dlp_installed:
print("\n🎉 Everything is already installed and working!")
return
# Install missing components
print("\n🚀 Installing missing components...")
# Install FFmpeg if needed
if not ffmpeg_installed:
print("\n🎬 FFmpeg Installation Options:")
print("1. Install via Homebrew (recommended)")
print("2. Download from ffmpeg.org")
print("3. Skip FFmpeg installation")
choice = input("\nChoose an option (1-3): ").strip()
if choice == "1":
if not install_ffmpeg():
print("❌ FFmpeg installation failed")
return
elif choice == "2":
print("📥 Please download FFmpeg from: https://ffmpeg.org/download.html")
print(" Extract and add to your PATH, then run this script again.")
return
elif choice == "3":
print("⚠️ FFmpeg is required for video processing. Some features may not work.")
else:
print("❌ Invalid choice")
return
# Install yt-dlp if needed
if not yt_dlp_installed:
print("\n📦 yt-dlp Installation Options:")
print("1. Install via pip (recommended)")
print("2. Download binary file")
print("3. Skip yt-dlp installation")
choice = input("\nChoose an option (1-3): ").strip()
if choice == "1":
if not install_yt_dlp():
print("❌ yt-dlp installation failed")
return
elif choice == "2":
if not download_yt_dlp_binary():
print("❌ yt-dlp binary download failed")
return
elif choice == "3":
print("❌ yt-dlp is required for video downloading.")
return
else:
print("❌ Invalid choice")
return
# Test installation
if test_installation():
print("\n🎉 Setup completed successfully!")
print("You can now use the Karaoke Video Downloader on macOS.")
print("Run: python download_karaoke.py --help")
else:
print("\n❌ Setup failed. Please check the error messages above.")
if __name__ == "__main__":
main()


@ -0,0 +1,198 @@
#!/usr/bin/env python3
"""
Helper script to add manual videos to the manual videos collection.
"""
import json
import re
from pathlib import Path
from typing import Dict, List, Optional
from karaoke_downloader.data_path_manager import get_data_path_manager
def extract_video_id(url: str) -> Optional[str]:
"""Extract video ID from YouTube URL."""
patterns = [
r'(?:youtube\.com/watch\?v=|youtu\.be/|youtube\.com/embed/)([a-zA-Z0-9_-]{11})',
r'youtube\.com/watch\?.*v=([a-zA-Z0-9_-]{11})'
]
for pattern in patterns:
match = re.search(pattern, url)
if match:
return match.group(1)
return None
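A quick check of the patterns above against the URL shapes they are meant to cover (short `youtu.be` links and watch URLs where `v=` is not the first query parameter):

```python
import re
from typing import Optional

PATTERNS = [
    r'(?:youtube\.com/watch\?v=|youtu\.be/|youtube\.com/embed/)([a-zA-Z0-9_-]{11})',
    r'youtube\.com/watch\?.*v=([a-zA-Z0-9_-]{11})',
]

def extract_video_id(url: str) -> Optional[str]:
    # First pattern handles the common shapes; the second catches
    # watch URLs where v= appears after other query parameters.
    for pattern in PATTERNS:
        m = re.search(pattern, url)
        if m:
            return m.group(1)
    return None

print(extract_video_id("https://youtu.be/dQw4w9WgXcQ"))                      # dQw4w9WgXcQ
print(extract_video_id("https://www.youtube.com/watch?t=10&v=dQw4w9WgXcQ"))  # dQw4w9WgXcQ
```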
def add_manual_video(title: str, url: str, manual_file: Optional[str] = None):
"""
Add a manual video to the collection.
Args:
title: Video title (e.g., "Artist - Song (Karaoke Version)")
url: YouTube URL
manual_file: Path to manual videos JSON file
"""
if manual_file is None:
manual_file = str(get_data_path_manager().get_manual_videos_path())
manual_path = Path(manual_file)
# Load existing data or create new
if manual_path.exists():
with open(manual_path, 'r', encoding='utf-8') as f:
data = json.load(f)
else:
data = {
"channel_name": "@ManualVideos",
"channel_url": "manual://static",
"description": "Manual collection of individual karaoke videos",
"videos": [],
"parsing_rules": {
"format": "artist_title_separator",
"separator": " - ",
"artist_first": True,
"title_cleanup": {
"remove_suffix": {
"suffixes": ["(Karaoke)", "(Karaoke Version)", "(Karaoke Version) Lyrics"]
}
}
}
}
# Extract video ID
video_id = extract_video_id(url)
if not video_id:
print(f"❌ Could not extract video ID from URL: {url}")
return False
# Check if video already exists
existing_ids = [video.get("id") for video in data["videos"]]
if video_id in existing_ids:
print(f"⚠️ Video already exists: {title}")
return False
# Add new video
new_video = {
"title": title,
"url": url,
"id": video_id,
"upload_date": "2024-01-01", # Default date
"duration": 180, # Default duration
"view_count": 1000 # Default view count
}
data["videos"].append(new_video)
# Save updated data
manual_path.parent.mkdir(parents=True, exist_ok=True)
with open(manual_path, 'w', encoding='utf-8') as f:
json.dump(data, f, indent=2, ensure_ascii=False)
print(f"✅ Added video: {title}")
print(f" URL: {url}")
print(f" ID: {video_id}")
return True
def list_manual_videos(manual_file: Optional[str] = None):
"""List all manual videos."""
if manual_file is None:
manual_file = str(get_data_path_manager().get_manual_videos_path())
manual_path = Path(manual_file)
if not manual_path.exists():
print("❌ No manual videos file found")
return
with open(manual_path, 'r', encoding='utf-8') as f:
data = json.load(f)
print(f"📋 Manual Videos ({len(data['videos'])} videos):")
print("=" * 60)
for i, video in enumerate(data['videos'], 1):
print(f"{i:2d}. {video['title']}")
print(f" URL: {video['url']}")
print(f" ID: {video['id']}")
print()
def remove_manual_video(video_id: str, manual_file: Optional[str] = None):
"""Remove a manual video by ID."""
if manual_file is None:
manual_file = str(get_data_path_manager().get_manual_videos_path())
manual_path = Path(manual_file)
if not manual_path.exists():
print("❌ No manual videos file found")
return False
with open(manual_path, 'r', encoding='utf-8') as f:
data = json.load(f)
# Find and remove video
for i, video in enumerate(data['videos']):
if video['id'] == video_id:
removed_video = data['videos'].pop(i)
with open(manual_path, 'w', encoding='utf-8') as f:
json.dump(data, f, indent=2, ensure_ascii=False)
print(f"✅ Removed video: {removed_video['title']}")
return True
print(f"❌ Video with ID '{video_id}' not found")
return False
def main():
"""Interactive mode for adding manual videos."""
print("🎤 Manual Video Manager")
print("=" * 30)
print("1. Add video")
print("2. List videos")
print("3. Remove video")
print("4. Exit")
while True:
choice = input("\nSelect option (1-4): ").strip()
if choice == "1":
title = input("Enter video title (e.g., 'Artist - Song (Karaoke Version)'): ").strip()
url = input("Enter YouTube URL: ").strip()
if title and url:
add_manual_video(title, url)
else:
print("❌ Title and URL are required")
elif choice == "2":
list_manual_videos()
elif choice == "3":
video_id = input("Enter video ID to remove: ").strip()
if video_id:
remove_manual_video(video_id)
else:
print("❌ Video ID is required")
elif choice == "4":
print("👋 Goodbye!")
break
else:
print("❌ Invalid option")
if __name__ == "__main__":
import sys
if len(sys.argv) > 1:
# Command line mode
if sys.argv[1] == "add" and len(sys.argv) >= 4:
add_manual_video(sys.argv[2], sys.argv[3])
elif sys.argv[1] == "list":
list_manual_videos()
elif sys.argv[1] == "remove" and len(sys.argv) >= 3:
remove_manual_video(sys.argv[2])
else:
print("Usage:")
print(" python add_manual_video.py add 'Title' 'URL'")
print(" python add_manual_video.py list")
print(" python add_manual_video.py remove VIDEO_ID")
else:
# Interactive mode
main()


@ -0,0 +1,127 @@
#!/usr/bin/env python3
"""
Script to build channel cache from raw yt-dlp output file.
This uses the fixed parsing logic to handle titles with | characters.
"""
import json
import re
from datetime import datetime
from pathlib import Path
from karaoke_downloader.data_path_manager import get_data_path_manager
def parse_raw_output_file(raw_file_path):
"""Parse the raw output file and extract valid videos."""
videos = []
invalid_count = 0
print(f"🔍 Parsing raw output file: {raw_file_path}")
with open(raw_file_path, 'r', encoding='utf-8') as f:
lines = f.readlines()
# Skip header lines (lines starting with #)
data_lines = [line for line in lines if not line.strip().startswith('#') and line.strip()]
print(f"📄 Found {len(data_lines)} data lines to process")
for i, line in enumerate(data_lines):
if i % 1000 == 0 and i > 0: # Progress indicator every 1000 lines
print(f"📊 Processing line {i}/{len(data_lines)}... ({i/len(data_lines)*100:.1f}%)")
# Remove line number prefix (e.g., " 1234: ")
line = re.sub(r'^\s*\d+:\s*', '', line.strip())
# More robust parsing that handles titles with | characters
# Extract video ID directly from the URL that yt-dlp provides
# Find the URL and extract video ID from it
url_match = re.search(r'https://www\.youtube\.com/watch\?v=([a-zA-Z0-9_-]{11})', line)
if not url_match:
invalid_count += 1
if invalid_count <= 5:
print(f"⚠️ Skipping line with no URL: '{line[:100]}...'")
elif invalid_count == 6:
print("⚠️ Suppressing further warnings about lines with no URL...")
continue
# Extract video ID directly from the URL
video_id = url_match.group(1)
# Extract title (everything before the URL in the line)
title = line[:url_match.start()].rstrip('| ').strip()
# Validate video ID
if video_id and (
len(video_id) == 11 and
video_id.replace('-', '').replace('_', '').isalnum() and
" " not in video_id and
"Lyrics" not in video_id and
"KARAOKE" not in video_id.upper() and
"Vocal" not in video_id and
"Guide" not in video_id
):
videos.append({"title": title, "id": video_id})
else:
invalid_count += 1
if invalid_count <= 5: # Only show first 5 invalid IDs
print(f"⚠️ Skipping invalid video ID: '{video_id}' for title: '{title[:50]}...'")
elif invalid_count == 6:
print("⚠️ Suppressing further invalid-ID warnings...")
print(f"✅ Parsed {len(videos)} valid videos from raw output")
print(f"⚠️ Skipped {invalid_count} invalid video IDs")
return videos
def save_cache_file(channel_id, videos, cache_dir=None):
"""Save the parsed videos to a cache file."""
if cache_dir is None:
cache_dir = str(get_data_path_manager().get_channel_cache_dir())
cache_dir = Path(cache_dir)
cache_dir.mkdir(parents=True, exist_ok=True)
# Sanitize channel ID for filename
safe_channel_id = re.sub(r'[<>:"/\\|?*]', '_', channel_id)
cache_file = cache_dir / f"{safe_channel_id}.json"
data = {
'channel_id': channel_id,
'videos': videos,
'last_updated': datetime.now().isoformat(),
'video_count': len(videos)
}
with open(cache_file, 'w', encoding='utf-8') as f:
json.dump(data, f, indent=2, ensure_ascii=False)
print(f"💾 Saved cache to: {cache_file.name}")
return cache_file
def main():
"""Main function to build cache from raw output."""
data_path_manager = get_data_path_manager()
raw_file_path = data_path_manager.get_channel_cache_dir() / "@VocalStarKaraoke_raw_output.txt"
if not raw_file_path.exists():
print(f"❌ Raw output file not found: {raw_file_path}")
return
# Parse the raw output file
videos = parse_raw_output_file(raw_file_path)
if not videos:
print("❌ No valid videos found")
return
# Save to cache file
channel_id = "@VocalStarKaraoke"
cache_file = save_cache_file(channel_id, videos)
print(f"🎉 Cache build complete!")
print(f"📊 Total videos in cache: {len(videos)}")
print(f"📁 Cache file: {cache_file}")
if __name__ == "__main__":
main()
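The parsing approach above — anchoring on the URL and deriving the video ID from it, rather than splitting on `|` — can be sketched in isolation. The sample line below is invented; only the regex mirrors the script:

```python
import re

def parse_line(line: str):
    """Return (title, video_id) from a 'title | url' line, or None if no URL."""
    url_match = re.search(
        r'https://www\.youtube\.com/watch\?v=([a-zA-Z0-9_-]{11})', line)
    if not url_match:
        return None
    # Everything before the URL is the title; trim the trailing separator.
    title = line[:url_match.start()].rstrip('| ').strip()
    return title, url_match.group(1)

sample = "Queen - Don't Stop Me Now | Karaoke | https://www.youtube.com/watch?v=abcdefghijk"
print(parse_line(sample))  # ("Queen - Don't Stop Me Now | Karaoke", 'abcdefghijk')
```

Because the ID is taken from the URL itself, any number of `|` characters in the title is harmless.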


@ -0,0 +1,164 @@
#!/usr/bin/env python3
"""
Utility script to identify and clean up duplicate files with (2), (3) suffixes.
This helps clean up files that were created before the duplicate prevention was implemented.
"""
import json
import re
from pathlib import Path
from typing import Dict, List, Tuple
def find_duplicate_files(downloads_dir: str = "downloads") -> Dict[str, List[Tuple[Path, int]]]:
"""
Find duplicate files with (2), (3), etc. suffixes in the downloads directory.
Args:
downloads_dir: Path to downloads directory
Returns:
Dictionary mapping base filenames to lists of duplicate files
"""
downloads_path = Path(downloads_dir)
if not downloads_path.exists():
print(f"❌ Downloads directory not found: {downloads_dir}")
return {}
duplicates = {}
# Scan all MP4 files in the downloads directory
for mp4_file in downloads_path.rglob("*.mp4"):
filename = mp4_file.name
# Check if this is a duplicate file with (2), (3), etc.
match = re.match(r'^(.+?)\s*\((\d+)\)\.mp4$', filename)
if match:
base_name = match.group(1)
suffix_num = int(match.group(2))
if base_name not in duplicates:
duplicates[base_name] = []
duplicates[base_name].append((mp4_file, suffix_num))
# Sort duplicates by suffix number
for base_name in duplicates:
duplicates[base_name].sort(key=lambda x: x[1])
return duplicates
def analyze_duplicates(duplicates: Dict[str, List[Tuple[Path, int]]]) -> None:
"""
Analyze and display information about found duplicates.
Args:
duplicates: Dictionary of duplicate files
"""
if not duplicates:
print("✅ No duplicate files found!")
return
print(f"🔍 Found {len(duplicates)} sets of duplicate files:")
print()
total_duplicates = 0
for base_name, files in duplicates.items():
print(f"📁 {base_name}")
for file_path, suffix in files:
file_size = file_path.stat().st_size / (1024 * 1024) # MB
print(f" ({suffix}) {file_path.name} - {file_size:.1f} MB")
print()
total_duplicates += len(files) - 1 # -1 because we keep the original
print(f"📊 Summary: {len(duplicates)} base files with {total_duplicates} duplicate files")
def cleanup_duplicates(duplicates: Dict[str, List[Tuple[Path, int]]], dry_run: bool = True) -> None:
"""
Clean up duplicate files, keeping only the first occurrence.
Args:
duplicates: Dictionary of duplicate files
dry_run: If True, only show what would be deleted without actually deleting
"""
if not duplicates:
print("✅ No duplicates to clean up!")
return
mode = "DRY RUN" if dry_run else "ACTUAL CLEANUP"
print(f"🧹 Starting {mode}...")
print()
total_deleted = 0
total_size_freed = 0
for base_name, files in duplicates.items():
print(f"📁 Processing: {base_name}")
# Keep the first file (lowest suffix number), delete the rest
files_to_delete = files[1:] # Skip the first file
for file_path, suffix in files_to_delete:
file_size = file_path.stat().st_size / (1024 * 1024) # MB
if dry_run:
print(f" 🗑️ Would delete: {file_path.name} ({file_size:.1f} MB)")
else:
try:
file_path.unlink()
print(f" ✅ Deleted: {file_path.name} ({file_size:.1f} MB)")
total_deleted += 1
total_size_freed += file_size
except Exception as e:
print(f" ❌ Failed to delete {file_path.name}: {e}")
print()
if dry_run:
print(f"📊 DRY RUN SUMMARY: Would delete {len([f for files in duplicates.values() for f in files[1:]])} files")
else:
print(f"📊 CLEANUP SUMMARY: Deleted {total_deleted} files, freed {total_size_freed:.1f} MB")
def main():
"""Main function to run the duplicate file cleanup."""
print("🎵 Karaoke Video Downloader - Duplicate File Cleanup")
print("=" * 50)
print()
# Find duplicates
duplicates = find_duplicate_files()
if not duplicates:
print("✅ No duplicate files found!")
return
# Analyze duplicates
analyze_duplicates(duplicates)
print()
# Ask user what to do
while True:
print("Options:")
print("1. Dry run (show what would be deleted)")
print("2. Actually delete duplicate files")
print("3. Exit without doing anything")
choice = input("\nEnter your choice (1-3): ").strip()
if choice == "1":
cleanup_duplicates(duplicates, dry_run=True)
break
elif choice == "2":
confirm = input("⚠️ Are you sure you want to delete duplicate files? (yes/no): ").strip().lower()
if confirm in ["yes", "y"]:
cleanup_duplicates(duplicates, dry_run=False)
else:
print("❌ Cleanup cancelled.")
break
elif choice == "3":
print("❌ Exiting without cleanup.")
break
else:
print("❌ Invalid choice. Please enter 1, 2, or 3.")
if __name__ == "__main__":
main()
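The suffix pattern used above to detect copies can be exercised on its own; the filenames here are made up:

```python
import re

def split_duplicate_suffix(filename: str):
    """Return (base_name, copy_number) for names like 'Song (2).mp4', else None."""
    match = re.match(r'^(.+?)\s*\((\d+)\)\.mp4$', filename)
    if not match:
        return None
    return match.group(1), int(match.group(2))

print(split_duplicate_suffix("Artist - Song (2).mp4"))  # ('Artist - Song', 2)
print(split_duplicate_suffix("Artist - Song.mp4"))      # None
```

The lazy `(.+?)` plus `\s*` keeps the trailing space out of the base name, so `"Song (2).mp4"` and `"Song (3).mp4"` group under the same key.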


@ -2,7 +2,11 @@ import json
 from pathlib import Path
 from datetime import datetime, time
-def cleanup_recent_tracking(tracking_path="data/songlist_tracking.json", cutoff_time_str="11:00"):
+from karaoke_downloader.data_path_manager import get_data_path_manager
+def cleanup_recent_tracking(tracking_path=None, cutoff_time_str="11:00"):
+if tracking_path is None:
+tracking_path = str(get_data_path_manager().get_songlist_tracking_path())
 """Remove entries from songlist_tracking.json that were added after the specified time today."""
 tracking_file = Path(tracking_path)
 if not tracking_file.exists():


@ -0,0 +1,465 @@
#!/usr/bin/env python3
"""
Fix artist name formatting for Let's Sing Karaoke channel.
This script specifically targets the "Last Name, First Name" format and converts it to
"First Name Last Name" format in ID3 tags. It only processes entries where there is exactly one comma
followed by exactly 2 words, to avoid affecting multi-artist entries.
Usage:
python fix_artist_name_format.py --preview # Show what would be changed
python fix_artist_name_format.py --apply # Actually make the changes
python fix_artist_name_format.py --external "D:\Karaoke\Karaoke\MP4\Let's Sing Karaoke" # Use external directory
"""
import json
import os
import re
import shutil
import argparse
from pathlib import Path
from typing import Dict, List, Tuple, Optional
# Try to import mutagen for ID3 tag manipulation
try:
from mutagen.mp4 import MP4
MUTAGEN_AVAILABLE = True
except ImportError:
MUTAGEN_AVAILABLE = False
print("⚠️ mutagen not available - install with: pip install mutagen")
def is_lastname_firstname_format(artist_name: str) -> bool:
"""
Check if artist name is in "Last Name, First Name" format.
Args:
artist_name: The artist name to check
Returns:
True if the name matches "Last Name, First Name" format with exactly 2 words after comma
"""
if ',' not in artist_name:
return False
# Split by comma
parts = artist_name.split(',', 1)
if len(parts) != 2:
return False
last_name = parts[0].strip()
first_name_part = parts[1].strip()
# Check if there are exactly 2 words after the comma
words_after_comma = first_name_part.split()
if len(words_after_comma) != 2:
return False
# Additional check: make sure it's not a multi-artist entry
# If there are more than 4 words total in the artist name, it might be multi-artist
total_words = len(artist_name.split())
if total_words > 4: # Last, First Name (4 words max for single artist)
return False
return True
def convert_to_firstname_lastname(artist_name: str) -> str:
"""
Convert "Last Name, First Name" to "First Name Last Name".
Args:
artist_name: Artist name in "Last Name, First Name" format
Returns:
Artist name in "First Name Last Name" format
"""
parts = artist_name.split(',', 1)
last_name = parts[0].strip()
first_name_part = parts[1].strip()
# Split the first name part into words
words = first_name_part.split()
if len(words) == 2:
first_name = words[0]
middle_name = words[1]
return f"{first_name} {middle_name} {last_name}"
else:
# Fallback - just reverse the parts
return f"{first_name_part} {last_name}"
def extract_artist_title_from_filename(filename: str) -> Tuple[str, str]:
"""
Extract artist and title from a filename.
Args:
filename: MP4 filename (without extension)
Returns:
Tuple of (artist, title)
"""
# Remove .mp4 extension
if filename.endswith('.mp4'):
filename = filename[:-4]
# Look for " - " separator
if " - " in filename:
parts = filename.split(" - ", 1)
return parts[0].strip(), parts[1].strip()
return "", filename
def update_id3_tags(file_path: str, new_artist: str, apply_changes: bool = False) -> bool:
"""
Update the ID3 tags in an MP4 file.
Args:
file_path: Path to the MP4 file
new_artist: New artist name to set
apply_changes: Whether to actually apply changes or just preview
Returns:
True if successful, False otherwise
"""
if not MUTAGEN_AVAILABLE:
print(f"⚠️ mutagen not available - cannot update ID3 tags for {file_path}")
return False
try:
mp4 = MP4(file_path)
if apply_changes:
# Update the artist tag
mp4["\xa9ART"] = new_artist
mp4.save()
print(f"📝 Updated ID3 tag: {os.path.basename(file_path)} → Artist: '{new_artist}'")
else:
# Just preview what would be changed
current_artist = mp4.get("\xa9ART", ["Unknown"])[0] if "\xa9ART" in mp4 else "Unknown"
print(f"📝 Would update ID3 tag: {os.path.basename(file_path)} → Artist: '{current_artist}' → '{new_artist}'")
return True
except Exception as e:
print(f"❌ Failed to update ID3 tags for {file_path}: {e}")
return False
def scan_external_directory(directory_path: str) -> List[Dict]:
"""
Scan external directory for MP4 files with "Last Name, First Name" format in ID3 tags.
Args:
directory_path: Path to the external directory
Returns:
List of files that need ID3 tag updates
"""
if not os.path.exists(directory_path):
print(f"❌ Directory not found: {directory_path}")
return []
if not MUTAGEN_AVAILABLE:
print("❌ mutagen not available - cannot scan ID3 tags")
return []
files_to_update = []
# Scan for MP4 files
for file_path in Path(directory_path).glob("*.mp4"):
try:
mp4 = MP4(str(file_path))
current_artist = mp4.get("\xa9ART", ["Unknown"])[0] if "\xa9ART" in mp4 else "Unknown"
if current_artist and is_lastname_firstname_format(current_artist):
new_artist = convert_to_firstname_lastname(current_artist)
files_to_update.append({
'file_path': str(file_path),
'filename': file_path.name,
'old_artist': current_artist,
'new_artist': new_artist
})
except Exception as e:
print(f"⚠️ Could not read ID3 tags from {file_path.name}: {e}")
return files_to_update
def update_tracking_file(tracking_file: str, channel_name: str = "Let's Sing Karaoke", apply_changes: bool = False) -> Tuple[int, List[Dict]]:
"""
Update the karaoke tracking file to fix artist name formatting.
Args:
tracking_file: Path to the tracking JSON file
channel_name: Channel name to target (default: Let's Sing Karaoke)
apply_changes: Whether to actually apply changes or just preview
Returns:
Tuple of (number of changes made, list of changed entries)
"""
if not os.path.exists(tracking_file):
print(f"❌ Tracking file not found: {tracking_file}")
return 0, []
# Load the tracking data
with open(tracking_file, 'r', encoding='utf-8') as f:
data = json.load(f)
changes_made = 0
changed_entries = []
# Process songs
for song_key, song_data in data.get('songs', {}).items():
if song_data.get('channel_name') != channel_name:
continue
artist = song_data.get('artist', '')
if not artist or not is_lastname_firstname_format(artist):
continue
# Convert the artist name
new_artist = convert_to_firstname_lastname(artist)
if apply_changes:
# Update the tracking data
song_data['artist'] = new_artist
# Update the video title if it exists and contains the old artist name
video_title = song_data.get('video_title', '')
if video_title and artist in video_title:
song_data['video_title'] = video_title.replace(artist, new_artist)
# Update the file path if it exists
file_path = song_data.get('file_path', '')
if file_path and artist in file_path:
song_data['file_path'] = file_path.replace(artist, new_artist)
changes_made += 1
changed_entries.append({
'song_key': song_key,
'old_artist': artist,
'new_artist': new_artist,
'title': song_data.get('title', ''),
'file_path': song_data.get('file_path', '')
})
print(f"🔄 {'Updated' if apply_changes else 'Would update'}: '{artist}' → '{new_artist}' ({song_data.get('title', '')})")
# Save the updated data
if apply_changes and changes_made > 0:
# Create backup
backup_file = f"{tracking_file}.backup"
shutil.copy2(tracking_file, backup_file)
print(f"💾 Created backup: {backup_file}")
# Save updated file
with open(tracking_file, 'w', encoding='utf-8') as f:
json.dump(data, f, indent=2, ensure_ascii=False)
print(f"💾 Updated tracking file: {tracking_file}")
return changes_made, changed_entries
def update_songlist_tracking(songlist_file: str, channel_name: str = "Let's Sing Karaoke", apply_changes: bool = False) -> Tuple[int, List[Dict]]:
"""
Update the songlist tracking file to fix artist name formatting.
Args:
songlist_file: Path to the songlist tracking JSON file
channel_name: Channel name to target (default: Let's Sing Karaoke)
apply_changes: Whether to actually apply changes or just preview
Returns:
Tuple of (number of changes made, list of changed entries)
"""
if not os.path.exists(songlist_file):
print(f"❌ Songlist tracking file not found: {songlist_file}")
return 0, []
# Load the songlist data
with open(songlist_file, 'r', encoding='utf-8') as f:
data = json.load(f)
changes_made = 0
changed_entries = []
# Process songlist entries
for song_key, song_data in data.items():
artist = song_data.get('artist', '')
if not artist or not is_lastname_firstname_format(artist):
continue
# Convert the artist name
new_artist = convert_to_firstname_lastname(artist)
if apply_changes:
# Update the songlist data
song_data['artist'] = new_artist
changes_made += 1
changed_entries.append({
'song_key': song_key,
'old_artist': artist,
'new_artist': new_artist,
'title': song_data.get('title', '')
})
print(f"🔄 {'Updated' if apply_changes else 'Would update'} songlist: '{artist}' → '{new_artist}' ({song_data.get('title', '')})")
# Save the updated data
if apply_changes and changes_made > 0:
# Create backup
backup_file = f"{songlist_file}.backup"
shutil.copy2(songlist_file, backup_file)
print(f"💾 Created backup: {backup_file}")
# Save updated file
with open(songlist_file, 'w', encoding='utf-8') as f:
json.dump(data, f, indent=2, ensure_ascii=False)
print(f"💾 Updated songlist file: {songlist_file}")
return changes_made, changed_entries
def update_id3_tags_for_files(files_to_update: List[Dict], apply_changes: bool = False) -> int:
"""
Update ID3 tags for a list of files.
Args:
files_to_update: List of files to update
apply_changes: Whether to actually apply changes or just preview
Returns:
Number of files successfully updated
"""
updated_count = 0
for file_info in files_to_update:
file_path = file_info['file_path']
new_artist = file_info['new_artist']
if update_id3_tags(file_path, new_artist, apply_changes):
updated_count += 1
return updated_count
def main():
"""Main function to run the artist name fix script."""
parser = argparse.ArgumentParser(description="Fix artist name formatting in ID3 tags for Let's Sing Karaoke")
parser.add_argument('--preview', action='store_true', help='Show what would be changed without making changes')
parser.add_argument('--apply', action='store_true', help='Actually apply the changes')
parser.add_argument('--external', type=str, help='Path to external karaoke directory')
args = parser.parse_args()
# Default to preview mode if no action specified
if not args.preview and not args.apply:
args.preview = True
print("🎤 Artist Name Format Fix Script (ID3 Tags Only)")
print("=" * 60)
print("This script will fix 'Last Name, First Name' format to 'First Name Last Name'")
print("Only targeting Let's Sing Karaoke channel to avoid affecting other channels.")
print("Focusing on ID3 tags only - filenames will not be changed.")
print()
if not MUTAGEN_AVAILABLE:
print("❌ mutagen library not available!")
print("Please install it with: pip install mutagen")
return
if args.preview:
print("🔍 PREVIEW MODE - No changes will be made")
else:
print("⚡ APPLY MODE - Changes will be made")
print()
# File paths
tracking_file = "data/karaoke_tracking.json"
songlist_file = "data/songlist_tracking.json"
# Process external directory if specified
if args.external:
print(f"📁 Scanning external directory: {args.external}")
external_files = scan_external_directory(args.external)
if external_files:
print(f"\n📋 Found {len(external_files)} files with 'Last Name, First Name' format in ID3 tags:")
for file_info in external_files:
print(f" • {file_info['filename']}: '{file_info['old_artist']}' → '{file_info['new_artist']}'")
if args.apply:
print(f"\n📝 Updating ID3 tags in external files...")
updated_count = update_id3_tags_for_files(external_files, apply_changes=True)
print(f"✅ Updated ID3 tags in {updated_count} external files")
else:
print(f"\n📝 Would update ID3 tags in {len(external_files)} external files")
else:
print("✅ No files with 'Last Name, First Name' format found in ID3 tags")
# Process tracking files (only if they exist in current project)
if os.path.exists(tracking_file):
print(f"\n📊 Processing karaoke tracking file...")
tracking_changes, tracking_entries = update_tracking_file(tracking_file, apply_changes=args.apply)
else:
print(f"\n⚠️ Tracking file not found: {tracking_file}")
tracking_changes = 0
if os.path.exists(songlist_file):
print(f"\n📊 Processing songlist tracking file...")
songlist_changes, songlist_entries = update_songlist_tracking(songlist_file, apply_changes=args.apply)
else:
print(f"\n⚠️ Songlist tracking file not found: {songlist_file}")
songlist_changes = 0
# Process local downloads directory ID3 tags
downloads_dir = "downloads"
local_id3_updates = 0
if os.path.exists(downloads_dir) and tracking_changes > 0:
print(f"\n📝 Processing ID3 tags in local downloads directory...")
# Scan local downloads for files that need ID3 tag updates
local_files = []
for entry in tracking_entries:
file_path = entry.get('file_path', '')
if file_path and os.path.exists(file_path.replace('\\', '/')):
local_files.append({
'file_path': file_path.replace('\\', '/'),
'filename': os.path.basename(file_path),
'old_artist': entry['old_artist'],
'new_artist': entry['new_artist']
})
if local_files:
local_id3_updates = update_id3_tags_for_files(local_files, apply_changes=args.apply)
total_changes = tracking_changes + songlist_changes
print("\n" + "=" * 60)
print("📋 Summary:")
print(f" • Tracking file changes: {tracking_changes}")
print(f" • Songlist file changes: {songlist_changes}")
print(f" • Local ID3 tag updates: {local_id3_updates}")
print(f" • Total changes: {total_changes}")
if args.external:
external_count = len(external_files)  # already scanned above; no need to re-scan
print(f" • External ID3 tag updates: {external_count}")
if total_changes > 0 or (args.external and external_count > 0):
if args.apply:
print("\n✅ Artist name formatting in ID3 tags has been fixed!")
print("💾 Backups have been created for all modified files.")
print("🔄 You may need to re-run your karaoke downloader to update any cached data.")
else:
print("\n🔍 Preview complete. Use --apply to make these changes.")
else:
print("\n✅ No changes needed! All artist names are already in the correct format.")
if __name__ == "__main__":
main()
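The matching rule described in the script header — one comma, exactly two words after it, at most four words overall — can be checked in isolation. The helper name and the sample names below are invented:

```python
def looks_like_lastname_firstname(name: str) -> bool:
    """One comma, exactly two words after it, at most four words overall."""
    if ',' not in name:
        return False
    _, after = name.split(',', 1)
    return len(after.strip().split()) == 2 and len(name.split()) <= 4

print(looks_like_lastname_firstname("Presley, Elvis Aaron"))  # True
print(looks_like_lastname_firstname("Earth, Wind & Fire"))    # False
```

The word-count guards are what keep band names and multi-artist entries from being reversed by mistake.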


@ -0,0 +1,295 @@
#!/usr/bin/env python3
"""
Fix artist name formatting for Let's Sing Karaoke channel.
This script specifically targets the "Last Name, First Name" format and converts it to
"First Name Last Name" format in ID3 tags. It only processes entries where there is exactly one comma
followed by exactly 2 words, to avoid affecting multi-artist entries.
Usage:
python fix_artist_name_format_simple.py --preview # Show what would be changed
python fix_artist_name_format_simple.py --apply # Actually make the changes
python fix_artist_name_format_simple.py --external "D:\Karaoke\Karaoke\MP4\Let's Sing Karaoke" # Use external directory
"""
import json
import os
import re
import shutil
import argparse
from pathlib import Path
from typing import Dict, List, Tuple, Optional
# Try to import mutagen for ID3 tag manipulation
try:
from mutagen.mp4 import MP4
MUTAGEN_AVAILABLE = True
except ImportError:
MUTAGEN_AVAILABLE = False
print("WARNING: mutagen not available - install with: pip install mutagen")
def is_lastname_firstname_format(artist_name: str) -> bool:
"""
Check if artist name is in "Last Name, First Name" format.
Args:
artist_name: The artist name to check
Returns:
True if the name matches "Last Name, First Name" format with exactly 1 or 2 words after comma
"""
if ',' not in artist_name:
return False
# Split by comma
parts = artist_name.split(',', 1)
if len(parts) != 2:
return False
last_name = parts[0].strip()
first_name_part = parts[1].strip()
# Check if there are exactly 1 or 2 words after the comma
words_after_comma = first_name_part.split()
if len(words_after_comma) not in [1, 2]:
return False
# Additional check: make sure it's not a multi-artist entry
# If there are more than 4 words total in the artist name, it might be multi-artist
total_words = len(artist_name.split())
if total_words > 4: # Last, First Name (4 words max for single artist)
return False
return True
def convert_lastname_firstname(artist_name: str) -> str:
"""
Convert "Last Name, First Name" to "First Name Last Name".
Args:
artist_name: The artist name to convert
Returns:
The converted artist name
"""
if ',' not in artist_name:
return artist_name
parts = artist_name.split(',', 1)
if len(parts) != 2:
return artist_name
last_name = parts[0].strip()
first_name = parts[1].strip()
return f"{first_name} {last_name}"
def process_artist_name(artist_name: str) -> str:
"""
Process an artist name, handling both single artists and multiple artists separated by "&".
Args:
artist_name: The artist name to process
Returns:
The processed artist name
"""
if '&' in artist_name:
# Split by "&" and process each artist individually
artists = [artist.strip() for artist in artist_name.split('&')]
processed_artists = []
for artist in artists:
if is_lastname_firstname_format(artist):
processed_artist = convert_lastname_firstname(artist)
processed_artists.append(processed_artist)
else:
processed_artists.append(artist)
# Rejoin with "&"
return ' & '.join(processed_artists)
else:
# Single artist
if is_lastname_firstname_format(artist_name):
return convert_lastname_firstname(artist_name)
else:
return artist_name
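A minimal sketch of the `&` handling above, omitting the word-count guard for brevity (function names and sample artists are invented):

```python
def convert(name: str) -> str:
    """'Last, First' -> 'First Last'; names without a comma pass through."""
    if ',' not in name:
        return name
    last, first = [p.strip() for p in name.split(',', 1)]
    return f"{first} {last}"

def process(name: str) -> str:
    """Apply the conversion to each '&'-separated artist independently."""
    return ' & '.join(convert(part.strip()) for part in name.split('&'))

print(process("Smith, John & Doe, Jane"))  # John Smith & Jane Doe
print(process("The Beatles"))              # The Beatles
```

Splitting on `&` first means each collaborator is normalized on its own, then the list is rejoined in the original order.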
def update_id3_tags(file_path: str, new_artist: str, apply_changes: bool = False) -> bool:
"""
Update the ID3 tags in an MP4 file.
Args:
file_path: Path to the MP4 file
new_artist: New artist name to set
apply_changes: Whether to actually apply changes or just preview
Returns:
True if successful, False otherwise
"""
if not MUTAGEN_AVAILABLE:
print(f"WARNING: mutagen not available - cannot update ID3 tags for {file_path}")
return False
try:
mp4 = MP4(file_path)
if apply_changes:
# Update the artist tag
mp4["\xa9ART"] = new_artist
mp4.save()
print(f"UPDATED ID3 tag: {os.path.basename(file_path)} -> Artist: '{new_artist}'")
else:
# Just preview what would be changed
current_artist = mp4.get("\xa9ART", ["Unknown"])[0] if "\xa9ART" in mp4 else "Unknown"
print(f"WOULD UPDATE ID3 tag: {os.path.basename(file_path)} -> Artist: '{current_artist}' -> '{new_artist}'")
return True
except Exception as e:
print(f"ERROR: Failed to update ID3 tags for {file_path}: {e}")
return False
def scan_external_directory(directory_path: str, debug: bool = False) -> List[Dict]:
"""
Scan external directory for MP4 files with "Last Name, First Name" format in ID3 tags.
Args:
directory_path: Path to the external directory
debug: Whether to show debug information
Returns:
List of files that need ID3 tag updates
"""
if not os.path.exists(directory_path):
print(f"ERROR: Directory not found: {directory_path}")
return []
if not MUTAGEN_AVAILABLE:
print("ERROR: mutagen not available - cannot scan ID3 tags")
return []
files_to_update = []
total_files = 0
files_with_artist_tags = 0
# Scan for MP4 files
for file_path in Path(directory_path).glob("*.mp4"):
total_files += 1
try:
mp4 = MP4(str(file_path))
current_artist = mp4.get("\xa9ART", ["Unknown"])[0] if "\xa9ART" in mp4 else "Unknown"
if current_artist != "Unknown":
files_with_artist_tags += 1
if debug:
print(f"DEBUG: {file_path.name} -> Artist: '{current_artist}'")
# Process the artist name to handle multiple artists
processed_artist = process_artist_name(current_artist)
if processed_artist != current_artist:
files_to_update.append({
'file_path': str(file_path),
'filename': file_path.name,
'old_artist': current_artist,
'new_artist': processed_artist
})
if debug:
print(f"DEBUG: MATCH FOUND - {file_path.name}: '{current_artist}' -> '{processed_artist}'")
except Exception as e:
if debug:
print(f"WARNING: Could not read ID3 tags from {file_path.name}: {e}")
print(f"INFO: Scanned {total_files} MP4 files, {files_with_artist_tags} had artist tags, {len(files_to_update)} need updates")
return files_to_update
def update_id3_tags_for_files(files_to_update: List[Dict], apply_changes: bool = False) -> int:
"""
Update ID3 tags for a list of files.
Args:
files_to_update: List of files to update
apply_changes: Whether to actually apply changes or just preview
Returns:
Number of files successfully updated
"""
updated_count = 0
for file_info in files_to_update:
file_path = file_info['file_path']
new_artist = file_info['new_artist']
if update_id3_tags(file_path, new_artist, apply_changes):
updated_count += 1
return updated_count
def main():
"""Main function to run the artist name fix script."""
    parser = argparse.ArgumentParser(description="Fix artist name formatting in ID3 tags for Let's Sing Karaoke")
    parser.add_argument('--preview', action='store_true', help='Show what would be changed without making changes')
    parser.add_argument('--apply', action='store_true', help='Actually apply the changes')
    parser.add_argument('--external', type=str, help='Path to external karaoke directory')
    parser.add_argument('--debug', action='store_true', help='Show debug information')
    args = parser.parse_args()
    # Default to preview mode if no action specified
    if not args.preview and not args.apply:
        args.preview = True
    print("Artist Name Format Fix Script (ID3 Tags Only)")
    print("=" * 60)
    print("This script will fix 'Last Name, First Name' format to 'First Name Last Name'")
    print("Only targeting Let's Sing Karaoke channel to avoid affecting other channels.")
    print("Focusing on ID3 tags only - filenames will not be changed.")
    print()
    if not MUTAGEN_AVAILABLE:
        print("ERROR: mutagen library not available!")
        print("Please install it with: pip install mutagen")
        return
    if args.preview:
        print("PREVIEW MODE - No changes will be made")
    else:
        print("APPLY MODE - Changes will be made")
    print()
    # Process external directory if specified
    if args.external:
        print(f"Scanning external directory: {args.external}")
        external_files = scan_external_directory(args.external, debug=args.debug)
        if external_files:
            print(f"\nFound {len(external_files)} files with 'Last Name, First Name' format in ID3 tags:")
            for file_info in external_files:
                print(f"  * {file_info['filename']}: '{file_info['old_artist']}' -> '{file_info['new_artist']}'")
            if args.apply:
                print("\nUpdating ID3 tags in external files...")
                updated_count = update_id3_tags_for_files(external_files, apply_changes=True)
                print(f"SUCCESS: Updated ID3 tags in {updated_count} external files")
            else:
                print(f"\nWould update ID3 tags in {len(external_files)} external files")
        else:
            print("SUCCESS: No files with 'Last Name, First Name' format found in ID3 tags")
    print("\n" + "=" * 60)
    print("Summary complete.")

if __name__ == "__main__":
    main()
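The "Last Name, First Name" → "First Name Last Name" rewrite this script applies to ID3 artist tags can be sketched as a small standalone helper. This is a hypothetical illustration, not the script's actual implementation; names containing more or fewer than one comma are deliberately left untouched so band names and multi-part credits are not mangled:

```python
def swap_artist_name(artist: str) -> str:
    """Convert 'Last, First' to 'First Last'; leave any other format untouched."""
    parts = [p.strip() for p in artist.split(",")]
    # Only swap when there are exactly two non-empty comma-separated parts.
    if len(parts) == 2 and all(parts):
        return f"{parts[1]} {parts[0]}"
    return artist

print(swap_artist_name("Sinatra, Frank"))  # Frank Sinatra
print(swap_artist_name("The Beatles"))     # The Beatles
```

In preview mode the script would only report `old_artist` → `new_artist` pairs produced by a transform like this; only `--apply` writes them back to the tags.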

View File

@@ -0,0 +1,151 @@
#!/usr/bin/env python3
"""
Script to reset karaoke tracking and re-download files with the new channel parser.
This script will:
1. Reset the karaoke_tracking.json to remove all downloaded entries
2. Optionally delete the downloaded files
3. Allow you to re-download with the new channel parser system
"""
import json
import os
import shutil
from pathlib import Path
from typing import List, Dict, Any
from karaoke_downloader.data_path_manager import get_data_path_manager
def reset_karaoke_tracking(tracking_file: str = None) -> None:
    """Reset the karaoke tracking file to an empty state."""
    if tracking_file is None:
        tracking_file = str(get_data_path_manager().get_karaoke_tracking_path())
    print(f"Resetting {tracking_file}...")
    # Create backup of current tracking
    backup_file = f"{tracking_file}.backup"
    if os.path.exists(tracking_file):
        shutil.copy2(tracking_file, backup_file)
        print(f"Created backup: {backup_file}")
    # Reset to empty state
    empty_tracking = {
        "playlists": {},
        "songs": {}
    }
    with open(tracking_file, 'w', encoding='utf-8') as f:
        json.dump(empty_tracking, f, indent=2, ensure_ascii=False)
    print(f"✅ Reset {tracking_file} to empty state")
def delete_downloaded_files(downloads_dir: str = "downloads") -> None:
    """Delete all downloaded files and folders."""
    if not os.path.exists(downloads_dir):
        print(f"Downloads directory {downloads_dir} does not exist.")
        return
    print(f"Deleting all files in {downloads_dir}...")
    try:
        shutil.rmtree(downloads_dir)
        print(f"✅ Deleted {downloads_dir} directory")
    except Exception as e:
        print(f"❌ Error deleting {downloads_dir}: {e}")
def show_download_stats(tracking_file: str = None) -> None:
    """Show statistics about current downloads."""
    if tracking_file is None:
        tracking_file = str(get_data_path_manager().get_karaoke_tracking_path())
    if not os.path.exists(tracking_file):
        print("No tracking file found.")
        return
    with open(tracking_file, 'r', encoding='utf-8') as f:
        tracking = json.load(f)
    songs = tracking.get("songs", {})
    total_songs = len(songs)
    if total_songs == 0:
        print("No songs in tracking file.")
        return
    # Count songs by status and by channel
    status_counts = {}
    channel_counts = {}
    for song_id, song_data in songs.items():
        status = song_data.get("status", "UNKNOWN")
        channel = song_data.get("channel_name", "UNKNOWN")
        status_counts[status] = status_counts.get(status, 0) + 1
        channel_counts[channel] = channel_counts.get(channel, 0) + 1
    print("\n📊 Current Download Statistics:")
    print(f"Total songs: {total_songs}")
    print("\nBy Status:")
    for status, count in status_counts.items():
        print(f"  {status}: {count}")
    print("\nBy Channel:")
    for channel, count in channel_counts.items():
        print(f"  {channel}: {count}")
def main():
    """Main function to handle the reset and re-download process."""
    print("🔄 Karaoke Download Reset and Re-download Tool")
    print("=" * 50)
    # Show current stats
    print("\nCurrent download statistics:")
    show_download_stats()
    # Ask the user what they want to do
    print("\nOptions:")
    print("1. Reset tracking only (keep files)")
    print("2. Reset tracking and delete all downloaded files")
    print("3. Show current stats only")
    print("4. Exit")
    choice = input("\nEnter your choice (1-4): ").strip()
    if choice == "1":
        print("\n🔄 Resetting tracking only...")
        reset_karaoke_tracking()
        print("\n✅ Tracking reset complete!")
        print("You can now re-download files with the new channel parser system.")
        print("\nTo re-download, run:")
        print("python download_karaoke.py --file data/channels.txt --limit 50")
    elif choice == "2":
        print("\n🔄 Resetting tracking and deleting files...")
        confirm = input("Are you sure you want to delete ALL downloaded files? (yes/no): ").strip().lower()
        if confirm == "yes":
            reset_karaoke_tracking()
            delete_downloaded_files()
            print("\n✅ Reset complete! All tracking and files have been removed.")
            print("You can now re-download files with the new channel parser system.")
            print("\nTo re-download, run:")
            print("python download_karaoke.py --file data/channels.txt --limit 50")
        else:
            print("Operation cancelled.")
    elif choice == "3":
        print("\n📊 Current statistics:")
        show_download_stats()
    elif choice == "4":
        print("Exiting...")
    else:
        print("Invalid choice. Please enter 1, 2, 3, or 4.")

if __name__ == "__main__":
    main()
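`show_download_stats` builds its counts with manual dictionaries; the same aggregation can be expressed more compactly with `collections.Counter`. A sketch against the same `"songs"` / `"status"` / `"channel_name"` schema the script reads (the sample payload below is invented for illustration):

```python
import json
from collections import Counter

def summarize_tracking(tracking_json: str):
    """Count songs by status and by channel from a karaoke_tracking.json payload."""
    songs = json.loads(tracking_json).get("songs", {})
    by_status = Counter(s.get("status", "UNKNOWN") for s in songs.values())
    by_channel = Counter(s.get("channel_name", "UNKNOWN") for s in songs.values())
    return by_status, by_channel

sample = ('{"playlists": {}, "songs": {'
          '"a1": {"status": "DOWNLOADED", "channel_name": "Sing King"}, '
          '"b2": {"status": "FAILED", "channel_name": "Sing King"}}}')
statuses, channels = summarize_tracking(sample)
print(statuses["DOWNLOADED"], channels["Sing King"])  # 1 2
```

`Counter` also gives missing keys a count of 0 and offers `most_common()` for free, which is handy when reporting the noisiest channels first.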

View File

@@ -1,11 +1,15 @@
 import json
 from pathlib import Path
+from karaoke_downloader.data_path_manager import get_data_path_manager

 def normalize_title(title):
     normalized = title.replace("(Karaoke Version)", "").replace("(Karaoke)", "").strip()
     return " ".join(normalized.split()).lower()

-def load_songlist(songlist_path="data/songList.json"):
+def load_songlist(songlist_path=None):
+    if songlist_path is None:
+        songlist_path = str(get_data_path_manager().get_songlist_path())
     songlist_file = Path(songlist_path)
     if not songlist_file.exists():
         print(f"⚠️ Songlist file not found: {songlist_path}")
@@ -24,14 +28,18 @@ def load_songlist(songlist_path="data/songList.json"):
         })
     return all_songs

-def load_songlist_tracking(tracking_path="data/songlist_tracking.json"):
+def load_songlist_tracking(tracking_path=None):
+    if tracking_path is None:
+        tracking_path = str(get_data_path_manager().get_songlist_tracking_path())
     tracking_file = Path(tracking_path)
     if not tracking_file.exists():
         return {}
     with open(tracking_file, 'r', encoding='utf-8') as f:
         return json.load(f)

-def load_server_songs(songs_path="data/songs.json"):
+def load_server_songs(songs_path=None):
+    if songs_path is None:
+        songs_path = str(get_data_path_manager().get_songs_path())
     """Load the list of songs already available on the server."""
     songs_file = Path(songs_path)
     if not songs_file.exists():
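The change in this diff swaps hardcoded path defaults for a `None` sentinel that is resolved at call time, so the data path manager is consulted on each call instead of freezing a path into the function signature at import. A standalone sketch of the pattern; `_default_songs_path` here is a hypothetical stand-in for `get_data_path_manager().get_songs_path()`:

```python
import json
from pathlib import Path

def _default_songs_path() -> Path:
    # Hypothetical stand-in for get_data_path_manager().get_songs_path()
    return Path("data") / "songs.json"

def load_server_songs(songs_path=None):
    """Load the songs file, resolving the default path only at call time."""
    if songs_path is None:
        songs_path = str(_default_songs_path())
    songs_file = Path(songs_path)
    if not songs_file.exists():
        return []
    with open(songs_file, 'r', encoding='utf-8') as f:
        return json.load(f)

print(load_server_songs("/nonexistent/songs.json"))  # []
```

Resolving inside the function also means tests can patch the resolver, which a default evaluated at definition time would not allow.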