diff --git a/PRD.md b/PRD.md index 299780f..c70a247 100644 --- a/PRD.md +++ b/PRD.md @@ -415,6 +415,103 @@ The codebase has been comprehensively refactored to improve maintainability and - **Flexible control**: Use with limits, parallel processing, and channel targeting - **Clear progress feedback**: Real-time progress tracking for large downloads +## šŸ”§ Recent Bug Fixes & Improvements (v3.4.5) +### **Unified Download Workflow Architecture** +- **Unified execution pipeline**: All download modes now use the same execution workflow, eliminating inconsistencies and broken pipelines +- **Consistent behavior**: All modes (--channel-focus, --all-videos, --songlist-only, --latest-per-channel) use identical download execution, progress tracking, and error handling +- **Centralized download logic**: Single `execute_unified_download_workflow()` method handles all download execution +- **Automatic parallel support**: All download modes automatically support `--parallel --workers N` without additional implementation +- **Unified cache management**: Consistent progress tracking and resume functionality across all modes + +### **Architecture Pattern for New Download Modes** +When adding new download modes in the future, follow this pattern to ensure consistency: + +#### **1. Download Plan Building (Mode-Specific)** +Each download mode should build a download plan (list of videos to download) with this structure: +```python +download_plan = [ + { + "video_id": "video_id", + "artist": "artist_name", + "title": "song_title", + "filename": "sanitized_filename.mp4", + "channel_name": "channel_name", + "video_title": "original_video_title", + "force_download": False + } +] +``` + +#### **2. Unified Execution (Shared)** +All modes should use the unified execution workflow: +```python +downloaded_count, success = self.execute_unified_download_workflow( + download_plan=download_plan, + cache_file=cache_file, # Optional, for progress tracking + limit=limit, # Optional, for limiting downloads + show_progress=True, # Optional, for progress display +) +``` + +#### **3. Execution Method Selection (Automatic)** +The unified workflow automatically chooses execution method based on settings: +- **Sequential**: Uses `DownloadPipeline` for single-threaded downloads +- **Parallel**: Uses `ParallelDownloader` when `--parallel` is enabled + +#### **4. Required Implementation Pattern** +```python +def download_new_mode(self, ...): + """New download mode implementation.""" + + # 1. Build download plan (mode-specific logic) + download_plan = [] + for video in videos_to_download: + download_plan.append({ + "video_id": video["id"], + "artist": artist, + "title": title, + "filename": filename, + "channel_name": channel_name, + "video_title": video["title"], + "force_download": force_download + }) + + # 2. Create cache file (optional, for progress tracking) + cache_file = get_download_plan_cache_file("new_mode", **plan_kwargs) + save_plan_cache(cache_file, download_plan, []) + + # 3. Use unified execution workflow + downloaded_count, success = self.execute_unified_download_workflow( + download_plan=download_plan, + cache_file=cache_file, + limit=limit, + show_progress=True, + ) + + return success +``` + +### **Benefits of Unified Architecture** +- **Consistency**: All modes behave identically for execution, progress tracking, and error handling +- **Maintainability**: Changes to download execution only need to be made in one place +- **Reliability**: Eliminates broken pipelines and inconsistent behavior between modes +- **Extensibility**: New modes automatically get all existing features (parallel downloads, progress tracking, etc.) +- **Testing**: Easier to test since all modes use the same execution logic + +### **What Was Fixed** +- **Broken Pipeline**: Previously, different modes used different execution paths, leading to inconsistencies +- **Missing Method**: Added missing `download_latest_per_channel()` method that was referenced in CLI but not implemented +- **Code Duplication**: Eliminated duplicate download execution logic across different modes +- **Inconsistent Behavior**: All modes now have identical progress tracking, error handling, and cache management + +### **Future Development Guidelines** +1. **NEVER implement custom download execution logic** in new download modes +2. **ALWAYS use `execute_unified_download_workflow()`** for download execution +3. **Focus on download plan building** - that's where mode-specific logic belongs +4. **Use the standard download plan structure** for consistency +5. **Implement cache file handling** for progress tracking and resume functionality +6. **Test with both sequential and parallel modes** to ensure compatibility + --- ## šŸš€ Future Enhancements diff --git a/README.md b/README.md index 70c364c..3851580 100644 --- a/README.md +++ b/README.md @@ -49,47 +49,82 @@ The codebase has been comprehensively refactored into a modular architecture wit - **`tracking_cli.py`**: Tracking management CLI ### New Utility Modules (v3.3): -- **`parallel_downloader.py`**: Parallel download management with thread-safe operations - - `ParallelDownloader` class: Manages concurrent downloads with configurable workers - - `DownloadTask` and `DownloadResult` dataclasses: Structured task and result management - - Thread-safe progress tracking and error handling - - Automatic retry mechanism for failed downloads - **`file_utils.py`**: Centralized file operations, filename sanitization, and file validation - - `sanitize_filename()`: Create safe filenames from artist/title - - `generate_possible_filenames()`: Generate filename patterns for different modes - - `check_file_exists_with_patterns()`: Check for existing files using multiple patterns - - `is_valid_mp4_file()`: Validate MP4 files with header checking - - `cleanup_temp_files()`: Remove temporary yt-dlp files - - `ensure_directory_exists()`: Safe directory creation +- **`song_validator.py`**: Centralized song validation logic for checking if songs should be downloaded -- **`song_validator.py`**: Centralized song validation logic - - `SongValidator` class: Unified logic for checking if songs should be downloaded - - `should_skip_song()`: Comprehensive validation with multiple criteria - - `mark_song_failed()`: Consistent failure tracking - - `handle_download_failure()`: Standardized error handling +### **Unified Download Workflow (v3.4.5)** +- **`execute_unified_download_workflow()`**: Centralized download execution that all modes use +- **`_execute_sequential_downloads()`**: Sequential download execution using DownloadPipeline +- **`_execute_parallel_downloads()`**: Parallel download execution using ParallelDownloader -- **Enhanced `config_manager.py`**: Robust configuration management with dataclasses - - `ConfigManager` class: Type-safe configuration loading and caching - - `DownloadSettings`, `FolderStructure`, `LoggingConfig` dataclasses - - Configuration validation and merging with defaults - - Dynamic resolution updates - -### Benefits: +### **Benefits of Enhanced Modular Architecture:** +- **Single Responsibility**: Each module has a focused purpose - **Centralized Utilities**: Common operations (file operations, song validation, yt-dlp commands, error handling) are centralized - **Reduced Duplication**: Eliminated ~150 lines of code duplication across modules +- **Testability**: Individual components can be tested separately +- **Maintainability**: Easier to find and fix issues +- **Reusability**: Components can be used independently +- **Robustness**: Better error handling and interruption recovery - **Consistency**: Standardized error messages and processing pipelines -- **Maintainability**: Changes isolated to specific modules -- **Testability**: Modular components can be tested independently - **Type Safety**: Comprehensive type hints across all new modules +- **Unified Execution**: All download modes use the same execution pipeline for consistency + +## šŸ”§ Development Guidelines + +### **Adding New Download Modes** +When adding new download modes, follow the unified workflow pattern to ensure consistency: + +#### **1. Build Download Plan (Mode-Specific)** +```python +def download_new_mode(self, ...): + # Build download plan with standard structure + download_plan = [] + for video in videos_to_download: + download_plan.append({ + "video_id": video["id"], + "artist": artist, + "title": title, + "filename": filename, + "channel_name": channel_name, + "video_title": video["title"], + "force_download": force_download + }) + + # Use unified execution workflow + downloaded_count, success = self.execute_unified_download_workflow( + download_plan=download_plan, + cache_file=cache_file, + limit=limit, + show_progress=True, + ) + + return success +``` + +#### **2. Key Principles** +- **NEVER implement custom download execution logic** - always use `execute_unified_download_workflow()` +- **Focus on download plan building** - that's where mode-specific logic belongs +- **Use the standard download plan structure** for consistency +- **Implement cache file handling** for progress tracking and resume functionality +- **Test with both sequential and parallel modes** to ensure compatibility + +#### **3. Benefits of Unified Architecture** +- **Consistency**: All modes behave identically for execution, progress tracking, and error handling +- **Automatic Features**: New modes automatically get parallel downloads, progress tracking, and cache management +- **Maintainability**: Changes to download execution only need to be made in one place +- **Reliability**: Eliminates broken pipelines and inconsistent behavior between modes ## šŸ”§ Recent Improvements (v3.4.1) ### **Enhanced Fuzzy Matching** -- **Improved video title parsing**: The `extract_artist_title` function now handles multiple title formats: - - `"Title Karaoke | Artist Karaoke Version"` → Artist: "38 Special", Title: "Hold On Loosely" - - `"Title Artist KARAOKE"` → Attempts to extract artist from complex titles - - `"Artist - Title"` → Standard format (unchanged) -- **Consolidated parsing logic**: All modules now use the same `extract_artist_title` function from `fuzzy_matcher.py` -- **Better matching accuracy**: Reduced false negatives for songs with non-standard title formats +- **Improved title parsing**: Enhanced `extract_artist_title` function to handle multiple video title formats +- **Better matching accuracy**: Can now parse titles like "Hold On Loosely Karaoke | 38 Special Karaoke Version" +- **Consistent parsing**: All modules now use the same parsing logic from `fuzzy_matcher.py` +- **Reduced false negatives**: Songs that previously couldn't be matched due to title format differences now have a higher chance of being found + +### **Fixed Import Conflicts** +- **Resolved import conflicts**: Updated modules to use the enhanced `extract_artist_title` from `fuzzy_matcher.py` +- **Consistent behavior**: All parts of the system use the same parsing logic +- **Cleaner codebase**: Eliminated duplicate code and import conflicts ### **Fixed --limit Parameter** - **Correct limit application**: The `--limit` parameter now properly limits the scanning phase, not just downloads @@ -101,6 +136,27 @@ The codebase has been comprehensively refactored into a modular architecture wit - **Fixed import conflicts**: Resolved inconsistencies between different parsing implementations - **Single source of truth**: All title parsing logic is now centralized in `fuzzy_matcher.py` +## šŸ”§ Recent Improvements (v3.4.5) +### **Unified Download Workflow Architecture** +- **Unified execution pipeline**: All download modes now use the same execution workflow, eliminating inconsistencies and broken pipelines +- **Consistent behavior**: All modes (--channel-focus, --all-videos, --songlist-only, --latest-per-channel) use identical download execution, progress tracking, and error handling +- **Centralized download logic**: Single `execute_unified_download_workflow()` method handles all download execution +- **Automatic parallel support**: All download modes automatically support `--parallel --workers N` without additional implementation +- **Unified cache management**: Consistent progress tracking and resume functionality across all modes + +### **What Was Fixed** +- **Broken Pipeline**: Previously, different modes used different execution paths, leading to inconsistencies +- **Missing Method**: Added missing `download_latest_per_channel()` method that was referenced in CLI but not implemented +- **Code Duplication**: Eliminated duplicate download execution logic across different modes +- **Inconsistent Behavior**: All modes now have identical progress tracking, error handling, and cache management + +### **Benefits** +- āœ… **Consistency**: All modes behave identically for execution, progress tracking, and error handling +- āœ… **Maintainability**: Changes to download execution only need to be made in one place +- āœ… **Reliability**: Eliminates broken pipelines and inconsistent behavior between modes +- āœ… **Extensibility**: New modes automatically get all existing features (parallel downloads, progress tracking, etc.) +- āœ… **Testing**: Easier to test since all modes use the same execution logic + ## šŸ›”ļø Duplicate File Prevention & Filename Consistency (v3.4.2) ### **Duplicate File Prevention** - **Enhanced file existence checking**: Now detects files with `(2)`, `(3)`, etc. suffixes that yt-dlp creates diff --git a/karaoke_downloader/downloader.py b/karaoke_downloader/downloader.py index d69dbff..2b8c2f7 100644 --- a/karaoke_downloader/downloader.py +++ b/karaoke_downloader/downloader.py @@ -443,43 +443,16 @@ class KaraokeDownloader: print(f" šŸŽÆ Limit applied: {limit} videos") print(f" šŸ“ Output directory: downloads/{channel_name}/") print(f" šŸ’¾ Download plan cached to: {cache_file.name}") - print(f"\nšŸŽ¬ Starting downloads...") - # Download videos using the download pipeline - pipeline = DownloadPipeline( - yt_dlp_path=str(self.yt_dlp_path), - config=self.config, - downloads_dir=self.downloads_dir, - songlist_tracking=self.songlist_tracking, - tracker=self.tracker, + # Use unified download workflow + downloaded_count, success = self.execute_unified_download_workflow( + download_plan=download_plan, + cache_file=cache_file, + limit=limit, + show_progress=True, ) - success_count = 0 - total_to_download = len(download_plan) - for i, plan_item in enumerate(download_plan[:], 1): # Use slice to create a copy for iteration - print(f"ā¬‡ļø Downloading {i}/{total_to_download}: {plan_item['artist']} - {plan_item['title']}") - - if pipeline.execute_pipeline( - video_id=plan_item["video_id"], - artist=plan_item["artist"], - title=plan_item["title"], - channel_name=plan_item["channel_name"], - video_title=plan_item["video_title"], - ): - success_count += 1 - # Remove completed item from cache - download_plan.remove(plan_item) # Remove the current item - if download_plan: # If there are still items left - save_plan_cache(cache_file, download_plan, []) - print(f"šŸ—‘ļø Removed completed item from download plan. {len(download_plan)} items remaining.") - else: - # All downloads completed, delete the cache file - from karaoke_downloader.cache_manager import delete_plan_cache - delete_plan_cache(cache_file) - print("šŸ—‘ļø All downloads completed, deleted download plan cache.") - - print(f"\nšŸŽ‰ Download complete! {success_count}/{total_to_download} videos downloaded successfully") - return success_count > 0 + return success def download_songlist_across_channels( self, @@ -718,14 +691,134 @@ class KaraokeDownloader: print(f" ...and {len(unmatched)-DEFAULT_DISPLAY_LIMIT} more.") # --- Download phase --- - downloaded_count, success = self.execute_download_plan_parallel( + downloaded_count, success = self.execute_unified_download_workflow( download_plan=download_plan, - unmatched=unmatched, cache_file=cache_file, limit=limit, ) return success + def download_latest_per_channel( + self, + channel_urls, + limit=None, + force_refresh_download_plan=False, + fuzzy_match=False, + fuzzy_threshold=DEFAULT_FUZZY_THRESHOLD, + force_download=False, + ): + """ + Download the latest N videos from each channel. + + Args: + channel_urls: List of channel URLs to process + limit: Number of latest videos to download from each channel + force_refresh_download_plan: Force refresh the download plan cache + fuzzy_match: Whether to use fuzzy matching + fuzzy_threshold: Threshold for fuzzy matching + force_download: Force download regardless of existing files + + Returns: + bool: True if successful, False otherwise + """ + print(f"\nšŸŽ¬ Downloading latest {limit} videos from {len(channel_urls)} channels") + + # Build download plan for latest videos from each channel + download_plan = [] + total_videos_found = 0 + + for i, channel_url in enumerate(channel_urls, 1): + print(f"\n🚦 Processing channel {i}/{len(channel_urls)}: {channel_url}") + + # Get channel info + channel_name, channel_id = get_channel_info(channel_url) + print(f" āœ… Channel: {channel_name}") + + # Get videos from channel + available_videos = self.tracker.get_channel_video_list( + channel_url, + str(self.yt_dlp_path), + force_refresh=False + ) + + if not available_videos: + print(f" āš ļø No videos found for {channel_name}") + continue + + print(f" šŸ“Š Found {len(available_videos)} videos") + + # Take the latest N videos (they're already sorted by date) + latest_videos = available_videos[:limit] if limit else available_videos + print(f" šŸŽÆ Processing latest {len(latest_videos)} videos") + + # Process each video + for video in latest_videos: + video_title = video["title"] + video_id = video["id"] + + # Extract artist and title + artist, extracted_title = self.channel_parser.extract_artist_title(video_title, channel_name) + if not artist and not extracted_title: + # Fallback: use the full title + artist = "" + extracted_title = video_title + + # Create filename + filename = sanitize_filename(artist, extracted_title) + + # Add to download plan + download_plan.append({ + "video_id": video_id, + "artist": artist, + "title": extracted_title, + "filename": filename, + "channel_name": channel_name, + "video_title": video_title, + "force_download": force_download + }) + + total_videos_found += 1 + + print(f"\nšŸ“‹ Download plan created: {total_videos_found} videos from {len(channel_urls)} channels") + + if not download_plan: + print("āŒ No videos to download") + return False + + # Create cache file for progress tracking + import hashlib + from karaoke_downloader.cache_manager import get_download_plan_cache_file, save_plan_cache + + plan_kwargs = { + "mode": "latest_per_channel", + "channels": len(channel_urls), + "limit_per_channel": limit, + "force_download": force_download, + } + + # Add channel URLs hash to ensure same channels = same cache + channels_hash = hashlib.md5( + "|".join(sorted(channel_urls)).encode() + ).hexdigest()[:8] + plan_kwargs["channels_hash"] = channels_hash + + cache_file = get_download_plan_cache_file("latest_per_channel", **plan_kwargs) + + # Save the plan to cache + save_plan_cache(cache_file, download_plan, []) # No unmatched for latest-per-channel mode + + print(f"šŸ’¾ Download plan cached to: {cache_file.name}") + + # Use unified download workflow + downloaded_count, success = self.execute_unified_download_workflow( + download_plan=download_plan, + cache_file=cache_file, + limit=None, # Limit already applied during plan building + show_progress=True, + ) + + return success + def _process_videos_for_download(self, available_videos, channel_name, force_refresh=False, fuzzy_match=False, fuzzy_threshold=DEFAULT_FUZZY_THRESHOLD, force_download=False): """Process videos for download (used for both manual and regular channels).""" songlist = load_songlist(self.songlist_file_path) @@ -841,49 +934,39 @@ class KaraokeDownloader: matches = matches[:limit] print(f"šŸŽÆ Limiting to {len(matches)} downloads") - # Download matched videos - success_count = 0 - for i, match in enumerate(matches, 1): + # Convert matches to a download plan + download_plan = [] + for match in matches: video = match["video"] song = match["song"] artist = match["artist"] title = match["title"] video_id = video["id"] - print(f"\nā¬‡ļø Downloading {i}/{len(matches)}: {artist} - {title}") - print(f" šŸŽ¬ Video: {video['title']} ({channel_name})") - if match["match_type"] == "fuzzy": - print(f" šŸŽÆ Match Score: {match['match_score']:.1f}%") - # Create filename filename = sanitize_filename(artist, title) output_path = self.downloads_dir / channel_name / filename - # Use the download pipeline - pipeline = DownloadPipeline( - yt_dlp_path=str(self.yt_dlp_path), - config=self.config, - downloads_dir=self.downloads_dir, - songlist_tracking=self.songlist_tracking, - tracker=self.tracker, - ) - - success = pipeline.execute_pipeline( - video_id=video_id, - artist=artist, - title=title, - channel_name=channel_name, - video_title=video["title"] - ) - - if success: - success_count += 1 - print(f"āœ… Successfully downloaded: {artist} - {title}") - else: - print(f"āŒ Failed to download: {artist} - {title}") + # Add to download plan + download_plan.append({ + "video_id": video_id, + "artist": artist, + "title": title, + "filename": filename, + "channel_name": channel_name, + "video_title": video["title"], + "force_download": force_download + }) - print(f"\nšŸŽ‰ Download complete! {success_count}/{len(matches)} videos downloaded successfully") - return success_count > 0 + # Use the unified download workflow + downloaded_count, success = self.execute_unified_download_workflow( + download_plan=download_plan, + cache_file=None, # No specific cache file for this mode + limit=limit, + show_progress=True, + ) + + return success def _download_single_video(self, video, channel_name, filename, force_download=False): """Download a single video using the download pipeline.""" @@ -923,6 +1006,138 @@ class KaraokeDownloader: return success + def execute_unified_download_workflow( + self, + download_plan, + cache_file=None, + limit=None, + show_progress=True, + ): + """ + Unified download workflow that all download modes use. + + Args: + download_plan: List of download items with video_id, artist, title, channel_name, video_title + cache_file: Optional cache file for progress tracking + limit: Optional limit on number of downloads + show_progress: Whether to show progress information + + Returns: + tuple: (downloaded_count, success) + """ + if not download_plan: + print("šŸ“‹ No videos to download in plan") + return 0, True + + total_to_download = len(download_plan) + if limit: + total_to_download = min(limit, total_to_download) + download_plan = download_plan[:limit] + + if show_progress: + print(f"\nšŸŽ¬ Starting downloads: {total_to_download} videos") + print(f" šŸ“ Output directory: downloads/") + if cache_file: + print(f" šŸ’¾ Progress tracking: {cache_file.name}") + + # Choose execution method based on parallel settings + if self.enable_parallel_downloads: + return self._execute_parallel_downloads(download_plan, cache_file, show_progress) + else: + return self._execute_sequential_downloads(download_plan, cache_file, show_progress) + + def _execute_sequential_downloads(self, download_plan, cache_file, show_progress): + """Execute downloads sequentially using the download pipeline.""" + success_count = 0 + total_to_download = len(download_plan) + + # Create download pipeline + pipeline = DownloadPipeline( + yt_dlp_path=str(self.yt_dlp_path), + config=self.config, + downloads_dir=self.downloads_dir, + songlist_tracking=self.songlist_tracking, + tracker=self.tracker, + ) + + for i, plan_item in enumerate(download_plan, 1): + if show_progress: + print(f"\nā¬‡ļø Downloading {i}/{total_to_download}: {plan_item['artist']} - {plan_item['title']}") + print(f" šŸŽ¬ Video: {plan_item['video_title']} ({plan_item['channel_name']})") + + success = pipeline.execute_pipeline( + video_id=plan_item["video_id"], + artist=plan_item["artist"], + title=plan_item["title"], + channel_name=plan_item["channel_name"], + video_title=plan_item["video_title"], + ) + + if success: + success_count += 1 + if show_progress: + print(f"āœ… Successfully downloaded: {plan_item['artist']} - {plan_item['title']}") + else: + if show_progress: + print(f"āŒ Failed to download: {plan_item['artist']} - {plan_item['title']}") + + # Update cache if provided + if cache_file: + # Remove completed item from plan and update cache + download_plan.remove(plan_item) + from karaoke_downloader.cache_manager import save_plan_cache + save_plan_cache(cache_file, download_plan, []) # No unmatched for unified workflow + + if not download_plan: # All downloads completed + from karaoke_downloader.cache_manager import delete_plan_cache + delete_plan_cache(cache_file) + if show_progress: + print("šŸ—‘ļø All downloads completed, deleted download plan cache.") + + if show_progress: + print(f"\nšŸŽ‰ Download complete! {success_count}/{total_to_download} videos downloaded successfully") + + return success_count, success_count > 0 + + def _execute_parallel_downloads(self, download_plan, cache_file, show_progress): + """Execute downloads in parallel using the parallel downloader.""" + from karaoke_downloader.parallel_downloader import create_parallel_downloader + + # Create parallel downloader + parallel_downloader = create_parallel_downloader( + yt_dlp_path=str(self.yt_dlp_path), + config=self.config, + downloads_dir=self.downloads_dir, + max_workers=self.parallel_workers, + songlist_tracking=self.songlist_tracking, + tracker=self.tracker, + ) + + # Convert download plan to tasks + tasks = [] + for item in download_plan: + from karaoke_downloader.parallel_downloader import DownloadTask + task = DownloadTask( + video_id=item["video_id"], + artist=item["artist"], + title=item["title"], + channel_name=item["channel_name"], + video_title=item["video_title"], + ) + tasks.append(task) + + # Execute parallel downloads + results = parallel_downloader.execute_downloads(tasks) + + # Count successes + success_count = sum(1 for result in results if result.success) + total_to_download = len(tasks) + + if show_progress: + print(f"\nšŸŽ‰ Parallel download complete! {success_count}/{total_to_download} videos downloaded successfully") + + return success_count, success_count > 0 + def reset_songlist_all(): """Delete all files tracked in songlist_tracking.json, clear songlist_tracking.json, and remove songlist songs from karaoke_tracking.json."""