Signed-off-by: mbrucedogs <mbrucedogs@gmail.com>

This commit is contained in:
mbrucedogs 2025-07-29 08:45:12 -05:00
parent ec95b24a69
commit 42e7a6a09c
9 changed files with 8490 additions and 592 deletions

PRD.md

@ -197,6 +197,8 @@ KaroakeVideoDownloader/
- `--parallel`: **Enable parallel downloads for improved speed**
- `--workers <N>`: **Number of parallel download workers (1-10, default: 3, only used with --parallel)**
- `--manual`: **Download from manual videos collection (data/manual_videos.json)**
- `--channel-focus <CHANNEL_NAME>`: **Download from a specific channel by name (e.g., 'SingKingKaraoke')**
- `--all-videos`: **Download all videos from channel (not just songlist matches), skipping existing files and songs in songs.json**
---
@ -389,3 +391,43 @@ The codebase has been comprehensively refactored to improve maintainability and
- **Update documentation immediately when implementing new features**
- **Remove outdated information and consolidate related changes**
- **Ensure all CLI options and features are documented in both files**
## 🔧 Recent Bug Fixes & Improvements (v3.4.4)
### **All Videos Download Mode**
- **New `--all-videos` parameter**: Download all videos from a channel, not just songlist matches
- **Smart MP3/MP4 detection**: Automatically detects if you have MP3 versions in songs.json and downloads MP4 video versions
- **Existing file skipping**: Skips videos that already exist on the filesystem
- **Progress tracking**: Shows clear progress with "Downloading X/Y videos" format
- **Parallel processing support**: Works with `--parallel --workers N` for faster downloads
- **Channel focus integration**: Works with `--channel-focus` to target specific channels
- **Limit support**: Works with `--limit N` to control download batch size
### **Smart Songlist Integration**
- **MP4 version detection**: Checks if MP4 version already exists in songs.json before downloading
- **MP3 upgrade path**: Downloads MP4 video versions when only MP3 versions exist in songlist
- **Duplicate prevention**: Skips downloads when MP4 versions already exist
- **Efficient filtering**: Only processes videos that need to be downloaded
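The rules above boil down to one decision per video: skip when an MP4 version is already tracked, upgrade when only an MP3 entry exists, otherwise download. A minimal sketch of that decision (the helper name and the `is_mp4` entry shape are illustrative, not the project's actual API):

```python
def decide(song_key: str, songlist_keys: set, tracked: dict, file_exists: bool) -> str:
    """Sketch of the skip/upgrade decision for one channel video (hypothetical helper)."""
    if file_exists:
        return "skip: file already on disk"
    if song_key in songlist_keys and song_key in tracked:
        if tracked[song_key].get("is_mp4", False):
            return "skip: MP4 already in songs.json"
        # Only an MP3 entry exists, so fetch the MP4 video version
        return "download: upgrade MP3 entry to MP4"
    return "download: new video"

tracked = {"AC/DC - Big Balls": {"is_mp4": False}}
print(decide("AC/DC - Big Balls", set(tracked), tracked, file_exists=False))
# → download: upgrade MP3 entry to MP4
```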
### **Benefits of All Videos Mode**
- **Complete channel downloads**: Download entire channels without songlist restrictions
- **Automatic format upgrading**: Upgrade MP3 collections to MP4 video versions
- **Efficient processing**: Only downloads videos that don't already exist
- **Flexible control**: Use with limits, parallel processing, and channel targeting
- **Clear progress feedback**: Real-time progress tracking for large downloads
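In this mode `--limit` is applied after the existing-file and songs.json filters, as a plain slice over the surviving candidates, so "first 100 videos" means the first 100 that actually need downloading. A sketch of that ordering (the list contents are made up):

```python
# Candidates that already survived the existing-file / songs.json filters
videos_to_download = [f"video_{i}" for i in range(250)]

limit = 100  # e.g. from --limit 100; None means no limit
if limit is not None:
    videos_to_download = videos_to_download[:limit]

print(len(videos_to_download))  # → 100
```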
---
## 🚀 Future Enhancements
- [ ] Web UI for easier management
- [ ] More advanced song matching (multi-language)
- [ ] Download scheduling and retry logic
- [ ] More granular status reporting
- [x] **Parallel downloads for improved speed** ✅ **COMPLETED**
- [x] **Enhanced fuzzy matching with improved video title parsing** ✅ **COMPLETED**
- [x] **Consolidated extract_artist_title function** ✅ **COMPLETED**
- [x] **Duplicate file prevention and filename consistency** ✅ **COMPLETED**
- [ ] Unit tests for all modules
- [ ] Integration tests for end-to-end workflows
- [ ] Plugin system for custom file operations
- [ ] Advanced configuration UI
- [ ] Real-time download progress visualization


@ -145,6 +145,21 @@ python data/cleanup_duplicate_files.py
python download_karaoke.py https://www.youtube.com/@SingKingKaraoke/videos
```
### Download ALL Videos from a Channel (Not Just Songlist Matches)
```bash
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos
```
### Download ALL Videos with Parallel Processing
```bash
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --parallel --workers 10
```
### Download ALL Videos with Limit
```bash
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --limit 100
```
### Download Only Songlist Songs (Fast Mode)
```bash
python download_karaoke.py --songlist-only --limit 5
@ -325,6 +340,8 @@ KaroakeVideoDownloader/
- `--generate-songlist <DIR1> <DIR2>...`: **Generate song list from MP4 files with ID3 tags in specified directories**
- `--no-append-songlist`: **Create a new song list instead of appending when using --generate-songlist**
- `--force`: **Force download from channels, bypassing all existing file checks and re-downloading if necessary**
- `--channel-focus <CHANNEL_NAME>`: **Download from a specific channel by name (e.g., 'SingKingKaraoke')**
- `--all-videos`: **Download all videos from channel (not just songlist matches), skipping existing files**
## 📝 Example Usage
@ -358,6 +375,11 @@ python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist
python download_karaoke.py --clear-cache all
python download_karaoke.py --clear-server-duplicates
# Download ALL videos from a specific channel
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --parallel --workers 10
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --limit 100
# Song list generation from MP4 files
python download_karaoke.py --generate-songlist /path/to/mp4/directory
python download_karaoke.py --generate-songlist /path/to/dir1 /path/to/dir2 --no-append-songlist


@ -1,6 +1,6 @@
# 🎤 Karaoke Video Downloader - CLI Commands Reference
# Copy and paste these commands into your terminal
# Updated: v3.4.4 (includes all videos download mode, manual video collection, channel parsing rules, and all previous improvements)

## 📥 BASIC DOWNLOADS
@ -45,6 +45,26 @@ python add_manual_video.py list
# Remove a video from manual collection
python add_manual_video.py remove "Artist - Song Title (Karaoke Version)"
## 🎬 ALL VIDEOS DOWNLOAD MODE (v3.4.4)
# Download ALL videos from a specific channel (not just songlist matches)
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos
# Download ALL videos with parallel processing for speed
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --parallel --workers 10
# Download ALL videos with limit (download first N videos)
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --limit 100
# Download ALL videos with parallel processing and limit
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --parallel --workers 5 --limit 50
# Download ALL videos from ZoomKaraokeOfficial channel
python download_karaoke.py --channel-focus ZoomKaraokeOfficial --all-videos
# Download ALL videos with custom resolution
python download_karaoke.py --channel-focus SingKingKaraoke --all-videos --resolution 1080p
## 📋 SONG LIST GENERATION
# Generate song list from MP4 files in a directory (append to existing song list)


@ -20038,6 +20038,6 @@
"id": "iPt5vUHVROI"
}
],
"last_updated": "2025-07-28T20:36:32.831332",
"video_count": 5009
}


@ -38974,6 +38974,6 @@
"id": "_38gtxtjvPI"
}
],
"last_updated": "2025-07-28T20:50:52.833995",
"video_count": 9743
}


@ -26,6 +26,46 @@
"upload_date": "2024-01-01",
"duration": 180,
"view_count": 1000
},
{
"title": "AC/DC - Big Balls",
"url": "https://www.youtube.com/watch?v=kiSDpVmu4Bk",
"id": "kiSDpVmu4Bk",
"upload_date": "2024-01-01",
"duration": 180,
"view_count": 1000
},
{
"title": "Jon Bon Jovi - Blaze of Glory",
"url": "https://www.youtube.com/watch?v=SzRAoDMlQY",
"id": "SzRAoDMlQY",
"upload_date": "2024-01-01",
"duration": 180,
"view_count": 1000
},
{
"title": "ZZ Top - Sharp Dressed Man",
"url": "https://www.youtube.com/watch?v=prRalwto9iY",
"id": "prRalwto9iY",
"upload_date": "2024-01-01",
"duration": 180,
"view_count": 1000
},
{
"title": "Nickelback - Photograph",
"url": "https://www.youtube.com/watch?v=qTphCTAUhUg",
"id": "qTphCTAUhUg",
"upload_date": "2024-01-01",
"duration": 180,
"view_count": 1000
},
{
"title": "Billy Joel - Shes Got A Way",
"url": "https://www.youtube.com/watch?v=DeeTFIgKuC8",
"id": "DeeTFIgKuC8",
"upload_date": "2024-01-01",
"duration": 180,
"view_count": 1000
}
],
"parsing_rules": {

File diff suppressed because it is too large


@ -61,35 +61,37 @@ def load_channels_from_text(channels_file: str = "data/channels.txt") -> List[st
def load_channels(channel_file: str = None) -> List[str]:
    """Load channel URLs from file."""
    if channel_file is None:
        # Try JSON first, then fall back to text
        if os.path.exists("data/channels.json"):
            return load_channels_from_json("data/channels.json")
        elif os.path.exists("data/channels.txt"):
            return load_channels_from_text("data/channels.txt")
        else:
            return []
    else:
        if channel_file.endswith(".json"):
            return load_channels_from_json(channel_file)
        else:
            return load_channels_from_text(channel_file)


def get_channel_url_by_name(channel_name: str) -> str:
    """Look up a channel URL by its name from the channels configuration."""
    channel_urls = load_channels()

    # Normalize the channel name for comparison
    normalized_name = channel_name.lower().replace("@", "").replace("karaoke", "").strip()

    for url in channel_urls:
        # Extract channel name from URL
        if "/@" in url:
            url_channel_name = url.split("/@")[1].split("/")[0].lower()
            if url_channel_name == normalized_name or url_channel_name.replace("karaoke", "").strip() == normalized_name:
                return url
    return None
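The lookup hinges on its normalization step: both the query and the URL handle are lowercased and the literal substring "karaoke" is dropped, so `SingKing` matches `@SingKingKaraoke`. A standalone sketch of that matching (the URL list is illustrative):

```python
def normalize(name: str) -> str:
    # Mirror of the lookup's normalization: lowercase, drop '@' and 'karaoke'
    return name.lower().replace("@", "").replace("karaoke", "").strip()

def find_url(channel_name: str, urls: list) -> str:
    target = normalize(channel_name)
    for url in urls:
        if "/@" in url:
            handle = url.split("/@")[1].split("/")[0].lower()
            # Match either the raw handle or the handle with 'karaoke' stripped
            if handle == target or handle.replace("karaoke", "").strip() == target:
                return url
    return None

urls = ["https://www.youtube.com/@SingKingKaraoke/videos"]
print(find_url("SingKing", urls))         # matches via the 'karaoke'-stripped handle
print(find_url("@singkingkaraoke", urls)) # '@' and case are normalized away
```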
def main():
@ -100,9 +102,10 @@ def main():
Examples:
  python download_karaoke.py --limit 10                                        # Download latest 10 videos from all channels
  python download_karaoke.py --songlist-only --limit 10                        # Download only songlist songs across channels
  python download_karaoke.py --channel-focus SingKingKaraoke --limit 5         # Download from specific channel
  python download_karaoke.py --channel-focus SingKingKaraoke --all-videos      # Download ALL videos from channel
  python download_karaoke.py https://www.youtube.com/@SingKingKaraoke/videos   # Download from specific channel URL
  python download_karaoke.py --file data/channels.txt                          # Download from custom channel list
  python download_karaoke.py --manual --limit 5                                # Download from manual videos collection
  python download_karaoke.py --reset-channel SingKingKaraoke --delete-files
""",
)
@ -298,6 +301,16 @@ Examples:
action="store_true",
help="Download from manual videos collection (data/manual_videos.json)",
)
parser.add_argument(
    "--channel-focus",
    type=str,
    help="Download from a specific channel by name (e.g., 'SingKingKaraoke')",
)
parser.add_argument(
    "--all-videos",
    action="store_true",
    help="Download all videos from channel (not just songlist matches), skipping existing files",
)
args = parser.parse_args()

# Validate workers argument
@ -428,9 +441,10 @@ Examples:
print("💾 Channel Cache Information")
print("=" * 40)
print(f"Total Channels: {cache_info['total_channels']}")
print(f"Total Cached Videos: {cache_info['total_videos']}")
print("\n📋 Channel Details:")
for channel in cache_info['channels']:
    print(f"{channel['channel']}: {channel['videos']} videos (updated: {channel['last_updated']})")
sys.exit(0)
elif args.clear_cache:
    if args.clear_cache == "all":
@ -478,7 +492,42 @@ Examples:
force_refresh=args.refresh,
fuzzy_match=args.fuzzy_match,
fuzzy_threshold=args.fuzzy_threshold,
force_download=args.force,
)
elif args.channel_focus:
    # Download from a specific channel by name
    print(f"🎤 Looking up channel: {args.channel_focus}")
    channel_url = get_channel_url_by_name(args.channel_focus)
    if not channel_url:
        print(f"❌ Channel '{args.channel_focus}' not found in configuration")
        print("Available channels:")
        channel_urls = load_channels()
        for url in channel_urls:
            if "/@" in url:
                channel_name = url.split("/@")[1].split("/")[0]
                print(f"{channel_name}")
        sys.exit(1)

    if args.all_videos:
        # Download ALL videos from the channel (not just songlist matches)
        print(f"🎤 Downloading ALL videos from channel: {args.channel_focus} ({channel_url})")
        success = downloader.download_all_channel_videos(
            channel_url,
            force_refresh=args.refresh,
            force_download=args.force,
            limit=args.limit,
        )
    else:
        # Download only songlist matches from the channel
        print(f"🎤 Downloading from channel: {args.channel_focus} ({channel_url})")
        success = downloader.download_channel_videos(
            channel_url,
            force_refresh=args.refresh,
            fuzzy_match=args.fuzzy_match,
            fuzzy_threshold=args.fuzzy_threshold,
            force_download=args.force,
        )
elif args.songlist_only or args.songlist_focus:
    # Use provided file or default to channels configuration
    channel_urls = load_channels(args.file)


@ -185,6 +185,7 @@ class KaraokeDownloader:
force_refresh=False,
fuzzy_match=False,
fuzzy_threshold=DEFAULT_FUZZY_THRESHOLD,
force_download=False,
):
    """Download videos from a channel or playlist URL, respecting songlist-only and limit flags. Supports fuzzy matching."""
@ -213,111 +214,177 @@ class KaraokeDownloader:
print(f"📋 Found {len(available_videos)} manual videos")
# Process manual videos (skip yt-dlp)
return self._process_videos_for_download(available_videos, channel_name, force_refresh, fuzzy_match, fuzzy_threshold, force_download)

# Regular YouTube channel processing - USE TRACKING MANAGER CACHE
channel_name, channel_id = get_channel_info(url)
print(f"\n🎬 Downloading from channel: {channel_name} ({url})")

# Use tracking manager to get videos (with cache support)
available_videos = self.tracker.get_channel_video_list(
    url,
    str(self.yt_dlp_path),
    force_refresh=force_refresh
)
if not available_videos:
    print("⚠️ No videos found for this channel. Skipping.")
    return False
print(f"📋 Found {len(available_videos)} videos from channel")

# Process videos for download
return self._process_videos_for_download(available_videos, channel_name, force_refresh, fuzzy_match, fuzzy_threshold, force_download)
def download_all_channel_videos(
    self,
    url,
    force_refresh=False,
    force_download=False,
    limit=None,
):
    """Download ALL videos from a channel, skipping existing files and songs in songs.json."""
    # Check if this is a manual channel
    from karaoke_downloader.manual_video_manager import is_manual_channel, get_manual_channel_info, get_manual_videos_for_channel

    if is_manual_channel(url):
        channel_name, channel_id = get_manual_channel_info(url)
        print(f"\n🎬 Downloading ALL videos from manual channel: {channel_name} ({url})")

        # Load manual videos
        manual_videos = get_manual_videos_for_channel(channel_name)
        if not manual_videos:
            print("⚠️ No manual videos found. Skipping.")
            return False

        # Convert to the expected format
        available_videos = []
        for video in manual_videos:
            available_videos.append({
                "title": video.get("title", ""),
                "id": video.get("id", ""),
                "url": video.get("url", "")
            })

        print(f"📋 Found {len(available_videos)} manual videos")
        # Process all manual videos (skip songlist filtering)
        return self._process_all_videos_for_download(available_videos, channel_name, force_refresh, force_download, limit)

    # Regular YouTube channel processing - USE TRACKING MANAGER CACHE
    channel_name, channel_id = get_channel_info(url)
    print(f"\n🎬 Downloading ALL videos from channel: {channel_name} ({url})")

    # Use tracking manager to get videos (with cache support)
    available_videos = self.tracker.get_channel_video_list(
        url,
        str(self.yt_dlp_path),
        force_refresh=force_refresh
    )
    if not available_videos:
        print("⚠️ No videos found for this channel. Skipping.")
        return False
    print(f"📋 Found {len(available_videos)} videos from channel")

    # Process all videos for download (skip songlist filtering)
    return self._process_all_videos_for_download(available_videos, channel_name, force_refresh, force_download, limit)

def _process_all_videos_for_download(self, available_videos, channel_name, force_refresh=False, force_download=False, limit=None):
    """Process ALL videos for download (no songlist filtering)."""
    print(f"🔍 Processing {len(available_videos)} videos for download (no songlist filtering)...")

    # Load server songs and duplicates tracking for availability checking
    server_songs = load_server_songs()
    server_duplicates_tracking = load_server_duplicates_tracking()

    # Load songlist to check if songs are already in songs.json
    songlist = load_songlist(self.songlist_file_path)
    songlist_artists_titles = set()
    if songlist:
        for song in songlist:
            song_key = f"{song['artist']} - {song['title']}"
            songlist_artists_titles.add(song_key)

    # Process all videos, skipping existing files and songs in songs.json
    videos_to_download = []
    skipped_count = 0

    for video in available_videos:
        video_title = video["title"]
        video_id = video["id"]

        # Extract artist and title
        artist, extracted_title = self.channel_parser.extract_artist_title(video_title, channel_name)
        if not artist and not extracted_title:
            # Fallback: use the full title
            artist = ""
            extracted_title = video_title

        # Create filename
        filename = f"{artist} - {extracted_title}.mp4" if artist and extracted_title else f"{video_title}.mp4"
        output_path = self.downloads_dir / channel_name / filename

        # Check if file already exists
        if output_path.exists() and not force_download:
            print(f" ⏭️ Skipping existing file: {filename}")
            skipped_count += 1
            continue

        # Check if song is already in songs.json
        song_key = f"{artist} - {extracted_title}"
        if song_key in songlist_artists_titles and not force_download:
            # Check if there's already an MP4 version in songs.json
            if song_key in server_songs:
                song_info = server_songs[song_key]
                if song_info.get("is_mp4", False):
                    print(f" ⏭️ Skipping song in songs.json (MP4): {song_key}")
                    skipped_count += 1
                    continue
                else:
                    print(f" 📥 Found MP3 version in songs.json, will download MP4: {song_key}")
                    # Continue to download the MP4 version
            else:
                print(f" 📥 Found song in songlist, will download MP4: {song_key}")
                # Continue to download the MP4 version

        # Check if already downloaded (tracking)
        if self.tracker.is_song_downloaded(artist, extracted_title, channel_name, video_id) and not force_download:
            print(f" ⏭️ Skipping already tracked: {song_key}")
            skipped_count += 1
            continue

        # Check if on server
        if not force_download and check_and_mark_server_duplicate(
            server_songs,
            server_duplicates_tracking,
            artist,
            extracted_title,
            video_title,
            channel_name,
        ):
            print(f" ⏭️ Skipping server duplicate: {song_key}")
            skipped_count += 1
            continue

        # Add to download list
        videos_to_download.append((video, artist, extracted_title, filename))

    print(f"📊 Found {len(videos_to_download)} videos to download (skipped {skipped_count})")

    # Apply limit if specified
    if limit is not None:
        original_count = len(videos_to_download)
        videos_to_download = videos_to_download[:limit]
        print(f"🎯 Limited to first {limit} videos (was {original_count} total)")

    if not videos_to_download:
        print("🎵 No new videos to download.")
        return True

    # Download videos using the download pipeline
    pipeline = DownloadPipeline(
        yt_dlp_path=str(self.yt_dlp_path),
        config=self.config,
@ -325,11 +392,11 @@ class KaraokeDownloader:
songlist_tracking=self.songlist_tracking,
tracker=self.tracker,
)
success_count = 0
for i, (video, artist, title, filename) in enumerate(videos_to_download, 1):
    print(f"⬇️ Downloading {i}/{len(videos_to_download)}: {artist} - {title}")
    if pipeline.execute_pipeline(
        video_id=video["id"],
        artist=artist,
@ -337,10 +404,13 @@ class KaraokeDownloader:
        channel_name=channel_name,
        video_title=video.get("title", ""),
    ):
        print(f"✅ Successfully downloaded: {artist} - {title}")
        success_count += 1
    else:
        print(f"❌ Failed to download: {artist} - {title}")

print(f"\n🎉 Download complete! {success_count}/{len(videos_to_download)} videos downloaded successfully")
return success_count > 0
def download_songlist_across_channels(
    self,
@ -587,463 +657,40 @@ class KaraokeDownloader:
)
return success
def _process_videos_for_download(self, available_videos, channel_name, force_refresh=False, fuzzy_match=False, fuzzy_threshold=DEFAULT_FUZZY_THRESHOLD, force_download=False):

def execute_download_plan_parallel(
self,
download_plan,
unmatched,
cache_file,
limit=None,
):
"""
Execute a download plan using parallel downloads.
Args:
download_plan: List of download items to process
unmatched: List of unmatched songs
cache_file: Path to cache file for progress tracking
limit: Optional limit on number of downloads
Returns:
tuple: (downloaded_count, success)
"""
if not self.enable_parallel_downloads:
# Fall back to sequential downloads
return execute_download_plan(
download_plan=download_plan,
unmatched=unmatched,
cache_file=cache_file,
config=self.config,
yt_dlp_path=self.yt_dlp_path,
downloads_dir=self.downloads_dir,
songlist_tracking=self.songlist_tracking,
limit=limit,
)
# Create parallel downloader
parallel_downloader = create_parallel_downloader(
yt_dlp_path=str(self.yt_dlp_path),
config=self.config,
downloads_dir=self.downloads_dir,
max_workers=self.parallel_workers,
songlist_tracking=self.songlist_tracking,
tracker=self.tracker,
)
# Convert download plan to tasks
tasks = []
total_to_download = limit if limit is not None else len(download_plan)
for idx, item in enumerate(download_plan):
if limit is not None and len(tasks) >= limit:
break
task = DownloadTask(
video_id=item["video_id"],
artist=item["artist"],
title=item["title"],
channel_name=item["channel_name"],
video_title=item.get("video_title"),
priority=len(download_plan) - idx, # Higher priority for earlier items
)
tasks.append(task)
print(f"🚀 Starting parallel downloads with {self.parallel_workers} workers...")
print(f"📋 Total tasks: {len(tasks)}")
# Add all tasks to the parallel downloader
parallel_downloader.add_download_tasks(tasks)
# Execute parallel downloads
results = parallel_downloader.execute_downloads(show_progress=True)
# Process results and update tracking
successful_downloads = 0
for result in results:
if result.success:
successful_downloads += 1
# Remove completed item from plan and update cache
for idx, item in enumerate(download_plan):
if (item["video_id"] == result.task.video_id and
item["artist"] == result.task.artist and
item["title"] == result.task.title):
download_plan.pop(idx)
break
# Update cache after each successful download
save_plan_cache(cache_file, download_plan, unmatched)
print(f"🎉 Parallel downloads completed!")
print(f" ✅ Successful: {successful_downloads}")
print(f" ❌ Failed: {len(results) - successful_downloads}")
# Delete cache if all items are complete
if len(download_plan) == 0:
cleanup_cache(cache_file)
return successful_downloads, True
def execute_latest_per_channel_parallel(
self,
channel_plans,
cache_file,
):
"""
Execute latest-per-channel downloads using parallel processing.
Args:
channel_plans: List of channel plans with videos to download
cache_file: Path to cache file for progress tracking
Returns:
bool: True if successful
"""
if not self.enable_parallel_downloads:
# Fall back to sequential processing
return self._execute_latest_per_channel_sequential(channel_plans, cache_file)
# Create parallel downloader
parallel_downloader = create_parallel_downloader(
yt_dlp_path=str(self.yt_dlp_path),
config=self.config,
downloads_dir=self.downloads_dir,
max_workers=self.parallel_workers,
songlist_tracking=self.songlist_tracking,
tracker=self.tracker,
)
# Convert all videos to tasks
all_tasks = []
task_to_channel_mapping = {} # Map task identifiers to their channel index
for channel_idx, channel_plan in enumerate(channel_plans):
channel_name = channel_plan["channel_name"]
videos = channel_plan["videos"]
for v_idx, video in enumerate(videos):
title = video["title"]
video_id = video["id"]
# Sanitize filename
safe_title = title
invalid_chars = ["?", ":", "*", '"', "<", ">", "|", "/", "\\"]
for char in invalid_chars:
safe_title = safe_title.replace(char, "")
safe_title = (
safe_title.replace("...", "")
.replace("..", "")
.replace(".", "")
.strip()
)
# Extract artist and title for tracking
artist, title_clean = self.channel_parser.extract_artist_title(title, channel_name)
task = DownloadTask(
video_id=video_id,
artist=artist,
title=title_clean,
channel_name=channel_name,
video_title=title,
priority=len(channel_plans) - channel_idx, # Higher priority for earlier channels
)
all_tasks.append(task)
# Use a unique identifier for the mapping
task_id = f"{video_id}_{channel_name}_{title_clean}"
task_to_channel_mapping[task_id] = channel_idx
print(f"🚀 Starting parallel latest-per-channel downloads with {self.parallel_workers} workers...")
print(f"📋 Total tasks: {len(all_tasks)} across {len(channel_plans)} channels")
# Add all tasks to the parallel downloader
parallel_downloader.add_download_tasks(all_tasks)
# Execute parallel downloads
results = parallel_downloader.execute_downloads(show_progress=True)
# Process results and update channel plans
successful_downloads = 0
channel_completion_status = {i: False for i in range(len(channel_plans))}
for result in results:
if result.success:
successful_downloads += 1
# Create the same task identifier used in the mapping
task_id = f"{result.task.video_id}_{result.task.channel_name}_{result.task.title}"
channel_idx = task_to_channel_mapping[task_id]
# Mark this video as completed in the channel plan
channel_plan = channel_plans[channel_idx]
for v_idx, video in enumerate(channel_plan["videos"]):
if (video["id"] == result.task.video_id and
video["title"] == result.task.video_title):
channel_plan["videos"].pop(v_idx)
break
# Check if channel is complete
if len(channel_plan["videos"]) == 0 and not channel_completion_status[channel_idx]:
channel_completion_status[channel_idx] = True
print(f" 🗑️ Channel {result.task.channel_name} completed and removed from plan cache.")
# Update cache with remaining videos
with open(cache_file, "w", encoding="utf-8") as f:
json.dump(
{
"timestamp": datetime.now().isoformat(),
"channel_plans": channel_plans,
},
f,
indent=2,
ensure_ascii=False,
)
print(f"🎉 Parallel latest-per-channel downloads completed!")
print(f" ✅ Successful: {successful_downloads}")
print(f" ❌ Failed: {len(results) - successful_downloads}")
# Delete cache if all channels are complete
all_complete = all(channel_completion_status.values())
if all_complete:
if cache_file.exists():
try:
cache_file.unlink()
print(f"🗑️ Deleted latest-per-channel plan cache after completion: {cache_file.name}")
except Exception as e:
print(f"⚠️ Could not delete latest-per-channel plan cache: {e}")
return True
def _execute_latest_per_channel_sequential(
self,
channel_plans,
cache_file,
):
"""
Execute latest-per-channel downloads sequentially (fallback method).
"""
total_channels = len(channel_plans)
for idx, channel_plan in enumerate(channel_plans):
channel_name = channel_plan["channel_name"]
channel_url = channel_plan["channel_url"]
videos = channel_plan["videos"]
print(
f"\n⬇️ Downloading {len(videos)} videos from channel {idx+1} of {total_channels}: {channel_name}"
)
for v_idx, video in enumerate(videos):
title = video["title"]
video_id = video["id"]
# Sanitize filename
safe_title = title
invalid_chars = ["?", ":", "*", '"', "<", ">", "|", "/", "\\"]
for char in invalid_chars:
safe_title = safe_title.replace(char, "")
safe_title = (
safe_title.replace("...", "")
.replace("..", "")
.replace(".", "")
.strip()
)
filename = f"{channel_name} - {safe_title}.mp4"
# Extract artist and title for tracking
artist, title_clean = self.channel_parser.extract_artist_title(title, channel_name)
print(
f" ({v_idx+1}/{len(videos)}) Processing: {artist} - {title_clean}"
)
# Use the new pipeline for consistent processing
pipeline = DownloadPipeline(
yt_dlp_path=str(self.yt_dlp_path),
config=self.config,
downloads_dir=self.downloads_dir,
songlist_tracking=self.songlist_tracking,
tracker=self.tracker,
)
if pipeline.execute_pipeline(
video_id=video_id,
artist=artist,
title=title_clean,
channel_name=channel_name,
video_title=title,
):
print(f" ✅ Successfully processed: {artist} - {title_clean}")
else:
print(f" ❌ Failed to process: {artist} - {title_clean}")
# After channel is done, remove it from the plan and update cache
channel_plans[idx]["videos"] = []
with open(cache_file, "w", encoding="utf-8") as f:
json.dump(
{
"timestamp": datetime.now().isoformat(),
"channel_plans": channel_plans,
},
f,
indent=2,
ensure_ascii=False,
)
print(
f" 🗑️ Channel {channel_name} completed and removed from plan cache."
)
# After all channels are done, delete the cache
if cache_file.exists():
try:
cache_file.unlink()
print(
f"🗑️ Deleted latest-per-channel plan cache after completion: {cache_file.name}"
)
except Exception as e:
print(f"⚠️ Could not delete latest-per-channel plan cache: {e}")
print(f"🎉 All latest videos downloaded for all channels!")
return True
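# The plan-cache key derived in download_latest_per_channel below is
# order-insensitive: sorting the URLs before hashing means the same set of
# channels always maps to the same cache file. A standalone sketch of that
# derivation (hypothetical helper; the method inlines this logic):
def _channels_cache_key(channel_urls):
    """Derive the 8-character channel-set hash used in the plan cache filename."""
    return hashlib.md5("|".join(sorted(channel_urls)).encode()).hexdigest()[:8]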
def download_latest_per_channel(
self,
channel_urls,
limit=5,
force_refresh_download_plan=False,
fuzzy_match=False,
fuzzy_threshold=DEFAULT_FUZZY_THRESHOLD,
force_download=False,
):
"""
Download the latest N videos from each channel in channel_urls.
- Pre-scan all channels for their latest N videos.
- Check against local songs file to avoid duplicates.
- Build a per-channel download plan and cache it.
- Resume robustly if interrupted (removes each channel from the plan as it completes).
- Deletes the plan cache when all channels are done.
"""
# Set force download flag
self.force_download = force_download
print(f"\n🎬 Downloading latest {limit} videos from each channel...")
# Load server songs for availability checking
server_songs = load_server_songs()
server_duplicates_tracking = load_server_duplicates_tracking()
plan_mode = "latest_per_channel"
# Include all parameters that affect the plan generation
plan_kwargs = {
"limit": limit,
"channels": len(channel_urls),
"fuzzy": fuzzy_match,
"threshold": fuzzy_threshold,
}
# Add channel URLs hash to ensure same channels = same cache
channels_hash = hashlib.md5(
"|".join(sorted(channel_urls)).encode()
).hexdigest()[:8]
plan_kwargs["channels_hash"] = channels_hash
cache_file = get_download_plan_cache_file(plan_mode, **plan_kwargs)
use_cache = False
if not force_refresh_download_plan and cache_file.exists():
try:
with open(cache_file, "r", encoding="utf-8") as f:
plan_data = json.load(f)
cache_time = datetime.fromisoformat(plan_data.get("timestamp"))
if datetime.now() - cache_time < timedelta(
days=DEFAULT_CACHE_EXPIRATION_DAYS
):
print(
f"🗂️ Using cached latest-per-channel plan from {cache_time} ({cache_file.name})."
)
channel_plans = plan_data["channel_plans"]
use_cache = True
except Exception as e:
print(f"⚠️ Could not load latest-per-channel plan cache: {e}")
if not use_cache:
print("\n🔎 Pre-scanning all channels for latest videos...")
channel_plans = []
total_found = 0
total_filtered = 0
total_marked = 0
for channel_url in channel_urls:
channel_name, channel_id = get_channel_info(channel_url)
print(f"\n🚦 Starting channel: {channel_name} ({channel_url})")
available_videos = self.tracker.get_channel_video_list(
channel_url, yt_dlp_path=str(self.yt_dlp_path), force_refresh=False
)
print(
f" → Found {len(available_videos)} total videos for this channel."
)
# Pre-filter: known duplicate keys as a set for O(1) membership checks
known_duplicate_keys = set(server_duplicates_tracking)
# Pre-filter videos to exclude known duplicates before processing
pre_filtered_videos = []
for video in available_videos:
artist, title = self.channel_parser.extract_artist_title(video["title"], channel_name)
song_key = create_song_key(artist, title)
if song_key not in known_duplicate_keys:
pre_filtered_videos.append(video)
print(
f" → After pre-filtering: {len(pre_filtered_videos)} videos not previously marked as duplicates."
)
# Process videos until we reach the limit for this channel
filtered_videos = []
videos_checked = 0
for video in pre_filtered_videos:
if len(filtered_videos) >= limit:
break # We have enough videos for this channel
videos_checked += 1
artist, title = self.channel_parser.extract_artist_title(video["title"], channel_name)
# Check if should skip this song during planning phase
should_skip, reason, filtered_count = self._should_skip_song(
artist,
title,
channel_name,
video["id"],
video["title"],
server_songs,
server_duplicates_tracking,
)
if should_skip:
total_filtered += 1
if reason == "on server":
total_marked += filtered_count
continue
filtered_videos.append(video)
print(
f" → After processing: {len(filtered_videos)} videos to download (checked {videos_checked} videos, filtered out {videos_checked - len(filtered_videos)} already on server)."
)
total_found += len(filtered_videos)
channel_plans.append(
{
"channel_name": channel_name,
"channel_url": channel_url,
"videos": filtered_videos,
}
)
print(
f"\n📊 Summary: {total_found} videos to download across {len(channel_plans)} channels (filtered out {total_filtered} already on server, marked {total_marked} new duplicates for future skipping)."
)
plan_data = {
"timestamp": datetime.now().isoformat(),
"channel_plans": channel_plans,
}
with open(cache_file, "w", encoding="utf-8") as f:
json.dump(plan_data, f, indent=2, ensure_ascii=False)
print(f"🗂️ Saved new latest-per-channel plan cache: {cache_file.name}")
# --- Download phase ---
return self.execute_latest_per_channel_parallel(channel_plans, cache_file)
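# The resume logic above relies on a small JSON cache of the shape
# {"timestamp": <isoformat>, "channel_plans": [...]}. A simplified, standalone
# sketch of the freshness check (hypothetical helper; the method above also
# guards against load errors and uses DEFAULT_CACHE_EXPIRATION_DAYS):
def _load_fresh_plan_cache(cache_file, max_age_days=7):
    """Return channel_plans if cache_file exists and is fresh enough, else None."""
    if not cache_file.exists():
        return None
    with open(cache_file, "r", encoding="utf-8") as f:
        plan_data = json.load(f)
    cache_time = datetime.fromisoformat(plan_data["timestamp"])
    if datetime.now() - cache_time < timedelta(days=max_age_days):
        return plan_data["channel_plans"]
    return None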
def _process_videos_for_download(self, available_videos, channel_name, force_refresh=False, fuzzy_match=False, fuzzy_threshold=DEFAULT_FUZZY_THRESHOLD):
"""Process videos for download (used for both manual and regular channels).""" """Process videos for download (used for both manual and regular channels)."""
songlist = load_songlist(self.songlist_file_path)
# For manual videos with force_download, bypass songlist filtering
if force_download and channel_name == "@ManualVideos":
print(f"💪 Force mode enabled - downloading all {len(available_videos)} manual videos")
# Download all videos directly without songlist filtering
success_count = 0
for i, video in enumerate(available_videos, 1):
title = video["title"]
video_id = video["id"]
print(f"\n⬇️ Downloading {i}/{len(available_videos)}: {title}")
print(f" 🎬 Video: {title} ({channel_name})")
# Create filename from title
artist, extracted_title = self.channel_parser.extract_artist_title(title, channel_name)
if not artist and not extracted_title:
# Fallback: use the full title as filename
filename = sanitize_filename("", title)
else:
filename = sanitize_filename(artist, extracted_title)
# Download the video
success = self._download_single_video(video, channel_name, filename, force_download=True)
if success:
success_count += 1
print(f"\n✅ Downloaded {success_count}/{len(available_videos)} manual videos")
return success_count > 0
# Regular songlist-based processing
if not songlist:
print("⚠️ No songlist loaded. Skipping.")
return False
# ... (matching and download code elided; these methods belong to class KaraokeDownloader) ...
print(f"\n🎉 Download complete! {success_count}/{len(matches)} videos downloaded successfully") print(f"\n🎉 Download complete! {success_count}/{len(matches)} videos downloaded successfully")
return success_count > 0 return success_count > 0
def _download_single_video(self, video, channel_name, filename, force_download=False):
"""Download a single video using the download pipeline."""
video_id = video["id"]
video_title = video["title"]
# Extract artist and title for tracking
artist, extracted_title = self.channel_parser.extract_artist_title(video_title, channel_name)
if not artist and not extracted_title:
# Fallback: use the full title
artist = ""
extracted_title = video_title
output_path = self.downloads_dir / channel_name / filename
# Use the download pipeline
pipeline = DownloadPipeline(
yt_dlp_path=str(self.yt_dlp_path),
config=self.config,
downloads_dir=self.downloads_dir,
songlist_tracking=self.songlist_tracking,
tracker=self.tracker,
)
success = pipeline.execute_pipeline(
video_id=video_id,
artist=artist,
title=extracted_title,
channel_name=channel_name,
video_title=video_title
)
if success:
print(f"✅ Successfully downloaded: {video_title}")
else:
print(f"❌ Failed to download: {video_title}")
return success
def reset_songlist_all():
"""Delete all files tracked in songlist_tracking.json, clear songlist_tracking.json, and remove songlist songs from karaoke_tracking.json."""