Signed-off-by: mbrucedogs <mbrucedogs@gmail.com>

This commit is contained in:
mbrucedogs 2025-07-24 19:44:08 -05:00
parent b473dbf329
commit 5580207c6c
4 changed files with 361 additions and 136 deletions

PRD.md

@@ -65,6 +65,8 @@ python download_karaoke.py --clear-cache SingKingKaraoke
- ✅ Automatic cleanup of extra yt-dlp files
- ✅ **Reset/clear channel tracking and files via CLI**
- ✅ **Clear channel cache via CLI**
- ✅ **Download plan pre-scan and caching**: Before downloading, the tool pre-scans all channels for songlist matches, builds a download plan, and prints stats. The plan is cached for 1 day in `data/download_plan_cache.json` for fast resuming and reliability. Use `--force-download-plan` to force a refresh.
- ✅ **Latest-per-channel download**: Download the latest N videos from each channel in a single batch, with a per-channel download plan, robust resume, and a unique plan cache. Use `--latest-per-channel` and `--limit N`.
---
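The 1-day cache freshness check described above can be sketched as follows. This is a minimal illustration rather than the shipped code; the cache path and the JSON keys (`timestamp`, `download_plan`) mirror the description above but are assumptions:

```python
import json
from datetime import datetime, timedelta
from pathlib import Path

# Path taken from the PRD description; the real tool may vary it per mode.
PLAN_CACHE = Path("data/download_plan_cache.json")

def load_plan_if_fresh(cache_file=PLAN_CACHE, max_age=timedelta(days=1),
                       force_refresh=False):
    """Return the cached download plan, or None if stale, missing, or forced."""
    if force_refresh or not cache_file.exists():
        return None
    try:
        data = json.loads(cache_file.read_text(encoding="utf-8"))
        if datetime.now() - datetime.fromisoformat(data["timestamp"]) < max_age:
            return data["download_plan"]
    except (OSError, ValueError, KeyError):
        pass  # treat an unreadable or malformed cache as stale
    return None
```

A `None` result signals the caller to re-scan all channels and write a fresh cache, which is the behavior `--force-download-plan` triggers unconditionally.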
@@ -114,6 +116,8 @@ KaroakeVideoDownloader/
- `--reset-channel <CHANNEL_NAME>`: **Reset all tracking and files for a channel**
- `--reset-songlist`: **When used with --reset-channel, also reset songlist songs for this channel**
- `--clear-cache <CHANNEL_ID|all>`: **Clear channel video cache for a specific channel or all**
- `--force-download-plan`: **Force refresh the download plan cache (re-scan all channels for matches)**
- `--latest-per-channel`: **Download the latest N videos from each channel (use with --limit)**
---
@@ -124,6 +128,8 @@ KaroakeVideoDownloader/
- **ID3 Tagging:** Artist/title extracted from video title and embedded in MP4 files.
- **Cleanup:** Extra files from yt-dlp (e.g., `.info.json`) are automatically removed after download.
- **Reset/Clear:** Use `--reset-channel` to reset all tracking and files for a channel (optionally including songlist songs with `--reset-songlist`). Use `--clear-cache` to clear cached video lists for a channel or all channels.
- **Download plan pre-scan:** Before downloading, the tool scans all channels for songlist matches, builds a download plan, and prints stats (matches, unmatched, per-channel breakdown). The plan is cached for 1 day and reused unless `--force-download-plan` is set.
- **Latest-per-channel plan:** Download the latest N videos from each channel, with a per-channel plan and robust resume. Each channel is removed from the plan as it completes. The plan cache is deleted when all channels are done.
---
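The songlist matching described above boils down to comparing normalized artist/title pairs. A rough sketch of that comparison, using a stand-in `normalize_title` (the real helper lives in `karaoke_downloader.songlist_manager` and may normalize differently):

```python
import re

def normalize_title(s):
    """Stand-in normalizer: strip punctuation, collapse spaces, lowercase."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", s)).strip().lower()

def is_match(video_title, artist, title):
    """True if a channel video title matches a songlist entry.

    Mirrors the two-way check in the downloader: either the parsed
    'Artist - Title' halves match individually, or the whole video
    title matches the combined 'Artist - Title' string.
    """
    if " - " in video_title:
        v_artist, v_title = video_title.split(" - ", 1)
        if (normalize_title(v_artist) == normalize_title(artist)
                and normalize_title(v_title) == normalize_title(title)):
            return True
    return normalize_title(video_title) == normalize_title(f"{artist} - {title}")
```

Because normalization strips punctuation, variants like `Adele: Hello` still match the songlist entry `Adele` / `Hello`.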


@@ -12,6 +12,7 @@ A Python-based Windows CLI tool to download karaoke videos from YouTube channels
- 🧹 **Automatic Cleanup**: Removes extra yt-dlp files
- 📈 **Real-Time Progress**: Detailed console and log output
- 🧹 **Reset/Clear Channel**: Reset all tracking and files for a channel, or clear channel cache via CLI
- 🗂️ **Latest-per-channel download**: Download the latest N videos from each channel in a single batch, with a per-channel download plan, robust resume, and a unique plan cache. Use `--latest-per-channel` and `--limit N`.
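The robust resume noted above works by rewriting the plan cache after each channel completes, so an interrupted run picks up with only the unfinished channels. A simplified sketch, with a caller-supplied `download` callback standing in for the real per-video download logic:

```python
import json
from datetime import datetime
from pathlib import Path

def run_channel_plans(cache_file, channel_plans, download):
    """Process each channel's video list, persisting progress per channel.

    `download(channel_name, video)` is a stand-in for the real download
    step; a channel whose video list is already empty is treated as
    completed, which is what makes resuming after an interrupt cheap.
    """
    for plan in channel_plans:
        for video in plan["videos"]:
            download(plan["channel_name"], video)
        plan["videos"] = []  # mark this channel as done
        cache_file.write_text(
            json.dumps({"timestamp": datetime.now().isoformat(),
                        "channel_plans": channel_plans}),
            encoding="utf-8",
        )
    cache_file.unlink(missing_ok=True)  # all channels done: drop the cache
```

Deleting the cache only after every channel drains guarantees a fresh pre-scan on the next full run, while a crash mid-run leaves a cache that skips finished channels.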
## 📋 Requirements
- **Windows 10/11**
@@ -56,6 +57,11 @@ python download_karaoke.py --limit 5
python download_karaoke.py --resolution 1080p
```
### Download Latest N Videos Per Channel
```bash
python download_karaoke.py --file data/channels.txt --latest-per-channel --limit 5
```
### **Reset/Start Over for a Channel**
```bash
python download_karaoke.py --reset-channel SingKingKaraoke
@@ -135,6 +141,7 @@ KaroakeVideoDownloader/
- `--reset-channel <CHANNEL_NAME>`: **Reset all tracking and files for a channel**
- `--reset-songlist`: **When used with --reset-channel, also reset songlist songs for this channel**
- `--clear-cache <CHANNEL_ID|all>`: **Clear channel video cache for a specific channel or all**
- `--latest-per-channel`: **Download the latest N videos from each channel (use with --limit)**
## 📝 Example Usage
```bash


@@ -35,6 +35,8 @@ Examples:
parser.add_argument('--reset-songlist', action='store_true', help='When used with --reset-channel, also reset songlist songs for this channel')
parser.add_argument('--reset-songlist-all', action='store_true', help='Reset all songlist tracking and delete all songlist-downloaded files (global)')
parser.add_argument('--version', '-v', action='version', version='Karaoke Playlist Downloader v1.0')
parser.add_argument('--force-download-plan', action='store_true', help='Force refresh the download plan cache (re-scan all channels for matches)')
parser.add_argument('--latest-per-channel', action='store_true', help='Download the latest N videos from each channel (use with --limit)')
args = parser.parse_args()
yt_dlp_path = Path("downloader/yt-dlp.exe")
@@ -158,9 +160,17 @@ Examples:
with open(args.file, "r", encoding="utf-8") as f:
channel_urls = [line.strip() for line in f if line.strip() and not line.strip().startswith("#")]
limit = args.limit if args.limit else None
force_refresh_download_plan = args.force_download_plan if hasattr(args, 'force_download_plan') else False
success = downloader.download_songlist_across_channels(channel_urls, limit=limit, force_refresh_download_plan=force_refresh_download_plan)
elif args.url:
success = downloader.download_channel_videos(args.url, force_refresh=args.refresh)
elif args.latest_per_channel and args.file:
# Read all channel URLs from file
with open(args.file, "r", encoding="utf-8") as f:
channel_urls = [line.strip() for line in f if line.strip() and not line.strip().startswith("#")]
limit = args.limit if args.limit else 5
force_refresh_download_plan = args.force_download_plan if hasattr(args, 'force_download_plan') else False
success = downloader.download_latest_per_channel(channel_urls, limit=limit, force_refresh_download_plan=force_refresh_download_plan)
else:
parser.print_help()
sys.exit(1)


@@ -4,7 +4,7 @@ import subprocess
import json
import re
from pathlib import Path
from datetime import datetime, timedelta
from karaoke_downloader.tracking_manager import TrackingManager, SongStatus, FormatType
from karaoke_downloader.id3_utils import add_id3_tags, extract_artist_title
from karaoke_downloader.songlist_manager import (
@@ -14,6 +14,7 @@ from karaoke_downloader.songlist_manager import (
)
from karaoke_downloader.youtube_utils import get_channel_info, get_playlist_info
import logging
import hashlib
DATA_DIR = Path("data")
@@ -249,7 +250,71 @@ class KaraokeDownloader:
print(f"🎉 All post-processing complete for: {output_path}")
return True
def build_download_plan(self, channel_urls, undownloaded):
"""
For each song in undownloaded, scan all channels for a match.
Return (download_plan, unmatched_songs):
- download_plan: list of dicts {artist, title, channel_name, channel_url, video_id, video_title}
- unmatched_songs: list of songs not found in any channel
"""
plan = []
unmatched = []
channel_match_counts = {}
for channel_url in channel_urls:
channel_name, channel_id = get_channel_info(channel_url)
print(f"\n🚦 Starting channel: {channel_name} ({channel_url})")
available_videos = self.tracker.get_channel_video_list(
channel_url,
yt_dlp_path=str(self.yt_dlp_path),
force_refresh=False
)
matches_this_channel = 0
for song in undownloaded:
artist, title = song['artist'], song['title']
found = False
for video in available_videos:
v_artist, v_title = extract_artist_title(video['title'])
if (normalize_title(v_artist) == normalize_title(artist) and normalize_title(v_title) == normalize_title(title)) or \
(normalize_title(video['title']) == normalize_title(f"{artist} - {title}")):
# Only add if not already in plan (first channel wins)
if not any(p['artist'] == artist and p['title'] == title for p in plan):
plan.append({
'artist': artist,
'title': title,
'channel_name': channel_name,
'channel_url': channel_url,
'video_id': video['id'],
'video_title': video['title']
})
matches_this_channel += 1
found = True
break
# Don't break here; keep looking for all matches in this channel
channel_match_counts[channel_name] = matches_this_channel
print(f" → Found {matches_this_channel} songlist matches in this channel.")
# Now find unmatched songs
for song in undownloaded:
if not any(p['artist'] == song['artist'] and p['title'] == song['title'] for p in plan):
unmatched.append(song)
# Print summary table
print("\n📊 Channel match summary:")
for channel, count in channel_match_counts.items():
print(f" {channel}: {count} matches")
print(f" TOTAL: {sum(channel_match_counts.values())} matches across {len(channel_match_counts)} channels.")
return plan, unmatched
def get_download_plan_cache_file(self, mode, **kwargs):
"""Generate a unique cache filename based on mode and key parameters."""
parts = [f"plan_{mode}"]
for k, v in sorted(kwargs.items()):
parts.append(f"{k}{v}")
base = "_".join(parts)
# Hash for safety if string is long
if len(base) > 60:
base = base[:40] + "_" + hashlib.md5(base.encode()).hexdigest()
return Path(f"data/{base}.json")
def download_songlist_across_channels(self, channel_urls, limit=None, force_refresh_download_plan=False):
""" """
For each song in the songlist, try each channel in order and download from the first channel where it is found. For each song in the songlist, try each channel in order and download from the first channel where it is found.
Download up to 'limit' songs, skipping any that cannot be found, until the limit is reached or all possible matches are exhausted. Download up to 'limit' songs, skipping any that cannot be found, until the limit is reached or all possible matches are exhausted.
@@ -274,38 +339,66 @@ class KaraokeDownloader:
if not undownloaded:
print("🎵 All songlist songs already downloaded.")
return True
# Removed per-song printout for cleaner output
# print("🔍 Songs to search for:")
# for song in undownloaded:
# print(f" - {song['artist']} - {song['title']}")
# --- Download plan cache logic ---
plan_mode = "songlist"
plan_kwargs = {"limit": limit or "all", "channels": len(channel_urls)}
cache_file = self.get_download_plan_cache_file(plan_mode, **plan_kwargs)
use_cache = False
if not force_refresh_download_plan and cache_file.exists():
try:
with open(cache_file, 'r', encoding='utf-8') as f:
cache_data = json.load(f)
cache_time = datetime.fromisoformat(cache_data.get('timestamp'))
if datetime.now() - cache_time < timedelta(days=1):
print(f"🗂️ Using cached download plan from {cache_time} ({cache_file.name}).")
download_plan = cache_data['download_plan']
unmatched = cache_data['unmatched']
use_cache = True
except Exception as e:
print(f"⚠️ Could not load download plan cache: {e}")
if not use_cache:
print("\n🔎 Pre-scanning channels for matches...")
download_plan, unmatched = self.build_download_plan(channel_urls, undownloaded)
if download_plan:
cache_data = {
'timestamp': datetime.now().isoformat(),
'download_plan': download_plan,
'unmatched': unmatched
}
with open(cache_file, 'w', encoding='utf-8') as f:
json.dump(cache_data, f, indent=2, ensure_ascii=False)
print(f"🗂️ Saved new download plan cache: {cache_file.name}")
else:
if cache_file.exists():
cache_file.unlink()
print(f"🗂️ No matches found, not saving download plan cache.")
print(f"\n📊 Download plan ready: {len(download_plan)} songs will be downloaded.")
print(f"{len(unmatched)} songs could not be found in any channel.")
if unmatched:
print("Unmatched songs:")
for song in unmatched[:10]:
print(f" - {song['artist']} - {song['title']}") print(f" - {song['artist']} - {song['title']}")
if len(unmatched) > 10:
print(f" ...and {len(unmatched)-10} more.")
# --- Download phase ---
downloaded_count = 0
total_to_download = limit if limit is not None else len(download_plan)
for idx, item in enumerate(download_plan):
if limit is not None and downloaded_count >= limit:
break
artist = item['artist']
title = item['title']
channel_name = item['channel_name']
channel_url = item['channel_url']
video_id = item['video_id']
video_title = item['video_title']
print(f"\n⬇️ Downloading {idx+1} of {total_to_download}: {artist} - {title} (from {channel_name})")
# --- Existing download logic here, using channel_name, video_id, etc. ---
# (Copy the download logic from the previous loop, using these variables)
# Create a shorter, safer filename - do this ONCE and use consistently
safe_title = title.replace("(From ", "").replace(")", "").replace(" - ", " ").replace(":", "").replace("'", "").replace('"', "")
safe_artist = artist.replace("'", "").replace('"', "")
@@ -326,7 +419,7 @@ class KaraokeDownloader:
output_path = self.downloads_dir / channel_name / filename
output_path.parent.mkdir(parents=True, exist_ok=True)
print(f"⬇️ Downloading: {artist} - {title} -> {output_path}")
video_url = f"https://www.youtube.com/watch?v={video_id}"
dlp_cmd = [
str(self.yt_dlp_path),
"--no-check-certificates",
@@ -393,22 +486,131 @@ class KaraokeDownloader:
print(f"✅ Downloaded and tracked: {artist} - {title}")
print(f"🎉 All post-processing complete for: {output_path}")
downloaded_count += 1
# After each download, if this was the last song, delete the cache
if idx + 1 == total_to_download:
if cache_file.exists():
try:
cache_file.unlink()
print(f"🗑️ Deleted download plan cache after last song downloaded: {cache_file.name}")
except Exception as e:
print(f"⚠️ Could not delete download plan cache: {e}")
print(f"🎉 Downloaded {downloaded_count} songlist songs.")
print(f"📊 Summary: Processed {len(channel_urls)} channels, found {downloaded_count} songs, {len(unmatched)} songs not found.")
# Delete the download plan cache if all planned downloads are done
if cache_file.exists():
try:
cache_file.unlink()
print(f"🗑️ Deleted download plan cache after completion: {cache_file.name}")
except Exception as e:
print(f"⚠️ Could not delete download plan cache: {e}")
return True
def download_latest_per_channel(self, channel_urls, limit=5, force_refresh_download_plan=False):
"""
Download the latest N videos from each channel in channel_urls.
- Pre-scan all channels for their latest N videos.
- Build a per-channel download plan and cache it.
- Resume robustly if interrupted (removes each channel from the plan as it completes).
- Deletes the plan cache when all channels are done.
"""
plan_mode = "latest_per_channel"
plan_kwargs = {"limit": limit, "channels": len(channel_urls)}
cache_file = self.get_download_plan_cache_file(plan_mode, **plan_kwargs)
use_cache = False
if not force_refresh_download_plan and cache_file.exists():
try:
with open(cache_file, 'r', encoding='utf-8') as f:
plan_data = json.load(f)
cache_time = datetime.fromisoformat(plan_data.get('timestamp'))
if datetime.now() - cache_time < timedelta(days=1):
print(f"🗂️ Using cached latest-per-channel plan from {cache_time} ({cache_file.name}).")
channel_plans = plan_data['channel_plans']
use_cache = True
except Exception as e:
print(f"⚠️ Could not load latest-per-channel plan cache: {e}")
if not use_cache:
print("\n🔎 Pre-scanning all channels for latest videos...")
channel_plans = []
for channel_url in channel_urls:
channel_name, channel_id = get_channel_info(channel_url)
print(f"\n🚦 Starting channel: {channel_name} ({channel_url})")
available_videos = self.tracker.get_channel_video_list(
channel_url,
yt_dlp_path=str(self.yt_dlp_path),
force_refresh=False
)
# Sort by upload order (assume yt-dlp returns in order, or sort by id if available)
latest_videos = available_videos[:limit]
print(f" → Found {len(latest_videos)} latest videos for this channel.")
channel_plans.append({
'channel_name': channel_name,
'channel_url': channel_url,
'videos': latest_videos
})
plan_data = {
'timestamp': datetime.now().isoformat(),
'channel_plans': channel_plans
}
with open(cache_file, 'w', encoding='utf-8') as f:
json.dump(plan_data, f, indent=2, ensure_ascii=False)
print(f"🗂️ Saved new latest-per-channel plan cache: {cache_file.name}")
# --- Download phase ---
total_channels = len(channel_plans)
for idx, channel_plan in enumerate(channel_plans):
channel_name = channel_plan['channel_name']
channel_url = channel_plan['channel_url']
videos = channel_plan['videos']
print(f"\n⬇️ Downloading {len(videos)} videos from channel {idx+1} of {total_channels}: {channel_name}")
for v_idx, video in enumerate(videos):
title = video['title']
video_id = video['id']
# Sanitize filename
safe_title = title
invalid_chars = ['?', ':', '*', '"', '<', '>', '|', '/', '\\']
for char in invalid_chars:
safe_title = safe_title.replace(char, "")
safe_title = safe_title.replace("...", "").replace("..", "").replace(".", "").strip()
filename = f"{channel_name} - {safe_title}.mp4"
if len(filename) > 100:
filename = f"{channel_name[:30]} - {safe_title[:60]}.mp4"
output_path = self.downloads_dir / channel_name / filename
output_path.parent.mkdir(parents=True, exist_ok=True)
print(f" ({v_idx+1}/{len(videos)}) Downloading: {title} -> {output_path}")
video_url = f"https://www.youtube.com/watch?v={video_id}"
dlp_cmd = [
str(self.yt_dlp_path),
"--no-check-certificates",
"--ignore-errors",
"--no-warnings",
"-o", str(output_path),
"-f", self.config["download_settings"]["format"],
video_url
]
try:
result = subprocess.run(dlp_cmd, capture_output=True, text=True, check=True)
print(f" ✅ yt-dlp completed successfully")
except subprocess.CalledProcessError as e:
print(f" ❌ yt-dlp failed with exit code {e.returncode}")
print(f" ❌ yt-dlp stderr: {e.stderr}")
continue
if not output_path.exists() or output_path.stat().st_size == 0:
print(f" ❌ Download failed or file is empty: {output_path}")
continue
add_id3_tags(output_path, title, channel_name)
print(f" ✅ Downloaded and tagged: {title}")
# After channel is done, remove it from the plan and update cache
channel_plans[idx]['videos'] = []
with open(cache_file, 'w', encoding='utf-8') as f:
json.dump({'timestamp': datetime.now().isoformat(), 'channel_plans': channel_plans}, f, indent=2, ensure_ascii=False)
print(f" 🗑️ Channel {channel_name} completed and removed from plan cache.")
# After all channels are done, delete the cache
if cache_file.exists():
try:
cache_file.unlink()
print(f"🗑️ Deleted latest-per-channel plan cache after completion: {cache_file.name}")
except Exception as e:
print(f"⚠️ Could not delete latest-per-channel plan cache: {e}")
print(f"🎉 All latest videos downloaded for all channels!")
return True
def _is_valid_mp4(self, file_path):