Signed-off-by: mbrucedogs <mbrucedogs@gmail.com>

mbrucedogs 2025-07-25 08:35:12 -05:00
parent c7627b4073
commit 84088b4424
14 changed files with 1719 additions and 485 deletions

37
PRD.md

@ -1,8 +1,27 @@
# 🎤 Karaoke Video Downloader PRD (v2.2) # 🎤 Karaoke Video Downloader PRD (v3.1)
## ✅ Overview ## ✅ Overview
A Python-based Windows CLI tool to download karaoke videos from YouTube channels/playlists using `yt-dlp.exe`, with advanced tracking, songlist prioritization, and flexible configuration. A Python-based Windows CLI tool to download karaoke videos from YouTube channels/playlists using `yt-dlp.exe`, with advanced tracking, songlist prioritization, and flexible configuration. The codebase has been refactored into a modular architecture for improved maintainability and separation of concerns.
---
## 🏗️ Architecture
The codebase has been refactored into focused modules:
- **`fuzzy_matcher.py`**: Fuzzy matching logic and similarity functions
- **`download_planner.py`**: Download plan building and channel scanning (optimized)
- **`cache_manager.py`**: Cache operations and file I/O management
- **`video_downloader.py`**: Core video download execution and orchestration
- **`channel_manager.py`**: Channel and file management operations
- **`downloader.py`**: Main orchestrator and CLI interface
### Benefits of Modular Architecture:
- **Single Responsibility**: Each module has a focused purpose
- **Testability**: Individual components can be tested separately
- **Maintainability**: Easier to find and fix issues
- **Reusability**: Components can be used independently
- **Robustness**: Better error handling and interruption recovery
--- ---
@ -12,6 +31,7 @@ A Python-based Windows CLI tool to download karaoke videos from YouTube channels
- Avoid re-downloading the same videos (robust tracking). - Avoid re-downloading the same videos (robust tracking).
- Prioritize and track a custom songlist across channels. - Prioritize and track a custom songlist across channels.
- Allow flexible, user-friendly configuration. - Allow flexible, user-friendly configuration.
- Provide robust interruption handling and progress recovery.
--- ---
@ -71,6 +91,8 @@ python download_karaoke.py --clear-cache SingKingKaraoke
- ✅ **Deduplication across channels**: Ensures the same song (by artist + normalized title) is not downloaded more than once, even if it appears in multiple channels. Tracks unique keys and skips duplicates. - ✅ **Deduplication across channels**: Ensures the same song (by artist + normalized title) is not downloaded more than once, even if it appears in multiple channels. Tracks unique keys and skips duplicates.
- ✅ **Fuzzy matching**: Optionally use fuzzy string matching for songlist-to-video matching with configurable threshold (0-100, default 85). Uses rapidfuzz if available, falls back to difflib. - ✅ **Fuzzy matching**: Optionally use fuzzy string matching for songlist-to-video matching with configurable threshold (0-100, default 85). Uses rapidfuzz if available, falls back to difflib.
- ✅ **Default channel file**: If no --file is specified for songlist-only or latest-per-channel modes, automatically uses data/channels.txt as the default channel list. - ✅ **Default channel file**: If no --file is specified for songlist-only or latest-per-channel modes, automatically uses data/channels.txt as the default channel list.
- ✅ **Robust interruption handling**: Progress is saved after each download, and files are checked for existence before downloading to prevent re-downloads if the process is interrupted.
- ✅ **Optimized scanning performance**: High-performance channel scanning with O(n×m) complexity, pre-processed lookups, and early termination for faster matching of large songlists and channels.
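The interruption handling described above (save progress after each download, check for an existing file before downloading) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `load_progress`, `save_progress`, `run_plan`, the `"DOWNLOADED"` marker, and the plan item fields `key` and `filename` are all hypothetical names chosen for the example.

```python
import json
from pathlib import Path


def load_progress(progress_file: Path) -> dict:
    """Return saved progress, or an empty dict if nothing was saved yet."""
    if progress_file.exists():
        return json.loads(progress_file.read_text(encoding="utf-8"))
    return {}


def save_progress(progress_file: Path, progress: dict) -> None:
    """Write to a temporary file, then swap it in, so an interruption
    in the middle of a write cannot leave a truncated progress file."""
    tmp = progress_file.with_name(progress_file.name + ".tmp")
    tmp.write_text(json.dumps(progress, indent=2), encoding="utf-8")
    tmp.replace(progress_file)


def run_plan(plan, out_dir: Path, progress_file: Path, download) -> int:
    """Download each planned item, skipping finished ones.

    `download` is a caller-supplied callable (hypothetical signature:
    download(item, target_path)). Returns the number of new downloads.
    """
    progress = load_progress(progress_file)
    downloaded = 0
    for item in plan:
        target = out_dir / item["filename"]
        # Skip anything recorded as done or already present on disk.
        if progress.get(item["key"]) == "DOWNLOADED" or target.exists():
            continue
        download(item, target)
        progress[item["key"]] = "DOWNLOADED"
        save_progress(progress_file, progress)  # persist after every download
        downloaded += 1
    return downloaded
```

Because progress is written after every item, a run that is stopped partway through only repeats the item that was in flight, and the on-disk existence check covers the case where a file finished downloading but the process stopped before the progress file was updated.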
--- ---
@ -78,8 +100,13 @@ python download_karaoke.py --clear-cache SingKingKaraoke
``` ```
KaroakeVideoDownloader/ KaroakeVideoDownloader/
├── karaoke_downloader/ # All core Python code and utilities ├── karaoke_downloader/ # All core Python code and utilities
│ ├── downloader.py # Main downloader class │ ├── downloader.py # Main orchestrator and CLI interface
│ ├── cli.py # CLI entry point │ ├── cli.py # CLI entry point
│ ├── fuzzy_matcher.py # Fuzzy matching logic and similarity functions
│ ├── download_planner.py # Download plan building and channel scanning (optimized)
│ ├── cache_manager.py # Cache operations and file I/O management
│ ├── video_downloader.py # Core video download execution and orchestration
│ ├── channel_manager.py # Channel and file management operations
│ ├── id3_utils.py # ID3 tagging helpers │ ├── id3_utils.py # ID3 tagging helpers
│ ├── songlist_manager.py # Songlist logic │ ├── songlist_manager.py # Songlist logic
│ ├── youtube_utils.py # YouTube helpers │ ├── youtube_utils.py # YouTube helpers
@ -140,6 +167,8 @@ KaroakeVideoDownloader/
- **Deduplication across channels:** Tracks unique song keys (artist + normalized title) to ensure the same song is not downloaded from multiple channels, even if it appears in more than one channel's video list. - **Deduplication across channels:** Tracks unique song keys (artist + normalized title) to ensure the same song is not downloaded from multiple channels, even if it appears in more than one channel's video list.
- **Fuzzy matching:** Uses string similarity algorithms to find approximate matches between songlist entries and video titles, tolerating minor differences, typos, or extra words like "Karaoke" or "Official Video". - **Fuzzy matching:** Uses string similarity algorithms to find approximate matches between songlist entries and video titles, tolerating minor differences, typos, or extra words like "Karaoke" or "Official Video".
- **Default channel file:** For songlist-only and latest-per-channel modes, if no --file is specified, automatically uses data/channels.txt as the default channel list, reducing the need to specify the file path repeatedly. - **Default channel file:** For songlist-only and latest-per-channel modes, if no --file is specified, automatically uses data/channels.txt as the default channel list, reducing the need to specify the file path repeatedly.
- **Robust interruption handling:** Progress is saved after each download, and files are checked for existence before downloading to prevent re-downloads if the process is interrupted.
- **Optimized scanning algorithm:** High-performance channel scanning with O(n×m) complexity, pre-processed song lookups using sets and dictionaries, and early termination for faster matching of large songlists and channels.
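One way to picture the pre-processed lookup, cross-channel deduplication, and early termination described above is the sketch below. It is an assumption-laden illustration rather than the contents of `download_planner.py`: `song_key` and `build_plan` are hypothetical names, the title simplification (dropping parenthesised suffixes) is a stand-in for whatever `normalize_title` really does, and video titles are assumed to follow an `Artist - Title` pattern.

```python
import re


def song_key(artist: str, title: str) -> str:
    """Hypothetical key: lowercased artist plus simplified title."""
    simplified = re.sub(r"\s*\([^)]*\)", "", title).strip().lower()
    return f"{artist.strip().lower()}_{simplified}"


def build_plan(songlist, channels):
    """Match songlist entries against channel video titles.

    songlist: list of {"artist": ..., "title": ...} dicts
    channels: dict mapping channel name -> list of video titles
    Returns (plan, unmatched_songs).
    """
    # Pre-process the songlist once so each video costs one dict lookup.
    wanted = {song_key(s["artist"], s["title"]): s for s in songlist}
    plan = []
    for channel, videos in channels.items():
        for video_title in videos:
            artist, sep, title = video_title.partition(" - ")
            if not sep:
                continue  # title does not follow "Artist - Title"
            key = song_key(artist, title)
            # pop() removes the song once found, so a later channel
            # offering the same song is skipped (deduplication).
            if wanted.pop(key, None) is not None:
                plan.append({"key": key, "channel": channel,
                             "video_title": video_title})
            if not wanted:
                return plan, []  # early termination: everything matched
    return plan, list(wanted.values())
```

In this sketch an exact match needs only a single lookup per video; fuzzy matching would still have to compare a video title against the remaining candidates, which is where pairwise cost comes back in.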
--- ---
@ -149,3 +178,5 @@ KaroakeVideoDownloader/
- [ ] Download scheduling and retry logic - [ ] Download scheduling and retry logic
- [ ] More granular status reporting - [ ] More granular status reporting
- [ ] Parallel downloads for improved speed - [ ] Parallel downloads for improved speed
- [ ] Unit tests for all modules
- [ ] Integration tests for end-to-end workflows


@ -12,11 +12,25 @@ A Python-based Windows CLI tool to download karaoke videos from YouTube channels
- 🧹 **Automatic Cleanup**: Removes extra yt-dlp files - 🧹 **Automatic Cleanup**: Removes extra yt-dlp files
- 📈 **Real-Time Progress**: Detailed console and log output - 📈 **Real-Time Progress**: Detailed console and log output
- 🧹 **Reset/Clear Channel**: Reset all tracking and files for a channel, or clear channel cache via CLI - 🧹 **Reset/Clear Channel**: Reset all tracking and files for a channel, or clear channel cache via CLI
- 🗂️ **Latest-per-channel download**: Download the latest N videos from each channel in a single batch, with a per-channel download plan, robust resume, and unique plan cache. Use --latest-per-channel and --limit N. - 🗂️ **Latest-per-channel download**: Download the latest N videos from each channel in a single batch, with server deduplication, fuzzy matching support, per-channel download plan, robust resume, and unique plan cache. Use --latest-per-channel and --limit N.
- 🧩 **Fuzzy Matching**: Optionally use fuzzy string matching for songlist-to-video matching (with --fuzzy-match, requires rapidfuzz for best results) - 🧩 **Fuzzy Matching**: Optionally use fuzzy string matching for songlist-to-video matching (with --fuzzy-match, requires rapidfuzz for best results)
- ⚡ **Fast Mode with Early Exit**: When a limit is set, scans channels and songs in order, downloads immediately when a match is found, and stops as soon as the limit is reached with successful downloads - ⚡ **Fast Mode with Early Exit**: When a limit is set, scans channels and songs in order, downloads immediately when a match is found, and stops as soon as the limit is reached with successful downloads
- 🔄 **Deduplication Across Channels**: Ensures the same song is not downloaded from multiple channels, even if it appears in more than one channel's video list - 🔄 **Deduplication Across Channels**: Ensures the same song is not downloaded from multiple channels, even if it appears in more than one channel's video list
- 📋 **Default Channel File**: Automatically uses data/channels.txt as the default channel list for songlist modes (no need to specify --file every time) - 📋 **Default Channel File**: Automatically uses data/channels.txt as the default channel list for songlist modes (no need to specify --file every time)
- 🛡️ **Robust Interruption Handling**: Progress is saved after each download, preventing re-downloads if the process is interrupted
- ⚡ **Optimized Scanning**: High-performance channel scanning with O(n×m) complexity, pre-processed lookups, and early termination for faster matching
- 🏷️ **Server Duplicates Tracking**: Automatically checks against local songs.json file and marks duplicates for future skipping, preventing re-downloads of songs already on the server
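The server duplicates check could look something like the sketch below. The record fields (`artist`, `title`, `video_title`, `channel`, `marked_at`, `reason`) mirror the entries in `data/server_duplicates_tracking.json` from this commit; everything else is an assumption, including the function name `check_and_mark_server_duplicate`, the lowercase `artist_title` key construction, and the idea that the server's `songs.json` has already been reduced to a set of such keys.

```python
from datetime import datetime


def check_and_mark_server_duplicate(server_keys, duplicates,
                                    artist, title, video_title, channel):
    """Return True if the song should be skipped as a server duplicate.

    server_keys: set of keys for songs already on the server (assumed shape)
    duplicates:  dict of previously marked duplicates, updated in place
    """
    key = f"{artist.lower()}_{title.lower()}"  # hypothetical key format
    if key in duplicates:
        return True  # marked on an earlier run, skip without re-checking
    if key in server_keys:
        duplicates[key] = {
            "artist": artist,
            "title": title,
            "video_title": video_title,
            "channel": channel,
            "marked_at": datetime.now().isoformat(),
            "reason": "already_on_server",
        }
        return True
    return False
```

Keeping the marked entries in their own file means later runs can skip known duplicates immediately, and clearing that file forces every song to be re-checked against the server.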
## 🏗️ Architecture
The codebase has been refactored into a modular architecture for better maintainability and separation of concerns:
- **`fuzzy_matcher.py`**: Fuzzy matching logic and similarity functions
- **`download_planner.py`**: Download plan building and channel scanning (optimized)
- **`cache_manager.py`**: Cache operations and file I/O management
- **`server_manager.py`**: Server songs loading and server duplicates tracking
- **`video_downloader.py`**: Core video download execution and orchestration
- **`channel_manager.py`**: Channel and file management operations
- **`downloader.py`**: Main orchestrator and CLI interface
## 📋 Requirements ## 📋 Requirements
- **Windows 10/11** - **Windows 10/11**
@ -48,6 +62,11 @@ python download_karaoke.py --songlist-only --limit 10 --fuzzy-match --fuzzy-thre
python download_karaoke.py --latest-per-channel --limit 5 python download_karaoke.py --latest-per-channel --limit 5
``` ```
### Download Latest N Videos Per Channel (with fuzzy matching)
```bash
python download_karaoke.py --latest-per-channel --limit 5 --fuzzy-match --fuzzy-threshold 85
```
### Prioritize Songlist in Download Queue ### Prioritize Songlist in Download Queue
```bash ```bash
python download_karaoke.py --songlist-priority python download_karaoke.py --songlist-priority
@ -101,14 +120,21 @@ python download_karaoke.py --clear-cache all
## 🛠️ Tracking & Caching ## 🛠️ Tracking & Caching
- **data/karaoke_tracking.json**: Tracks all downloads, statuses, and formats - **data/karaoke_tracking.json**: Tracks all downloads, statuses, and formats
- **data/songlist_tracking.json**: Tracks global songlist download progress - **data/songlist_tracking.json**: Tracks global songlist download progress
- **data/server_duplicates_tracking.json**: Tracks songs found to be duplicates on the server for future skipping
- **data/channel_cache.json**: Caches channel video lists for performance - **data/channel_cache.json**: Caches channel video lists for performance
## 📂 Folder Structure ## 📂 Folder Structure
``` ```
KaroakeVideoDownloader/ KaroakeVideoDownloader/
├── karaoke_downloader/ # All core Python code and utilities ├── karaoke_downloader/ # All core Python code and utilities
│ ├── downloader.py # Main downloader class │ ├── downloader.py # Main orchestrator and CLI interface
│ ├── cli.py # CLI entry point │ ├── cli.py # CLI entry point
│ ├── fuzzy_matcher.py # Fuzzy matching logic and similarity functions
│ ├── download_planner.py # Download plan building and channel scanning (optimized)
│ ├── cache_manager.py # Cache operations and file I/O management
│ ├── server_manager.py # Server songs loading and server duplicates tracking
│ ├── video_downloader.py # Core video download execution and orchestration
│ ├── channel_manager.py # Channel and file management operations
│ ├── id3_utils.py # ID3 tagging helpers │ ├── id3_utils.py # ID3 tagging helpers
│ ├── songlist_manager.py # Songlist logic │ ├── songlist_manager.py # Songlist logic
│ ├── youtube_utils.py # YouTube helpers │ ├── youtube_utils.py # YouTube helpers
@ -147,6 +173,7 @@ KaroakeVideoDownloader/
- `--reset-channel <CHANNEL_NAME>`: **Reset all tracking and files for a channel** - `--reset-channel <CHANNEL_NAME>`: **Reset all tracking and files for a channel**
- `--reset-songlist`: **When used with --reset-channel, also reset songlist songs for this channel** - `--reset-songlist`: **When used with --reset-channel, also reset songlist songs for this channel**
- `--clear-cache <CHANNEL_ID|all>`: **Clear channel video cache for a specific channel or all** - `--clear-cache <CHANNEL_ID|all>`: **Clear channel video cache for a specific channel or all**
- `--clear-server-duplicates`: **Clear server duplicates tracking (allows re-checking songs against server)**
- `--latest-per-channel`: **Download the latest N videos from each channel (use with --limit)** - `--latest-per-channel`: **Download the latest N videos from each channel (use with --limit)**
- `--fuzzy-match`: Enable fuzzy matching for songlist-to-video matching (uses rapidfuzz if available) - `--fuzzy-match`: Enable fuzzy matching for songlist-to-video matching (uses rapidfuzz if available)
- `--fuzzy-threshold <N>`: Fuzzy match threshold (0-100, default 85) - `--fuzzy-threshold <N>`: Fuzzy match threshold (0-100, default 85)
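As a rough illustration of how a 0-100 threshold with a rapidfuzz-or-difflib fallback can work, consider the sketch below. It is not taken from `fuzzy_matcher.py`: the helper names `similarity` and `is_match` are hypothetical, and the real module may use a different scorer or additional normalization.

```python
try:
    # Preferred backend when installed; fuzz.ratio returns a 0-100 score.
    from rapidfuzz import fuzz

    def similarity(a: str, b: str) -> float:
        return fuzz.ratio(a.lower(), b.lower())
except ImportError:
    # Standard-library fallback; ratio() is 0-1, so scale it to 0-100.
    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100


def is_match(song: str, video_title: str, threshold: float = 85) -> bool:
    """True when the similarity score reaches the threshold."""
    return similarity(song, video_title) >= threshold
```

With a plain ratio like this, `is_match("Sum 41 - Fat Lip", "Sum 41 - Fat Lip (Karaoke Version)")` falls below the default threshold of 85 because of the extra suffix, so stripping words such as "Karaoke Version" before comparing (or lowering the threshold) matters in practice. The two backends can also produce slightly different scores for the same pair.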
@ -166,6 +193,7 @@ python download_karaoke.py --songlist-only
python download_karaoke.py --reset-channel SingKingKaraoke python download_karaoke.py --reset-channel SingKingKaraoke
python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist python download_karaoke.py --reset-channel SingKingKaraoke --reset-songlist
python download_karaoke.py --clear-cache all python download_karaoke.py --clear-cache all
python download_karaoke.py --clear-server-duplicates
``` ```
## 🏷️ ID3 Tagging ## 🏷️ ID3 Tagging


@ -0,0 +1,338 @@
{
"little richard_long tall sally": {
"artist": "Little Richard",
"title": "Long Tall Sally",
"video_title": "Little Richard - Long Tall Sally (Karaoke)",
"channel": "KaraokeOnVEVO",
"marked_at": "2025-07-24T22:11:47.738475",
"reason": "already_on_server"
},
"lobo_me and you and a dog named boo": {
"artist": "Lobo",
"title": "Me And You And A Dog Named Boo",
"video_title": "Lobo - Me And You And A Dog Named Boo (Karaoke)",
"channel": "KaraokeOnVEVO",
"marked_at": "2025-07-25T07:26:38.285721",
"reason": "already_on_server"
},
"royal teens_short shorts": {
"artist": "Royal Teens",
"title": "Short Shorts",
"video_title": "Royal Teens - Short Shorts (Karaoke)",
"channel": "KaraokeOnVEVO",
"marked_at": "2025-07-25T07:26:38.286537",
"reason": "already_on_server"
},
"traveling wilburys_end of the line": {
"artist": "Traveling Wilburys",
"title": "End Of The Line",
"video_title": "Traveling Wilburys - End Of The Line (Karaoke Version)",
"channel": "SingKingKaraoke",
"marked_at": "2025-07-25T07:52:51.215910",
"reason": "already_on_server"
},
"george jones_a picture of me (without you)": {
"artist": "George Jones",
"title": "A Picture Of Me (Without You)",
"video_title": "George Jones - A Picture Of Me (Without You) (Karaoke)",
"channel": "KaraokeOnVEVO",
"marked_at": "2025-07-25T08:02:17.408739",
"reason": "already_on_server"
},
"lola young_messy": {
"artist": "Lola Young",
"title": "Messy",
"video_title": "Lola Young - Messy (Karaoke Version)",
"channel": "StingrayKaraoke",
"marked_at": "2025-07-25T08:02:17.429626",
"reason": "already_on_server"
},
"gigi perez_sailor song": {
"artist": "Gigi Perez",
"title": "Sailor Song",
"video_title": "Gigi Perez - Sailor Song (Karaoke Version)",
"channel": "StingrayKaraoke",
"marked_at": "2025-07-25T08:02:17.431932",
"reason": "already_on_server"
},
"sum 41_fat lip": {
"artist": "Sum 41",
"title": "Fat Lip",
"video_title": "Sum 41 - Fat Lip (Karaoke Version)",
"channel": "StingrayKaraoke",
"marked_at": "2025-07-25T08:02:17.434162",
"reason": "already_on_server"
},
"the verve_bitter sweet symphony": {
"artist": "The Verve",
"title": "Bitter Sweet Symphony",
"video_title": "The Verve - Bitter Sweet Symphony (Karaoke Version)",
"channel": "StingrayKaraoke",
"marked_at": "2025-07-25T08:02:17.436617",
"reason": "already_on_server"
},
"lionel richie_all night long": {
"artist": "Lionel Richie",
"title": "All Night Long",
"video_title": "Lionel Richie - All Night Long (Karaoke Version)",
"channel": "StingrayKaraoke",
"marked_at": "2025-07-25T08:02:17.440237",
"reason": "already_on_server"
},
"kenny rogers_the gambler": {
"artist": "Kenny Rogers",
"title": "The Gambler",
"video_title": "Kenny Rogers - The Gambler (Karaoke Version)",
"channel": "StingrayKaraoke",
"marked_at": "2025-07-25T08:06:39.419631",
"reason": "already_on_server"
},
"rod stewart_maggie may": {
"artist": "Rod Stewart",
"title": "Maggie May",
"video_title": "Rod Stewart - Maggie May (Karaoke Version)",
"channel": "StingrayKaraoke",
"marked_at": "2025-07-25T08:06:39.422101",
"reason": "already_on_server"
},
"tom jones_it's not unusual": {
"artist": "Tom Jones",
"title": "It's Not Unusual",
"video_title": "Tom Jones - It's Not Unusual (Karaoke Version)",
"channel": "StingrayKaraoke",
"marked_at": "2025-07-25T08:06:39.424447",
"reason": "already_on_server"
},
"morgan wallen_i got better": {
"artist": "Morgan Wallen",
"title": "I Got Better",
"video_title": "Morgan Wallen - I Got Better (Karaoke Version)",
"channel": "SingKingKaraoke",
"marked_at": "2025-07-25T08:33:25.590485",
"reason": "already_on_server"
},
"ella langley_weren't for the wind": {
"artist": "Ella Langley",
"title": "weren't for the wind",
"video_title": "Ella Langley - weren't for the wind (Karaoke Version)",
"channel": "SingKingKaraoke",
"marked_at": "2025-07-25T08:33:25.593194",
"reason": "already_on_server"
},
"bell biv devoe_poison": {
"artist": "Bell Biv Devoe",
"title": "Poison",
"video_title": "Bell Biv Devoe - Poison (Karaoke Version)",
"channel": "SingKingKaraoke",
"marked_at": "2025-07-25T08:33:25.607400",
"reason": "already_on_server"
},
"morgan wallen_superman": {
"artist": "Morgan Wallen",
"title": "Superman",
"video_title": "Morgan Wallen - Superman (Karaoke Version)",
"channel": "SingKingKaraoke",
"marked_at": "2025-07-25T08:33:25.620085",
"reason": "already_on_server"
},
"the fray_look after you": {
"artist": "The Fray",
"title": "Look After You",
"video_title": "The Fray - Look After You (Karaoke Version)",
"channel": "SingKingKaraoke",
"marked_at": "2025-07-25T08:33:25.634792",
"reason": "already_on_server"
},
"justin bieber_one less lonely girl": {
"artist": "Justin Bieber",
"title": "One Less Lonely Girl",
"video_title": "Justin Bieber - One Less Lonely Girl (Karaoke Version)",
"channel": "SingKingKaraoke",
"marked_at": "2025-07-25T08:33:25.639304",
"reason": "already_on_server"
},
"the beatles_all my loving": {
"artist": "The Beatles",
"title": "All My Loving",
"video_title": "The Beatles - All My Loving (Karaoke)",
"channel": "KaraokeOnVEVO",
"marked_at": "2025-07-25T08:33:25.743418",
"reason": "already_on_server"
},
"james taylor_sweet baby james": {
"artist": "James Taylor",
"title": "Sweet Baby James",
"video_title": "James Taylor - Sweet Baby James (Karaoke)",
"channel": "KaraokeOnVEVO",
"marked_at": "2025-07-25T08:33:25.746800",
"reason": "already_on_server"
},
"phil collins_sussudio": {
"artist": "Phil Collins",
"title": "Sussudio",
"video_title": "Phil Collins - Sussudio (Karaoke)",
"channel": "KaraokeOnVEVO",
"marked_at": "2025-07-25T08:33:25.749990",
"reason": "already_on_server"
},
"avril lavigne_things i'll never say": {
"artist": "Avril Lavigne",
"title": "Things I'll Never Say",
"video_title": "Avril Lavigne - Things I'll Never Say (Karaoke)",
"channel": "KaraokeOnVEVO",
"marked_at": "2025-07-25T08:33:25.766538",
"reason": "already_on_server"
},
"def leppard_bringin' on the heartbreak": {
"artist": "Def Leppard",
"title": "Bringin' On The Heartbreak",
"video_title": "Def Leppard - Bringin' On The Heartbreak (Karaoke)",
"channel": "KaraokeOnVEVO",
"marked_at": "2025-07-25T08:33:25.793929",
"reason": "already_on_server"
},
"no doubt_rock steady": {
"artist": "No Doubt",
"title": "Rock Steady",
"video_title": "No Doubt - Rock Steady (Karaoke)",
"channel": "KaraokeOnVEVO",
"marked_at": "2025-07-25T08:33:25.797153",
"reason": "already_on_server"
},
"ultravox_vienna": {
"artist": "Ultravox",
"title": "Vienna",
"video_title": "Ultravox - Vienna (Karaoke)",
"channel": "KaraokeOnVEVO",
"marked_at": "2025-07-25T08:33:25.798966",
"reason": "already_on_server"
},
"nickelback_far away": {
"artist": "Nickelback",
"title": "Far Away",
"video_title": "Nickelback - Far Away (Karaoke Version)",
"channel": "StingrayKaraoke",
"marked_at": "2025-07-25T08:33:25.835135",
"reason": "already_on_server"
},
"lana del rey_diet mountain dew": {
"artist": "Lana Del Rey",
"title": "Diet Mountain Dew",
"video_title": "Lana Del Rey - Diet Mountain Dew (Karaoke Version)",
"channel": "StingrayKaraoke",
"marked_at": "2025-07-25T08:33:25.837998",
"reason": "already_on_server"
},
"poison_every rose has its thorn": {
"artist": "Poison",
"title": "Every Rose Has Its Thorn",
"video_title": "Poison - Every Rose Has Its Thorn (Karaoke Version)",
"channel": "StingrayKaraoke",
"marked_at": "2025-07-25T08:33:25.841689",
"reason": "already_on_server"
},
"adele_hometown glory": {
"artist": "Adele",
"title": "Hometown Glory",
"video_title": "Adele - Hometown Glory (Karaoke Version)",
"channel": "StingrayKaraoke",
"marked_at": "2025-07-25T08:33:25.850667",
"reason": "already_on_server"
},
"lorde_green light": {
"artist": "Lorde",
"title": "Green Light",
"video_title": "Lorde - Green Light (Karaoke Version)",
"channel": "StingrayKaraoke",
"marked_at": "2025-07-25T08:33:25.856011",
"reason": "already_on_server"
},
"the isley brothers_shout": {
"artist": "The Isley Brothers",
"title": "Shout",
"video_title": "The Isley Brothers - Shout (Karaoke Version)",
"channel": "StingrayKaraoke",
"marked_at": "2025-07-25T08:33:25.861753",
"reason": "already_on_server"
},
"tate mcrae_sports car": {
"artist": "Tate McRae",
"title": "Sports Car",
"video_title": "Tate McRae - Sports Car (Karaoke Version)",
"channel": "StingrayKaraoke",
"marked_at": "2025-07-25T08:33:25.864819",
"reason": "already_on_server"
},
"myles smith_stargazing": {
"artist": "Myles Smith",
"title": "Stargazing",
"video_title": "Myles Smith - Stargazing (Karaoke Version)",
"channel": "StingrayKaraoke",
"marked_at": "2025-07-25T08:33:25.876345",
"reason": "already_on_server"
},
"belinda carlisle_heaven is a place on earth": {
"artist": "Belinda Carlisle",
"title": "Heaven Is A Place On Earth",
"video_title": "Belinda Carlisle - Heaven Is A Place On Earth (Karaoke Version)",
"channel": "StingrayKaraoke",
"marked_at": "2025-07-25T08:33:25.883470",
"reason": "already_on_server"
},
"r.e.m._losing my religion": {
"artist": "R.E.M.",
"title": "Losing My Religion",
"video_title": "R.E.M. - Losing My Religion (Karaoke Version)",
"channel": "StingrayKaraoke",
"marked_at": "2025-07-25T08:33:25.888733",
"reason": "already_on_server"
},
"bad bunny_dtmf": {
"artist": "Bad Bunny",
"title": "DtMF",
"video_title": "Bad Bunny - DtMF (Karaoke Version)",
"channel": "StingrayKaraoke",
"marked_at": "2025-07-25T08:33:25.898975",
"reason": "already_on_server"
},
"lady gaga_judas": {
"artist": "Lady Gaga",
"title": "Judas",
"video_title": "Lady Gaga - Judas (Karaoke Version)",
"channel": "StingrayKaraoke",
"marked_at": "2025-07-25T08:33:25.900680",
"reason": "already_on_server"
},
"lisa_money": {
"artist": "Lisa",
"title": "Money",
"video_title": "Lisa - Money (Karaoke Version)",
"channel": "StingrayKaraoke",
"marked_at": "2025-07-25T08:33:25.902196",
"reason": "already_on_server"
},
"alex warren_ordinary": {
"artist": "Alex Warren",
"title": "Ordinary",
"video_title": "Alex Warren - Ordinary (Karaoke Version)",
"channel": "StingrayKaraoke",
"marked_at": "2025-07-25T08:33:25.906505",
"reason": "already_on_server"
},
"nickelback_how you remind me": {
"artist": "Nickelback",
"title": "How You Remind Me",
"video_title": "Nickelback - How You Remind Me (Karaoke Version)",
"channel": "StingrayKaraoke",
"marked_at": "2025-07-25T08:33:25.908105",
"reason": "already_on_server"
},
"green day_american idiot": {
"artist": "Green Day",
"title": "American Idiot",
"video_title": "Green Day - American Idiot (Karaoke Version)",
"channel": "StingrayKaraoke",
"marked_at": "2025-07-25T08:33:25.909641",
"reason": "already_on_server"
}
}


@ -0,0 +1,76 @@
"""
Cache management utilities for download plans.
Handles caching, loading, and cleanup of download plan data.
"""
import json
import hashlib
from pathlib import Path
from datetime import datetime, timedelta

# Constants
DEFAULT_CACHE_EXPIRATION_DAYS = 1
DEFAULT_CACHE_FILENAME_LENGTH_LIMIT = 200  # Increased from 60
DEFAULT_CACHE_FILENAME_PREFIX_LENGTH = 100  # Increased from 40

def get_download_plan_cache_file(mode, **kwargs):
    """Generate a unique cache filename based on mode and key parameters."""
    parts = [f"plan_{mode}"]
    # Handle parameters in a more readable way
    for k, v in sorted(kwargs.items()):
        if k == "channels_hash":
            # Use a shorter version of the hash for readability
            parts.append(f"hash{v[:8]}")
        else:
            parts.append(f"{k}{v}")
    base = "_".join(parts)
    # Hash for safety if string is still too long
    if len(base) > DEFAULT_CACHE_FILENAME_LENGTH_LIMIT:
        base = base[:DEFAULT_CACHE_FILENAME_PREFIX_LENGTH] + "_" + hashlib.md5(base.encode()).hexdigest()[:8]
    return Path(f"data/{base}.json")

def load_cached_plan(cache_file, max_age_days=DEFAULT_CACHE_EXPIRATION_DAYS):
    """Load a cached download plan if it exists and is not expired."""
    if not cache_file.exists():
        return None, None
    try:
        with open(cache_file, 'r', encoding='utf-8') as f:
            cache_data = json.load(f)
        cache_time = datetime.fromisoformat(cache_data.get('timestamp'))
        if datetime.now() - cache_time < timedelta(days=max_age_days):
            print(f"🗂️ Using cached download plan from {cache_time} ({cache_file.name}).")
            return cache_data['download_plan'], cache_data['unmatched']
    except Exception as e:
        print(f"⚠️ Could not load download plan cache: {e}")
    return None, None

def save_plan_cache(cache_file, download_plan, unmatched):
    """Save a download plan to cache."""
    if download_plan:
        cache_data = {
            'timestamp': datetime.now().isoformat(),
            'download_plan': download_plan,
            'unmatched': unmatched
        }
        with open(cache_file, 'w', encoding='utf-8') as f:
            json.dump(cache_data, f, indent=2, ensure_ascii=False)
        print(f"🗂️ Saved new download plan cache: {cache_file.name}")
    else:
        if cache_file.exists():
            cache_file.unlink()
        print(f"🗂️ No matches found, not saving download plan cache.")

def delete_plan_cache(cache_file):
    """Delete a download plan cache file."""
    if cache_file.exists():
        try:
            cache_file.unlink()
            print(f"🗑️ Deleted download plan cache: {cache_file.name}")
        except Exception as e:
            print(f"⚠️ Could not delete download plan cache: {e}")


@ -0,0 +1,93 @@
import os
from pathlib import Path
from karaoke_downloader.songlist_manager import (
    save_songlist_tracking, is_songlist_song_downloaded, normalize_title
)

def reset_channel_downloads(tracker, songlist_tracking, songlist_tracking_file, channel_name, reset_songlist=False, delete_files=False):
    """
    Reset all tracking and optionally files for a channel.
    If reset_songlist is False, songlist songs are preserved (tracking and files).
    If reset_songlist is True, songlist songs for this channel are also reset/deleted.
    """
    print(f"\n🔄 Resetting channel: {channel_name} (reset_songlist={reset_songlist}, delete_files={delete_files})")
    # Find channel_id from channel_name
    channel_id = None
    for pid, playlist in tracker.data.get('playlists', {}).items():
        if playlist['name'] == channel_name or pid == channel_name:
            channel_id = pid
            break
    if not channel_id:
        print(f"❌ Channel '{channel_name}' not found in tracking.")
        return
    # Get all songs for this channel
    songs_to_reset = []
    for song_id, song in tracker.data.get('songs', {}).items():
        if song['playlist_id'] == channel_id:
            # Check if this is a songlist song
            artist, title = song.get('artist', ''), song.get('title', song.get('name', ''))
            key = f"{artist.lower()}_{normalize_title(title)}"
            is_songlist = key in songlist_tracking
            if is_songlist and not reset_songlist:
                continue  # skip songlist songs if not resetting them
            songs_to_reset.append((song_id, song, is_songlist))
    # Reset tracking and optionally delete files
    files_preserved = 0
    files_deleted = 0
    for song_id, song, is_songlist in songs_to_reset:
        # Remove from main tracking
        tracker.data['songs'][song_id]['status'] = 'NOT_DOWNLOADED'
        tracker.data['songs'][song_id]['formats'] = {}
        tracker.data['songs'][song_id]['last_error'] = ''
        tracker.data['songs'][song_id]['download_attempts'] = 0
        tracker.data['songs'][song_id]['last_updated'] = None
        # Remove from songlist tracking if needed
        if is_songlist and reset_songlist:
            artist, title = song.get('artist', ''), song.get('title', song.get('name', ''))
            key = f"{artist.lower()}_{normalize_title(title)}"
            if key in songlist_tracking:
                del songlist_tracking[key]
        # Delete file if requested
        if delete_files:
            file_path = song.get('file_path')
            if file_path:
                try:
                    p = Path(file_path)
                    if p.exists():
                        p.unlink()
                        files_deleted += 1
                    else:
                        files_preserved += 1
                except Exception as e:
                    print(f"⚠️ Could not delete file {file_path}: {e}")
    # Remove all songlist_tracking entries for this channel if reset_songlist is True
    if reset_songlist:
        keys_to_remove = [k for k, v in songlist_tracking.items() if v.get('channel') == channel_name]
        for k in keys_to_remove:
            del songlist_tracking[k]
    # Save changes
    tracker.force_save()
    save_songlist_tracking(songlist_tracking, str(songlist_tracking_file))
    print(f"✅ Reset {len(songs_to_reset)} songs for channel '{channel_name}'.")
    if delete_files:
        print(f" Files deleted: {files_deleted}, files preserved: {files_preserved}")
    if not reset_songlist:
        print(f" Songlist songs were preserved.")

def download_from_file(self, file_path, force_refresh=False):
    file = Path(file_path)
    if not file.exists():
        print(f"❌ File not found: {file_path}")
        return False
    with open(file, "r", encoding="utf-8") as f:
        urls = [line.strip() for line in f if line.strip() and not line.strip().startswith("#")]
    if not urls:
        print(f"❌ No URLs found in {file_path}")
        return False
    all_success = True
    for url in urls:
        print(f"\n➡️ Processing: {url}")
        success = self.download_channel_videos(url, force_refresh=force_refresh)
        if not success:
            all_success = False
    return all_success


@ -4,6 +4,12 @@ from pathlib import Path
from karaoke_downloader.downloader import KaraokeDownloader from karaoke_downloader.downloader import KaraokeDownloader
import os import os
# Constants
DEFAULT_FUZZY_THRESHOLD = 85
DEFAULT_LATEST_PER_CHANNEL_LIMIT = 5
DEFAULT_DISPLAY_LIMIT = 10
DEFAULT_CACHE_DURATION_HOURS = 24
def main(): def main():
parser = argparse.ArgumentParser( parser = argparse.ArgumentParser(
description="Karaoke Video Downloader - Download YouTube playlists and channel videos for karaoke", description="Karaoke Video Downloader - Download YouTube playlists and channel videos for karaoke",
@ -35,6 +41,7 @@ Examples:
     parser.add_argument('--reset-channel', metavar='CHANNEL_NAME', help='Reset all tracking and files for a channel')
     parser.add_argument('--reset-songlist', action='store_true', help='When used with --reset-channel, also reset songlist songs for this channel')
     parser.add_argument('--reset-songlist-all', action='store_true', help='Reset all songlist tracking and delete all songlist-downloaded files (global)')
+    parser.add_argument('--clear-server-duplicates', action='store_true', help='Clear server duplicates tracking (allows re-checking songs against server)')
     parser.add_argument('--version', '-v', action='version', version='Karaoke Playlist Downloader v1.0')
     parser.add_argument('--force-download-plan', action='store_true', help='Force refresh the download plan cache (re-scan all channels for matches)')
     parser.add_argument('--latest-per-channel', action='store_true', help='Download the latest N videos from each channel (use with --limit)')
@ -101,6 +108,13 @@ Examples:
         print('✅ All songlist tracking and files have been reset.')
         sys.exit(0)

+    if args.clear_server_duplicates:
+        from karaoke_downloader.server_manager import save_server_duplicates_tracking
+        save_server_duplicates_tracking({})
+        print('✅ Server duplicates tracking has been cleared.')
+        print(' Songs will be re-checked against the server on next run.')
+        sys.exit(0)
+
     if args.status:
         stats = downloader.tracker.get_statistics()
         print("🎤 Karaoke Downloader Status")
@ -169,7 +183,7 @@ Examples:
         limit = args.limit if args.limit else None
         force_refresh_download_plan = args.force_download_plan if hasattr(args, 'force_download_plan') else False
         fuzzy_match = args.fuzzy_match if hasattr(args, 'fuzzy_match') else False
-        fuzzy_threshold = args.fuzzy_threshold if hasattr(args, 'fuzzy_threshold') else 90
+        fuzzy_threshold = args.fuzzy_threshold if hasattr(args, 'fuzzy_threshold') else DEFAULT_FUZZY_THRESHOLD
         success = downloader.download_songlist_across_channels(channel_urls, limit=limit, force_refresh_download_plan=force_refresh_download_plan, fuzzy_match=fuzzy_match, fuzzy_threshold=fuzzy_threshold)
     elif args.latest_per_channel:
         # Use provided file or default to data/channels.txt
@ -179,9 +193,11 @@ Examples:
             sys.exit(1)
         with open(channel_file, "r", encoding="utf-8") as f:
             channel_urls = [line.strip() for line in f if line.strip() and not line.strip().startswith("#")]
-        limit = args.limit if args.limit else 5
+        limit = args.limit if args.limit else DEFAULT_LATEST_PER_CHANNEL_LIMIT
         force_refresh_download_plan = args.force_download_plan if hasattr(args, 'force_download_plan') else False
-        success = downloader.download_latest_per_channel(channel_urls, limit=limit, force_refresh_download_plan=force_refresh_download_plan)
+        fuzzy_match = args.fuzzy_match if hasattr(args, 'fuzzy_match') else False
+        fuzzy_threshold = args.fuzzy_threshold if hasattr(args, 'fuzzy_threshold') else DEFAULT_FUZZY_THRESHOLD
+        success = downloader.download_latest_per_channel(channel_urls, limit=limit, force_refresh_download_plan=force_refresh_download_plan, fuzzy_match=fuzzy_match, fuzzy_threshold=fuzzy_threshold)
     elif args.url:
         success = downloader.download_channel_videos(args.url, force_refresh=args.refresh)
     else:
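
A note on the `hasattr(args, ...)` fallbacks in the hunks above: `argparse` sets an attribute for every declared option, so `hasattr` is true even when the flag was not passed. The sketch below illustrates this under an assumption, because the declarations of `--fuzzy-match` and `--fuzzy-threshold` are not shown in this diff: it supposes `--fuzzy-threshold` is declared with `default=None`, in which case the `DEFAULT_FUZZY_THRESHOLD` fallback would never be reached through `hasattr`.

```python
import argparse

DEFAULT_FUZZY_THRESHOLD = 85

parser = argparse.ArgumentParser()
parser.add_argument('--fuzzy-match', action='store_true')
# Assumed declaration; the real one is not visible in this diff.
parser.add_argument('--fuzzy-threshold', type=int, default=None)
parser.add_argument('--clear-server-duplicates', action='store_true')

# Dashes in option names become underscores in the parsed namespace.
args = parser.parse_args(['--clear-server-duplicates'])

# hasattr is True for every declared option, even when the flag is absent.
has_attr = hasattr(args, 'fuzzy_threshold')
via_hasattr = args.fuzzy_threshold if has_attr else DEFAULT_FUZZY_THRESHOLD

# An explicit None check does reach the fallback constant.
via_none_check = (
    args.fuzzy_threshold
    if args.fuzzy_threshold is not None
    else DEFAULT_FUZZY_THRESHOLD
)

print(has_attr, via_hasattr, via_none_check)
```

If the real declaration already carries a numeric default, both expressions give the same result and the `hasattr` guard is simply redundant.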


@ -0,0 +1,77 @@
"""
Configuration management utilities.
Handles loading and managing application configuration.
"""

import json
from pathlib import Path

DATA_DIR = Path("data")


def load_config():
    """Load configuration from data/config.json or return defaults."""
    config_file = DATA_DIR / "config.json"
    if config_file.exists():
        try:
            with open(config_file, 'r', encoding='utf-8') as f:
                return json.load(f)
        except (json.JSONDecodeError, FileNotFoundError) as e:
            print(f"Warning: Could not load config.json: {e}")
    return get_default_config()


def get_default_config():
    """Get the default configuration."""
    return {
        "download_settings": {
            "format": "best[height<=720][ext=mp4]/best[height<=720]/best[ext=mp4]/best",
            "preferred_resolution": "720p",
            "audio_format": "mp3",
            "audio_quality": "0",
            "subtitle_language": "en",
            "subtitle_format": "srt",
            "write_metadata": False,
            "write_thumbnail": False,
            "write_description": False,
            "write_annotations": False,
            "write_comments": False,
            "write_subtitles": False,
            "embed_metadata": False,
            "add_metadata": False,
            "continue_downloads": True,
            "no_overwrites": True,
            "ignore_errors": True,
            "no_warnings": False
        },
        "folder_structure": {
            "downloads_dir": "downloads",
            "logs_dir": "logs",
            "tracking_file": str(DATA_DIR / "karaoke_tracking.json")
        },
        "logging": {
            "level": "INFO",
            "format": "%(asctime)s - %(levelname)s - %(message)s",
            "include_console": True,
            "include_file": True
        },
        "yt_dlp_path": "downloader/yt-dlp.exe"
    }


def save_config(config):
    """Save configuration to data/config.json."""
    config_file = DATA_DIR / "config.json"
    config_file.parent.mkdir(exist_ok=True)
    try:
        with open(config_file, 'w', encoding='utf-8') as f:
            json.dump(config, f, indent=2, ensure_ascii=False)
        return True
    except Exception as e:
        print(f"Error saving config: {e}")
        return False


def update_config(updates):
    """Update configuration with new values."""
    config = load_config()
    config.update(updates)
    return save_config(config)
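
One thing to keep in mind about `update_config`: `dict.update` is shallow, so passing `{"download_settings": {"audio_format": "m4a"}}` replaces the whole `download_settings` section instead of changing one key. If per-key updates of nested sections are wanted, a recursive merge is one option. The following is a sketch only; `deep_merge` is a hypothetical helper that is not part of this commit, and the sample dictionary is a trimmed stand-in for the defaults above.

```python
import copy


def deep_merge(base, updates):
    """Return a new dict with `updates` merged into `base` recursively.

    Hypothetical helper: nested dicts are merged key by key, any other
    value in `updates` replaces the value in `base`.
    """
    merged = copy.deepcopy(base)
    for key, value in updates.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


defaults = {
    "download_settings": {"audio_format": "mp3", "no_overwrites": True},
    "yt_dlp_path": "downloader/yt-dlp.exe",
}
change = {"download_settings": {"audio_format": "m4a"}}

# Shallow update: the nested section is replaced wholesale.
shallow = dict(defaults)
shallow.update(change)

# Recursive merge: sibling keys in the nested section survive.
merged = deep_merge(defaults, change)

print(shallow["download_settings"])
print(merged["download_settings"])
```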


@ -0,0 +1,129 @@
"""
Download plan building utilities.
Handles pre-scanning channels and building download plans.
"""

from karaoke_downloader.youtube_utils import get_channel_info
from karaoke_downloader.fuzzy_matcher import (
    is_fuzzy_match,
    is_exact_match,
    create_song_key,
    extract_artist_title,
    get_similarity_function
)
from karaoke_downloader.cache_manager import (
    get_download_plan_cache_file,
    load_cached_plan,
    save_plan_cache,
    delete_plan_cache
)

# Constants
DEFAULT_FILENAME_LENGTH_LIMIT = 100
DEFAULT_ARTIST_LENGTH_LIMIT = 30
DEFAULT_TITLE_LENGTH_LIMIT = 60
DEFAULT_FUZZY_THRESHOLD = 85


def build_download_plan(channel_urls, undownloaded, tracker, yt_dlp_path, fuzzy_match=False, fuzzy_threshold=DEFAULT_FUZZY_THRESHOLD):
    """
    For each song in undownloaded, scan all channels for a match.
    Use fuzzy matching if enabled.
    Return (download_plan, unmatched_songs):
      - download_plan: list of dicts {artist, title, channel_name, channel_url, video_id, video_title, match_score}
      - unmatched_songs: list of songs not found in any channel
    """
    plan = []
    unmatched = []
    channel_match_counts = {}

    # Pre-process songlist for O(1) lookups
    song_keys = set()
    song_lookup = {}
    for song in undownloaded:
        key = create_song_key(song['artist'], song['title'])
        song_keys.add(key)
        song_lookup[key] = song

    for i, channel_url in enumerate(channel_urls, 1):
        channel_name, channel_id = get_channel_info(channel_url)
        print(f"\n🚦 Starting channel {i}/{len(channel_urls)}: {channel_name} ({channel_url})")
        available_videos = tracker.get_channel_video_list(
            channel_url,
            yt_dlp_path=str(yt_dlp_path),
            force_refresh=False
        )
        print(f" 📊 Channel has {len(available_videos)} videos to scan against {len(undownloaded)} songlist songs")
        matches_this_channel = 0
        # Initialized before branching so that the exact-match path can append too
        video_matches = []

        # Pre-process video titles for efficient matching
        if fuzzy_match:
            # For fuzzy matching, create normalized video keys
            for video in available_videos:
                v_artist, v_title = extract_artist_title(video['title'])
                video_key = create_song_key(v_artist, v_title)

                # Find best match among remaining songs
                best_match = None
                best_score = 0
                for song_key in song_keys:
                    if song_key in song_lookup:  # Only check unmatched songs
                        score = get_similarity_function()(song_key, video_key)
                        if score >= fuzzy_threshold and score > best_score:
                            best_score = score
                            best_match = song_key

                if best_match:
                    song = song_lookup[best_match]
                    video_matches.append({
                        'artist': song['artist'],
                        'title': song['title'],
                        'channel_name': channel_name,
                        'channel_url': channel_url,
                        'video_id': video['id'],
                        'video_title': video['title'],
                        'match_score': best_score
                    })
                    # Remove matched song from future consideration
                    del song_lookup[best_match]
                    song_keys.remove(best_match)
                    matches_this_channel += 1
        else:
            # For exact matching, use direct key comparison
            for video in available_videos:
                v_artist, v_title = extract_artist_title(video['title'])
                video_key = create_song_key(v_artist, v_title)

                if video_key in song_keys:
                    song = song_lookup[video_key]
                    video_matches.append({
                        'artist': song['artist'],
                        'title': song['title'],
                        'channel_name': channel_name,
                        'channel_url': channel_url,
                        'video_id': video['id'],
                        'video_title': video['title'],
                        'match_score': 100
                    })
                    # Remove matched song from future consideration
                    del song_lookup[video_key]
                    song_keys.remove(video_key)
                    matches_this_channel += 1

        # Add matches to plan
        plan.extend(video_matches)

        # Print match count once per channel
        channel_match_counts[channel_name] = matches_this_channel
        print(f" → Found {matches_this_channel} songlist matches in this channel.")

    # Remaining unmatched songs
    unmatched = list(song_lookup.values())

    # Print summary table
    print("\n📊 Channel match summary:")
    for channel, count in channel_match_counts.items():
        print(f" {channel}: {count} matches")
    print(f" TOTAL: {sum(channel_match_counts.values())} matches across {len(channel_match_counts)} channels.")

    return plan, unmatched
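
The matching strategy of `build_download_plan` (normalized keys in a dict for exact hits, a best-score scan for fuzzy hits, and removal of each matched song so it is not matched twice) can be sketched without the project's helpers. Everything named below is an assumption for illustration: `make_key` stands in for `create_song_key`, whose implementation is not shown in this diff, `similarity` uses the `difflib` fallback that appears in the removed code, and the sample videos carry pre-split `artist` and `title` fields instead of going through `extract_artist_title`.

```python
import difflib


def make_key(artist, title):
    """Hypothetical stand-in for create_song_key: lowercase 'artist_title'."""
    return f"{artist.strip().lower()}_{title.strip().lower()}"


def similarity(a, b):
    """Similarity score from 0 to 100 using difflib."""
    return int(difflib.SequenceMatcher(None, a, b).ratio() * 100)


def match_videos(songs, videos, fuzzy=False, threshold=85):
    """Return (plan, unmatched); each song is matched at most once."""
    remaining = {make_key(s["artist"], s["title"]): s for s in songs}
    plan = []
    for video in videos:
        video_key = make_key(video["artist"], video["title"])
        best_key, best_score = None, 0
        if video_key in remaining:
            # Exact hit via dict lookup
            best_key, best_score = video_key, 100
        elif fuzzy:
            # Scan only the songs that are still unmatched
            for song_key in remaining:
                score = similarity(song_key, video_key)
                if score >= threshold and score > best_score:
                    best_key, best_score = song_key, score
        if best_key is not None:
            song = remaining.pop(best_key)
            plan.append({
                "artist": song["artist"],
                "title": song["title"],
                "video_id": video["id"],
                "match_score": best_score,
            })
    return plan, list(remaining.values())


songs = [
    {"artist": "ABBA", "title": "Dancing Queen"},
    {"artist": "Queen", "title": "Bohemian Rhapsody"},
]
videos = [
    {"id": "v1", "artist": "abba", "title": "Dancing Queen"},
    {"id": "v2", "artist": "Queen", "title": "Bohemian Rapsody"},  # misspelled on purpose
]

exact_plan, exact_unmatched = match_videos(songs, videos)
fuzzy_plan, fuzzy_unmatched = match_videos(songs, videos, fuzzy=True)
print(len(exact_plan), len(fuzzy_plan))
```

With these inputs the exact pass only pairs the first video, while the fuzzy pass also recovers the misspelled title. Popping matched keys keeps later videos and later channels from claiming the same song.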


@ -9,12 +9,30 @@ from karaoke_downloader.tracking_manager import TrackingManager, SongStatus, For
 from karaoke_downloader.id3_utils import add_id3_tags, extract_artist_title
 from karaoke_downloader.songlist_manager import (
     load_songlist, load_songlist_tracking, save_songlist_tracking,
-    is_songlist_song_downloaded, mark_songlist_song_downloaded, normalize_title,
-    load_server_songs, is_song_on_server
+    is_songlist_song_downloaded, mark_songlist_song_downloaded, normalize_title
+)
+from karaoke_downloader.server_manager import (
+    load_server_songs, is_song_on_server, load_server_duplicates_tracking,
+    check_and_mark_server_duplicate, is_song_marked_as_server_duplicate
 )
 from karaoke_downloader.youtube_utils import get_channel_info, get_playlist_info
+from karaoke_downloader.fuzzy_matcher import get_similarity_function, is_fuzzy_match, is_exact_match, create_song_key, create_video_key
 import logging
 import hashlib
+from karaoke_downloader.download_planner import build_download_plan
+from karaoke_downloader.cache_manager import (
+    get_download_plan_cache_file, load_cached_plan, save_plan_cache, delete_plan_cache
+)
+from karaoke_downloader.video_downloader import download_video_and_track, is_valid_mp4, execute_download_plan
+from karaoke_downloader.channel_manager import reset_channel_downloads, download_from_file
+
+# Constants
+DEFAULT_FUZZY_THRESHOLD = 85
+DEFAULT_CACHE_EXPIRATION_DAYS = 1
+DEFAULT_FILENAME_LENGTH_LIMIT = 100
+DEFAULT_ARTIST_LENGTH_LIMIT = 30
+DEFAULT_TITLE_LENGTH_LIMIT = 60
+DEFAULT_DISPLAY_LIMIT = 10

 DATA_DIR = Path("data")
@ -75,95 +93,87 @@ class KaraokeDownloader:
"yt_dlp_path": "downloader/yt-dlp.exe" "yt_dlp_path": "downloader/yt-dlp.exe"
         }

-    def reset_channel_downloads(self, channel_name, reset_songlist=False, delete_files=False):
-        """
-        Reset all tracking and optionally files for a channel.
-        If reset_songlist is False, songlist songs are preserved (tracking and files).
-        If reset_songlist is True, songlist songs for this channel are also reset/deleted.
-        """
-        print(f"\n🔄 Resetting channel: {channel_name} (reset_songlist={reset_songlist}, delete_files={delete_files})")
-        # Find channel_id from channel_name
-        channel_id = None
-        for pid, playlist in self.tracker.data.get('playlists', {}).items():
-            if playlist['name'] == channel_name or pid == channel_name:
-                channel_id = pid
-                break
-        if not channel_id:
-            print(f"❌ Channel '{channel_name}' not found in tracking.")
-            return
-        # Get all songs for this channel
-        songs_to_reset = []
-        for song_id, song in self.tracker.data.get('songs', {}).items():
-            if song['playlist_id'] == channel_id:
-                # Check if this is a songlist song
-                artist, title = song.get('artist', ''), song.get('title', song.get('name', ''))
-                key = f"{artist.lower()}_{normalize_title(title)}"
-                is_songlist = key in self.songlist_tracking
-                if is_songlist and not reset_songlist:
-                    continue # skip songlist songs if not resetting them
-                songs_to_reset.append((song_id, song, is_songlist))
-        # Reset tracking and optionally delete files
-        files_preserved = 0
-        files_deleted = 0
-        for song_id, song, is_songlist in songs_to_reset:
-            # Remove from main tracking
-            self.tracker.data['songs'][song_id]['status'] = 'NOT_DOWNLOADED'
-            self.tracker.data['songs'][song_id]['formats'] = {}
-            self.tracker.data['songs'][song_id]['last_error'] = ''
-            self.tracker.data['songs'][song_id]['download_attempts'] = 0
-            self.tracker.data['songs'][song_id]['last_updated'] = None
-            # Remove from songlist tracking if needed
-            if is_songlist and reset_songlist:
-                artist, title = song.get('artist', ''), song.get('title', song.get('name', ''))
-                key = f"{artist.lower()}_{normalize_title(title)}"
-                if key in self.songlist_tracking:
-                    del self.songlist_tracking[key]
-            # Delete file if requested
-            if delete_files:
-                file_path = song.get('file_path')
-                if file_path:
-                    try:
-                        p = Path(file_path)
-                        if p.exists():
-                            p.unlink()
-                            files_deleted += 1
-                        else:
-                            files_preserved += 1
-                    except Exception as e:
-                        print(f"⚠️ Could not delete file {file_path}: {e}")
-        # --- FIX: Remove all songlist_tracking entries for this channel if reset_songlist is True ---
-        if reset_songlist:
-            keys_to_remove = [k for k, v in self.songlist_tracking.items() if v.get('channel') == channel_name]
-            for k in keys_to_remove:
-                del self.songlist_tracking[k]
-        # Save changes
-        self.tracker.force_save()
-        save_songlist_tracking(self.songlist_tracking, str(self.songlist_tracking_file))
-        print(f"✅ Reset {len(songs_to_reset)} songs for channel '{channel_name}'.")
-        if delete_files:
-            print(f" Files deleted: {files_deleted}, files preserved: {files_preserved}")
-        if not reset_songlist:
-            print(f" Songlist songs were preserved.")
-
-    def download_from_file(self, file_path, force_refresh=False):
-        file = Path(file_path)
-        if not file.exists():
-            print(f"❌ File not found: {file_path}")
-            return False
-        with open(file, "r", encoding="utf-8") as f:
-            urls = [line.strip() for line in f if line.strip() and not line.strip().startswith("#")]
-        if not urls:
-            print(f"❌ No URLs found in {file_path}")
-            return False
-        all_success = True
-        for url in urls:
-            print(f"\n➡️ Processing: {url}")
-            success = self.download_channel_videos(url, force_refresh=force_refresh)
-            if not success:
-                all_success = False
-        return all_success
-
-    def download_channel_videos(self, url, force_refresh=False, fuzzy_match=False, fuzzy_threshold=90):
+    def _should_skip_song(self, artist, title, channel_name, video_id, video_title, server_songs=None, server_duplicates_tracking=None):
+        """
+        Centralized method to check if a song should be skipped.
+        Performs four checks in order:
+        1. Already downloaded (tracking)
+        2. File exists on filesystem
+        3. Already on server
+        4. Previously failed download (bad file)
+
+        Returns:
+            tuple: (should_skip, reason, total_filtered)
+        """
+        total_filtered = 0
+
+        # Check 1: Already downloaded by this system
+        if self.tracker.is_song_downloaded(artist, title, channel_name, video_id):
+            return True, "already downloaded", total_filtered
+
+        # Check 2: File already exists on filesystem
+        # Generate the expected filename based on the download mode context
+        safe_title = title
+        invalid_chars = ['?', ':', '*', '"', '<', '>', '|', '/', '\\']
+        for char in invalid_chars:
+            safe_title = safe_title.replace(char, "")
+        safe_title = safe_title.replace("...", "").replace("..", "").replace(".", "").strip()
+
+        # Try different filename patterns that might exist
+        possible_filenames = [
+            f"{artist} - {safe_title}.mp4", # Songlist mode
+            f"{channel_name} - {safe_title}.mp4", # Latest-per-channel mode
+            f"{artist} - {safe_title} (Karaoke Version).mp4" # Channel videos mode
+        ]
+
+        for filename in possible_filenames:
+            if len(filename) > DEFAULT_FILENAME_LENGTH_LIMIT:
+                # Apply length limits if needed
+                safe_artist = artist.replace("'", "").replace('"', "").strip()
+                filename = f"{safe_artist[:DEFAULT_ARTIST_LENGTH_LIMIT]} - {safe_title[:DEFAULT_TITLE_LENGTH_LIMIT]}.mp4"
+
+            output_path = self.downloads_dir / channel_name / filename
+            if output_path.exists() and output_path.stat().st_size > 0:
+                return True, "file exists", total_filtered
+
+        # Check 3: Already on server (if server data provided)
+        if server_songs is not None and server_duplicates_tracking is not None:
+            from karaoke_downloader.server_manager import check_and_mark_server_duplicate
+            if check_and_mark_server_duplicate(server_songs, server_duplicates_tracking, artist, title, video_title, channel_name):
+                total_filtered += 1
+                return True, "on server", total_filtered
+
+        # Check 4: Previously failed download (bad file)
+        if self.tracker.is_song_failed(artist, title, channel_name, video_id):
+            return True, "previously failed", total_filtered
+
+        return False, None, total_filtered
+
+    def _mark_song_failed(self, artist, title, video_id, channel_name, error_message):
+        """
+        Centralized method to mark a song as failed in tracking.
+        """
+        self.tracker.mark_song_failed(artist, title, video_id, channel_name, error_message)
+        print(f"🏷️ Marked song as failed: {artist} - {title}")
+
+    def _handle_download_failure(self, artist, title, video_id, channel_name, error_type, error_details=""):
+        """
+        Centralized method to handle download failures.
+
+        Args:
+            artist: Song artist
+            title: Song title
+            video_id: YouTube video ID
+            channel_name: Channel name
+            error_type: Type of error (e.g., "yt-dlp failed", "file verification failed")
+            error_details: Additional error details
+        """
+        error_msg = f"{error_type}"
+        if error_details:
+            error_msg += f": {error_details}"
+        self._mark_song_failed(artist, title, video_id, channel_name, error_msg)
+
+    def download_channel_videos(self, url, force_refresh=False, fuzzy_match=False, fuzzy_threshold=DEFAULT_FUZZY_THRESHOLD):
         """Download videos from a channel or playlist URL, respecting songlist-only and limit flags. Supports fuzzy matching."""
         channel_name, channel_id = get_channel_info(url)
         print(f"\n🎬 Downloading from channel: {channel_name} ({url})")
@ -171,6 +181,11 @@ class KaraokeDownloader:
         if not songlist:
             print("⚠️ No songlist loaded. Skipping.")
             return False
+
+        # Load server songs and duplicates tracking for availability checking
+        server_songs = load_server_songs()
+        server_duplicates_tracking = load_server_duplicates_tracking()
+
         limit = self.config.get('limit', 1)
         cmd = [
             str(self.yt_dlp_path),
cmd = [ cmd = [
str(self.yt_dlp_path), str(self.yt_dlp_path),
@ -191,21 +206,14 @@ class KaraokeDownloader:
                 title, video_id = parts[0].strip(), parts[1].strip()
                 available_videos.append({'title': title, 'id': video_id})
         # Normalize songlist for matching
-        try:
-            from rapidfuzz import fuzz
-            def similarity(a, b):
-                return fuzz.ratio(a, b)
-        except ImportError:
-            import difflib
-            def similarity(a, b):
-                return int(difflib.SequenceMatcher(None, a, b).ratio() * 100)
         normalized_songlist = {
-            f"{s['artist'].lower()}_{normalize_title(s['title'])}": s for s in songlist
+            create_song_key(s['artist'], s['title']): s for s in songlist
         }
         matches = []
+        similarity = get_similarity_function()
         for video in available_videos:
             artist, title = extract_artist_title(video['title'])
-            key = f"{artist.lower()}_{normalize_title(title)}"
+            key = create_song_key(artist, title)
             if fuzzy_match:
                 # Fuzzy match against all songlist keys
                 best_score = 0
@ -216,15 +224,26 @@ class KaraokeDownloader:
                         best_score = score
                         best_song = song
                 if best_score >= fuzzy_threshold and best_song:
+                    # Check if already downloaded or on server
                     if not is_songlist_song_downloaded(self.songlist_tracking, best_song['artist'], best_song['title']):
-                        matches.append((video, best_song))
-                        print(f" → Fuzzy match: {artist} - {title} <-> {best_song['artist']} - {best_song['title']} (score: {best_score})")
+                        # Check if already marked as server duplicate
+                        if not is_song_marked_as_server_duplicate(server_duplicates_tracking, best_song['artist'], best_song['title']):
+                            # Check if already on server and mark for future skipping
+                            if not check_and_mark_server_duplicate(server_songs, server_duplicates_tracking, best_song['artist'], best_song['title'], video['title'], channel_name):
+                                matches.append((video, best_song))
+                                print(f" → Fuzzy match: {artist} - {title} <-> {best_song['artist']} - {best_song['title']} (score: {best_score})")
                         if len(matches) >= limit:
                             break
             else:
                 if key in normalized_songlist:
-                    if not is_songlist_song_downloaded(self.songlist_tracking, artist, title):
-                        matches.append((video, normalized_songlist[key]))
+                    song = normalized_songlist[key]
+                    # Check if already downloaded or on server
+                    if not is_songlist_song_downloaded(self.songlist_tracking, song['artist'], song['title']):
+                        # Check if already marked as server duplicate
+                        if not is_song_marked_as_server_duplicate(server_duplicates_tracking, song['artist'], song['title']):
+                            # Check if already on server and mark for future skipping
+                            if not check_and_mark_server_duplicate(server_songs, server_duplicates_tracking, song['artist'], song['title'], video['title'], channel_name):
+                                matches.append((video, song))
                         if len(matches) >= limit:
                             break
         if not matches:
@ -247,12 +266,18 @@ class KaraokeDownloader:
                 subprocess.run(cmd, check=True)
             except subprocess.CalledProcessError as e:
                 print(f"❌ yt-dlp failed: {e}")
+                # Mark song as failed in tracking immediately
+                self._handle_download_failure(artist, title, video['id'], channel_name, "yt-dlp failed", str(e))
                 continue
             if not output_path.exists() or output_path.stat().st_size == 0:
                 print(f"❌ Download failed or file is empty: {output_path}")
+                # Mark song as failed in tracking immediately
+                self._handle_download_failure(artist, title, video['id'], channel_name, "Download failed", "file does not exist or is empty")
                 continue
-            if not self._is_valid_mp4(output_path):
+            if not is_valid_mp4(output_path):
                 print(f"❌ File is not a valid MP4: {output_path}")
+                # Mark song as failed in tracking immediately
+                self._handle_download_failure(artist, title, video['id'], channel_name, "Download failed", "file is not a valid MP4")
                 continue
             add_id3_tags(output_path, f"{artist} - {title} (Karaoke Version)", channel_name)
             mark_songlist_song_downloaded(self.songlist_tracking, artist, title, channel_name, output_path)
@ -260,107 +285,7 @@ class KaraokeDownloader:
             print(f"🎉 All post-processing complete for: {output_path}")
         return True

-    def build_download_plan(self, channel_urls, undownloaded, fuzzy_match=False, fuzzy_threshold=90):
-        """
-        For each song in undownloaded, scan all channels for a match.
-        Use fuzzy matching if enabled.
-        Return (download_plan, unmatched_songs):
-          - download_plan: list of dicts {artist, title, channel_name, channel_url, video_id, video_title}
-          - unmatched_songs: list of songs not found in any channel
-        """
-        try:
-            from rapidfuzz import fuzz
-            def similarity(a, b):
-                return fuzz.ratio(a, b)
-        except ImportError:
-            import difflib
-            def similarity(a, b):
-                return int(difflib.SequenceMatcher(None, a, b).ratio() * 100)
-        plan = []
-        unmatched = []
-        channel_match_counts = {}
-        for channel_url in channel_urls:
-            channel_name, channel_id = get_channel_info(channel_url)
-            print(f"\n🚦 Starting channel: {channel_name} ({channel_url})")
-            available_videos = self.tracker.get_channel_video_list(
-                channel_url,
-                yt_dlp_path=str(self.yt_dlp_path),
-                force_refresh=False
-            )
-            matches_this_channel = 0
-            channel_fuzzy_matches = [] # For optional top-N reporting
-            for song in undownloaded:
-                artist, title = song['artist'], song['title']
-                found = False
-                song_key = f"{artist.lower()}_{normalize_title(title)}"
-                for video in available_videos:
-                    v_artist, v_title = extract_artist_title(video['title'])
-                    video_key = f"{v_artist.lower()}_{normalize_title(v_title)}"
-                    if fuzzy_match:
-                        score = similarity(song_key, video_key)
-                        if score >= fuzzy_threshold:
-                            if not any(p['artist'] == artist and p['title'] == title for p in plan):
-                                plan.append({
-                                    'artist': artist,
-                                    'title': title,
-                                    'channel_name': channel_name,
-                                    'channel_url': channel_url,
-                                    'video_id': video['id'],
-                                    'video_title': video['title'],
-                                    'match_score': score
-                                })
-                                # print(f" → Match: \"{artist} - {title}\" <-> \"{video['title']}\" (score: {score})")
-                                matches_this_channel += 1
-                            found = True
-                            break
-                    else:
-                        if (normalize_title(v_artist) == normalize_title(artist) and normalize_title(v_title) == normalize_title(title)) or \
-                           (normalize_title(video['title']) == normalize_title(f"{artist} - {title}")):
-                            if not any(p['artist'] == artist and p['title'] == title for p in plan):
-                                plan.append({
-                                    'artist': artist,
-                                    'title': title,
-                                    'channel_name': channel_name,
-                                    'channel_url': channel_url,
-                                    'video_id': video['id'],
-                                    'video_title': video['title'],
-                                    'match_score': 100
-                                })
-                                # print(f" → Match: \"{artist} - {title}\" <-> \"{video['title']}\" (exact)")
-                                matches_this_channel += 1
-                            found = True
-                            break
-                # Don't break here; keep looking for all matches in this channel
-            channel_match_counts[channel_name] = matches_this_channel
-            print(f" → Found {matches_this_channel} songlist matches in this channel.")
-            # Optionally, print top 3 fuzzy matches for review
-            # if fuzzy_match and channel_fuzzy_matches:
-            #     top_matches = sorted(channel_fuzzy_matches, key=lambda x: -x[3])[:3]
-            #     for a, t, vt, s in top_matches:
-            #         print(f" Top match: {a} - {t} <-> {vt} (score: {s})")
-        # Now find unmatched songs
-        for song in undownloaded:
-            if not any(p['artist'] == song['artist'] and p['title'] == song['title'] for p in plan):
-                unmatched.append(song)
-        # Print summary table
-        print("\n📊 Channel match summary:")
-        for channel, count in channel_match_counts.items():
-            print(f" {channel}: {count} matches")
-        print(f" TOTAL: {sum(channel_match_counts.values())} matches across {len(channel_match_counts)} channels.")
-        return plan, unmatched
-
-    def get_download_plan_cache_file(self, mode, **kwargs):
-        """Generate a unique cache filename based on mode and key parameters."""
-        parts = [f"plan_{mode}"]
-        for k, v in sorted(kwargs.items()):
-            parts.append(f"{k}{v}")
-        base = "_".join(parts)
-        # Hash for safety if string is long
-        if len(base) > 60:
-            base = base[:40] + "_" + hashlib.md5(base.encode()).hexdigest()
-        return Path(f"data/{base}.json")
-
-    def download_songlist_across_channels(self, channel_urls, limit=None, force_refresh_download_plan=False, fuzzy_match=False, fuzzy_threshold=90):
+    def download_songlist_across_channels(self, channel_urls, limit=None, force_refresh_download_plan=False, fuzzy_match=False, fuzzy_threshold=DEFAULT_FUZZY_THRESHOLD):
         """
         For each song in the songlist, try each channel in order and download from the first channel where it is found.
         Download up to 'limit' songs, skipping any that cannot be found, until the limit is reached or all possible matches are exhausted.
@ -371,35 +296,52 @@ class KaraokeDownloader:
             return False
         # Filter for songs not yet downloaded
         undownloaded = [s for s in songlist if not is_songlist_song_downloaded(self.songlist_tracking, s['artist'], s['title'])]
-        print(f"🎯 {len(songlist)} total unique songs in songlist.")
-        print(f"🎯 {len(undownloaded)} unique songlist songs to download.")
-        # Further filter out songs already on server
-        not_on_server = [s for s in undownloaded if not is_song_on_server(self.server_songs, s['artist'], s['title'])]
-        server_available = len(undownloaded) - len(not_on_server)
+        print(f"\n🎯 {len(songlist)} total unique songs in songlist.")
+        print(f"\n🎯 {len(undownloaded)} unique songlist songs to download.")
+
+        # Load server songs and duplicates tracking for availability checking
+        server_songs = load_server_songs()
+        server_duplicates_tracking = load_server_duplicates_tracking()
+
+        # Further filter out songs already on server or marked as duplicates
+        not_on_server = []
+        server_available = 0
+        marked_duplicates = 0
+
+        for song in undownloaded:
+            artist, title = song['artist'], song['title']
+
+            # Check if already marked as server duplicate
+            if is_song_marked_as_server_duplicate(server_duplicates_tracking, artist, title):
+                marked_duplicates += 1
+                continue
+
+            # Check if already on server and mark for future skipping
+            if check_and_mark_server_duplicate(server_songs, server_duplicates_tracking, artist, title, f"{artist} - {title}", "songlist"):
+                server_available += 1
+                continue
+
+            not_on_server.append(song)
+
         if server_available > 0:
-            print(f"🎵 {server_available} songs already available on server, skipping.")
+            print(f"\n🎵 {server_available} songs already available on server, skipping.")
+        if marked_duplicates > 0:
+            print(f"\n🏷️ {marked_duplicates} songs previously marked as server duplicates, skipping.")
+
         undownloaded = not_on_server
-        print(f"🎯 {len(undownloaded)} songs need to be downloaded.")
+        print(f"\n🎯 {len(undownloaded)} songs need to be downloaded.")
         if not undownloaded:
             print("🎵 All songlist songs already downloaded.")
             return True
         # --- FAST MODE: Early exit and deduplication if limit is set ---
         if limit is not None:
             print("\n⚡ Fast mode enabled: will stop as soon as limit is reached with successful downloads.")
-            try:
-                from rapidfuzz import fuzz
-                def similarity(a, b):
-                    return fuzz.ratio(a, b)
-            except ImportError:
-                import difflib
-                def similarity(a, b):
-                    return int(difflib.SequenceMatcher(None, a, b).ratio() * 100)
+            similarity = get_similarity_function()
             downloaded_count = 0
             unique_keys = set()
             total_attempted = 0
             for channel_url in channel_urls:
                 channel_name, channel_id = get_channel_info(channel_url)
+                print(f"\n🚦 Starting channel: {channel_name} ({channel_url})")
                 available_videos = self.tracker.get_channel_video_list(
                     channel_url,
                     yt_dlp_path=str(self.yt_dlp_path),
@@ -407,22 +349,28 @@ class KaraokeDownloader:
                 )
                 for song in undownloaded:
                     artist, title = song['artist'], song['title']
-                    key = f"{artist.lower()}_{normalize_title(title)}"
+                    key = create_song_key(artist, title)
                     if key in unique_keys:
                         continue  # Already downloaded or queued
+                    # Check if should skip this song during planning phase
+                    should_skip, reason, _ = self._should_skip_song(
+                        artist, title, channel_name, None, f"{artist} - {title}",
+                        server_songs, server_duplicates_tracking
+                    )
+                    if should_skip:
+                        continue
                     found = False
                     for video in available_videos:
                         v_artist, v_title = extract_artist_title(video['title'])
-                        video_key = f"{v_artist.lower()}_{normalize_title(v_title)}"
+                        video_key = create_song_key(v_artist, v_title)
                         if fuzzy_match:
                             score = similarity(key, video_key)
                             if score >= fuzzy_threshold:
-                                print(f" → Match: \"{artist} - {title}\" <-> \"{video['title']}\" (score: {score})")
                                 found = True
                         else:
-                            if (normalize_title(v_artist) == normalize_title(artist) and normalize_title(v_title) == normalize_title(title)) or \
-                               (normalize_title(video['title']) == normalize_title(f"{artist} - {title}")):
-                                print(f" → Match: \"{artist} - {title}\" <-> \"{video['title']}\" (exact)")
+                            if is_exact_match(artist, title, video['title']):
                                 found = True
                         if found:
                             print(f"\n⬇️ Downloading {downloaded_count+1} of {limit}:")
@@ -441,8 +389,18 @@ class KaraokeDownloader:
                             safe_artist = safe_artist.strip()
                             filename = f"{safe_artist} - {safe_title}.mp4"
                             # Call the actual download function (simulate the same as in the plan loop)
-                            success = self._download_video_and_track(
-                                channel_name, channel_url, video['id'], video['title'], artist, title, filename
+                            success = download_video_and_track(
+                                self.yt_dlp_path,
+                                self.config,
+                                self.downloads_dir,
+                                self.songlist_tracking,
+                                channel_name,
+                                channel_url,
+                                video['id'],
+                                video['title'],
+                                artist,
+                                title,
+                                filename
                             )
                             total_attempted += 1
                             if success:
@@ -459,195 +417,89 @@ class KaraokeDownloader:
             if downloaded_count < limit:
                 print(f"⚠️ Only {downloaded_count} songs were downloaded. Some may not have been found or downloads failed.")
             return True
         # --- ORIGINAL FULL PLAN MODE (no limit) ---
-        # Removed per-song printout for cleaner output
-        # print("🔍 Songs to search for:")
-        # for song in undownloaded:
-        #     print(f" - {song['artist']} - {song['title']}")
         # --- Download plan cache logic ---
         plan_mode = "songlist"
-        plan_kwargs = {"limit": limit or "all", "channels": len(channel_urls)}
-        cache_file = self.get_download_plan_cache_file(plan_mode, **plan_kwargs)
+        # Include all parameters that affect the plan generation
+        plan_kwargs = {
+            "limit": limit or "all",
+            "channels": len(channel_urls),
+            "fuzzy": fuzzy_match,
+            "threshold": fuzzy_threshold
+        }
+        # Add channel URLs hash to ensure same channels = same cache
+        channels_hash = hashlib.md5("|".join(sorted(channel_urls)).encode()).hexdigest()[:8]
+        plan_kwargs["channels_hash"] = channels_hash
+        cache_file = get_download_plan_cache_file(plan_mode, **plan_kwargs)
         use_cache = False
-        if not force_refresh_download_plan and cache_file.exists():
-            try:
-                with open(cache_file, 'r', encoding='utf-8') as f:
-                    cache_data = json.load(f)
-                cache_time = datetime.fromisoformat(cache_data.get('timestamp'))
-                if datetime.now() - cache_time < timedelta(days=1):
-                    print(f"🗂️ Using cached download plan from {cache_time} ({cache_file.name}).")
-                    download_plan = cache_data['download_plan']
-                    unmatched = cache_data['unmatched']
-                    use_cache = True
-            except Exception as e:
-                print(f"⚠️ Could not load download plan cache: {e}")
+        download_plan, unmatched = load_cached_plan(cache_file)
+        if not force_refresh_download_plan and download_plan is not None:
+            use_cache = True
         if not use_cache:
-            print("\n🔎 Pre-scanning channels for matches...")
-            download_plan, unmatched = self.build_download_plan(channel_urls, undownloaded, fuzzy_match=fuzzy_match, fuzzy_threshold=fuzzy_threshold)
-            if download_plan:
-                cache_data = {
-                    'timestamp': datetime.now().isoformat(),
-                    'download_plan': download_plan,
-                    'unmatched': unmatched
-                }
-                with open(cache_file, 'w', encoding='utf-8') as f:
-                    json.dump(cache_data, f, indent=2, ensure_ascii=False)
-                print(f"🗂️ Saved new download plan cache: {cache_file.name}")
-            else:
-                if cache_file.exists():
-                    cache_file.unlink()
-                print(f"🗂️ No matches found, not saving download plan cache.")
+            print("\n🔍 Pre-scanning channels for matches...")
+            download_plan, unmatched = build_download_plan(
+                channel_urls,
+                undownloaded,
+                self.tracker,
+                self.yt_dlp_path,
+                fuzzy_match=fuzzy_match,
+                fuzzy_threshold=fuzzy_threshold
+            )
+            save_plan_cache(cache_file, download_plan, unmatched)
         print(f"\n📊 Download plan ready: {len(download_plan)} songs will be downloaded.")
         print(f"❌ {len(unmatched)} songs could not be found in any channel.")
         if unmatched:
             print("Unmatched songs:")
-            for song in unmatched[:10]:
+            for song in unmatched[:DEFAULT_DISPLAY_LIMIT]:
                 print(f" - {song['artist']} - {song['title']}")
-            if len(unmatched) > 10:
-                print(f" ...and {len(unmatched)-10} more.")
+            if len(unmatched) > DEFAULT_DISPLAY_LIMIT:
+                print(f" ...and {len(unmatched)-DEFAULT_DISPLAY_LIMIT} more.")
         # --- Download phase ---
-        downloaded_count = 0
-        total_to_download = limit if limit is not None else len(download_plan)
-        for idx, item in enumerate(download_plan):
-            if limit is not None and downloaded_count >= limit:
-                break
-            artist = item['artist']
-            title = item['title']
-            channel_name = item['channel_name']
-            channel_url = item['channel_url']
-            video_id = item['video_id']
-            video_title = item['video_title']
-            print(f"\n⬇️ Downloading {idx+1} of {total_to_download}:")
-            print(f" 📋 Songlist: {artist} - {title}")
-            print(f" 🎬 Video: {video_title} ({channel_name})")
-            if 'match_score' in item:
-                print(f" 🎯 Match Score: {item['match_score']:.1f}%")
-            # --- Existing download logic here, using channel_name, video_id, etc. ---
-            # (Copy the download logic from the previous loop, using these variables)
-            # Create a shorter, safer filename - do this ONCE and use consistently
-            safe_title = title.replace("(From ", "").replace(")", "").replace(" - ", " ").replace(":", "").replace("'", "").replace('"', "")
-            safe_artist = artist.replace("'", "").replace('"', "")
-            # Remove all Windows-invalid characters
-            invalid_chars = ['?', ':', '*', '"', '<', '>', '|', '/', '\\']
-            for char in invalid_chars:
-                safe_title = safe_title.replace(char, "")
-                safe_artist = safe_artist.replace(char, "")
-            # Also remove any other potentially problematic characters
-            safe_title = safe_title.replace("...", "").replace("..", "").replace(".", "").strip()
-            safe_artist = safe_artist.strip()
-            filename = f"{safe_artist} - {safe_title}.mp4"
-            # Limit filename length to avoid Windows path issues
-            if len(filename) > 100:
-                filename = f"{safe_artist[:30]} - {safe_title[:60]}.mp4"
-            output_path = self.downloads_dir / channel_name / filename
-            output_path.parent.mkdir(parents=True, exist_ok=True)
-            print(f"⬇️ Downloading: {artist} - {title} -> {output_path}")
-            video_url = f"https://www.youtube.com/watch?v={video_id}"
-            dlp_cmd = [
-                str(self.yt_dlp_path),
-                "--no-check-certificates",
-                "--ignore-errors",
-                "--no-warnings",
-                "-o", str(output_path),
-                "-f", self.config["download_settings"]["format"],
-                video_url
-            ]
-            print(f"🔧 Running command: {' '.join(dlp_cmd)}")
-            print(f"📺 Resolution settings: {self.config.get('download_settings', {}).get('preferred_resolution', 'Unknown')}")
-            print(f"🎬 Format string: {self.config.get('download_settings', {}).get('format', 'Unknown')}")
-            # Debug: Show available formats (optional)
-            if self.config.get('debug_show_formats', False):
-                print(f"🔍 Checking available formats for: {video_url}")
-                format_cmd = [
-                    str(self.yt_dlp_path),
-                    "--list-formats",
-                    video_url
-                ]
-                try:
-                    format_result = subprocess.run(format_cmd, capture_output=True, text=True, timeout=30)
-                    print(f"📋 Available formats:\n{format_result.stdout}")
-                except Exception as e:
-                    print(f"⚠️ Could not check formats: {e}")
-            try:
-                result = subprocess.run(dlp_cmd, capture_output=True, text=True, check=True)
-                print(f"✅ yt-dlp completed successfully")
-                print(f"📄 yt-dlp stdout: {result.stdout}")
-            except subprocess.CalledProcessError as e:
-                print(f"❌ yt-dlp failed with exit code {e.returncode}")
-                print(f"❌ yt-dlp stderr: {e.stderr}")
-                continue
-            if not output_path.exists():
-                print(f"❌ Download failed: file does not exist: {output_path}")
-                # Check if yt-dlp saved it somewhere else
-                possible_files = list(output_path.parent.glob("*.mp4"))
-                if possible_files:
-                    print(f"🔍 Found these files in the directory: {[f.name for f in possible_files]}")
-                    # Look for a file that matches our pattern (artist - title)
-                    artist_part = safe_artist.lower()
-                    title_part = safe_title.lower()
-                    for file in possible_files:
-                        file_lower = file.stem.lower()
-                        if artist_part in file_lower and any(word in file_lower for word in title_part.split()):
-                            print(f"🎯 Found matching file: {file.name}")
-                            output_path = file
-                            break
-                    else:
-                        print(f"❌ No matching file found for: {artist} - {title}")
-                        continue
-                else:
-                    continue
-            if output_path.stat().st_size == 0:
-                print(f"❌ Download failed: file is empty (0 bytes): {output_path}")
-                continue
-            # TEMP: Skipping MP4 validation for debugging
-            # if not self._is_valid_mp4(output_path):
-            #     print(f"❌ File is not a valid MP4: {output_path}")
-            #     continue
-            add_id3_tags(output_path, f"{artist} - {title} (Karaoke Version)", channel_name)
-            mark_songlist_song_downloaded(self.songlist_tracking, artist, title, channel_name, output_path)
-            print(f"✅ Downloaded and tracked: {artist} - {title}")
-            print(f"🎉 All post-processing complete for: {output_path}")
-            downloaded_count += 1
-            # After each download, if this was the last song, delete the cache
-            if idx + 1 == total_to_download:
-                if cache_file.exists():
-                    try:
-                        cache_file.unlink()
-                        print(f"🗑️ Deleted download plan cache after last song downloaded: {cache_file.name}")
-                    except Exception as e:
-                        print(f"⚠️ Could not delete download plan cache: {e}")
-        print(f"🎉 Downloaded {downloaded_count} songlist songs.")
-        print(f"📊 Summary: Processed {len(channel_urls)} channels, found {downloaded_count} songs, {len(unmatched)} songs not found.")
-        # Delete the download plan cache if all planned downloads are done
-        if cache_file.exists():
-            try:
-                cache_file.unlink()
-                print(f"🗑️ Deleted download plan cache after completion: {cache_file.name}")
-            except Exception as e:
-                print(f"⚠️ Could not delete download plan cache: {e}")
-        return True
+        downloaded_count, success = execute_download_plan(
+            download_plan=download_plan,
+            unmatched=unmatched,
+            cache_file=cache_file,
+            config=self.config,
+            yt_dlp_path=self.yt_dlp_path,
+            downloads_dir=self.downloads_dir,
+            songlist_tracking=self.songlist_tracking,
+            limit=limit
+        )
+        return success
-    def download_latest_per_channel(self, channel_urls, limit=5, force_refresh_download_plan=False):
+    def download_latest_per_channel(self, channel_urls, limit=5, force_refresh_download_plan=False, fuzzy_match=False, fuzzy_threshold=DEFAULT_FUZZY_THRESHOLD):
         """
         Download the latest N videos from each channel in channel_urls.
         - Pre-scan all channels for their latest N videos.
+        - Check against local songs file to avoid duplicates.
         - Build a per-channel download plan and cache it.
         - Resume robustly if interrupted (removes each channel from the plan as it completes).
         - Deletes the plan cache when all channels are done.
         """
+        # Load server songs for availability checking
+        server_songs = load_server_songs()
+        server_duplicates_tracking = load_server_duplicates_tracking()
         plan_mode = "latest_per_channel"
-        plan_kwargs = {"limit": limit, "channels": len(channel_urls)}
-        cache_file = self.get_download_plan_cache_file(plan_mode, **plan_kwargs)
+        # Include all parameters that affect the plan generation
+        plan_kwargs = {
+            "limit": limit,
+            "channels": len(channel_urls),
+            "fuzzy": fuzzy_match,
+            "threshold": fuzzy_threshold
+        }
+        # Add channel URLs hash to ensure same channels = same cache
+        channels_hash = hashlib.md5("|".join(sorted(channel_urls)).encode()).hexdigest()[:8]
+        plan_kwargs["channels_hash"] = channels_hash
+        cache_file = get_download_plan_cache_file(plan_mode, **plan_kwargs)
         use_cache = False
         if not force_refresh_download_plan and cache_file.exists():
             try:
                 with open(cache_file, 'r', encoding='utf-8') as f:
                     plan_data = json.load(f)
                 cache_time = datetime.fromisoformat(plan_data.get('timestamp'))
-                if datetime.now() - cache_time < timedelta(days=1):
+                if datetime.now() - cache_time < timedelta(days=DEFAULT_CACHE_EXPIRATION_DAYS):
                     print(f"🗂️ Using cached latest-per-channel plan from {cache_time} ({cache_file.name}).")
                     channel_plans = plan_data['channel_plans']
                     use_cache = True
@@ -656,6 +508,10 @@ class KaraokeDownloader:
         if not use_cache:
             print("\n🔎 Pre-scanning all channels for latest videos...")
             channel_plans = []
+            total_found = 0
+            total_filtered = 0
+            total_marked = 0
             for channel_url in channel_urls:
                 channel_name, channel_id = get_channel_info(channel_url)
                 print(f"\n🚦 Starting channel: {channel_name} ({channel_url})")
@@ -664,14 +520,58 @@ class KaraokeDownloader:
                     yt_dlp_path=str(self.yt_dlp_path),
                     force_refresh=False
                 )
-                # Sort by upload order (assume yt-dlp returns in order, or sort by id if available)
-                latest_videos = available_videos[:limit]
-                print(f" → Found {len(latest_videos)} latest videos for this channel.")
+                print(f" → Found {len(available_videos)} total videos for this channel.")
+                # Pre-filter: Create a set of known duplicate keys for O(1) lookup
+                known_duplicate_keys = set()
+                for song_key in server_duplicates_tracking.keys():
+                    known_duplicate_keys.add(song_key)
+                # Pre-filter videos to exclude known duplicates before processing
+                pre_filtered_videos = []
+                for video in available_videos:
+                    artist, title = extract_artist_title(video['title'])
+                    song_key = create_song_key(artist, title)
+                    if song_key not in known_duplicate_keys:
+                        pre_filtered_videos.append(video)
+                print(f" → After pre-filtering: {len(pre_filtered_videos)} videos not previously marked as duplicates.")
+                # Process videos until we reach the limit for this channel
+                filtered_videos = []
+                videos_checked = 0
+                for video in pre_filtered_videos:
+                    if len(filtered_videos) >= limit:
+                        break  # We have enough videos for this channel
+                    videos_checked += 1
+                    artist, title = extract_artist_title(video['title'])
+                    # Check if should skip this song during planning phase
+                    should_skip, reason, filtered_count = self._should_skip_song(
+                        artist, title, channel_name, video['id'], video['title'],
+                        server_songs, server_duplicates_tracking
+                    )
+                    if should_skip:
+                        total_filtered += 1
+                        if reason == "on server":
+                            total_marked += filtered_count
+                        continue
+                    filtered_videos.append(video)
+                print(f" → After processing: {len(filtered_videos)} videos to download (checked {videos_checked} videos, filtered out {videos_checked - len(filtered_videos)} already on server).")
+                total_found += len(filtered_videos)
                 channel_plans.append({
                     'channel_name': channel_name,
                     'channel_url': channel_url,
-                    'videos': latest_videos
+                    'videos': filtered_videos
                 })
+            print(f"\n📊 Summary: {total_found} videos to download across {len(channel_plans)} channels (filtered out {total_filtered} already on server, marked {total_marked} new duplicates for future skipping).")
             plan_data = {
                 'timestamp': datetime.now().isoformat(),
                 'channel_plans': channel_plans
@@ -696,8 +596,9 @@ class KaraokeDownloader:
                     safe_title = safe_title.replace(char, "")
                 safe_title = safe_title.replace("...", "").replace("..", "").replace(".", "").strip()
                 filename = f"{channel_name} - {safe_title}.mp4"
-                if len(filename) > 100:
-                    filename = f"{channel_name[:30]} - {safe_title[:60]}.mp4"
+                # Limit filename length to avoid Windows path issues
+                if len(filename) > DEFAULT_FILENAME_LENGTH_LIMIT:
+                    filename = f"{channel_name[:DEFAULT_ARTIST_LENGTH_LIMIT]} - {safe_title[:DEFAULT_TITLE_LENGTH_LIMIT]}.mp4"
                 output_path = self.downloads_dir / channel_name / filename
                 output_path.parent.mkdir(parents=True, exist_ok=True)
                 print(f" ({v_idx+1}/{len(videos)}) Downloading: {title} -> {output_path}")
@@ -717,11 +618,27 @@ class KaraokeDownloader:
                 except subprocess.CalledProcessError as e:
                     print(f" ❌ yt-dlp failed with exit code {e.returncode}")
                     print(f" ❌ yt-dlp stderr: {e.stderr}")
+                    # Mark song as failed in tracking immediately
+                    artist, title_clean = extract_artist_title(title)
+                    self._handle_download_failure(artist, title_clean, video_id, channel_name, "yt-dlp failed", f"exit code {e.returncode}: {e.stderr}")
                     continue
                 if not output_path.exists() or output_path.stat().st_size == 0:
                     print(f" ❌ Download failed or file is empty: {output_path}")
+                    # Mark song as failed in tracking immediately
+                    artist, title_clean = extract_artist_title(title)
+                    self._handle_download_failure(artist, title_clean, video_id, channel_name, "Download failed", "file does not exist or is empty")
                     continue
+                # Extract artist and title for tracking
+                artist, title_clean = extract_artist_title(title)
+                # Add ID3 tags
                 add_id3_tags(output_path, title, channel_name)
+                # Mark as downloaded in tracking system
+                file_size = output_path.stat().st_size if output_path.exists() else None
+                self.tracker.mark_song_downloaded(artist, title_clean, video_id, channel_name, output_path, file_size)
                 print(f" ✅ Downloaded and tagged: {title}")
             # After channel is done, remove it from the plan and update cache
             channel_plans[idx]['videos'] = []
@@ -738,58 +655,6 @@ class KaraokeDownloader:
         print(f"🎉 All latest videos downloaded for all channels!")
         return True
-    def _is_valid_mp4(self, file_path):
-        """Check if the file is a valid MP4 using ffprobe, if available."""
-        try:
-            cmd = ["ffprobe", "-v", "error", "-select_streams", "v:0", "-show_entries", "stream=codec_name", "-of", "default=noprint_wrappers=1:nokey=1", str(file_path)]
-            result = subprocess.run(cmd, capture_output=True, text=True, check=True)
-            return "mp4" in result.stdout or "h264" in result.stdout or "hevc" in result.stdout
-        except Exception:
-            # If ffprobe is not available, skip the check
-            return True
-    def _download_video_and_track(self, channel_name, channel_url, video_id, video_title, artist, title, filename):
-        """
-        Helper to download a single video and track its status.
-        Returns True if successful, False otherwise.
-        """
-        output_path = self.downloads_dir / channel_name / filename
-        output_path.parent.mkdir(parents=True, exist_ok=True)
-        print(f"⬇️ Downloading: {artist} - {title} -> {output_path}")
-        video_url = f"https://www.youtube.com/watch?v={video_id}"
-        dlp_cmd = [
-            str(self.yt_dlp_path),
-            "--no-check-certificates",
-            "--ignore-errors",
-            "--no-warnings",
-            "-o", str(output_path),
-            "-f", self.config["download_settings"]["format"],
-            video_url
-        ]
-        try:
-            result = subprocess.run(dlp_cmd, capture_output=True, text=True, check=True)
-            print(f"✅ yt-dlp completed successfully")
-            print(f"📄 yt-dlp stdout: {result.stdout}")
-        except subprocess.CalledProcessError as e:
-            print(f"❌ yt-dlp failed with exit code {e.returncode}")
-            print(f"❌ yt-dlp stderr: {e.stderr}")
-            return False
-        if not output_path.exists():
-            print(f"❌ Download failed: file does not exist: {output_path}")
-            return False
-        if output_path.stat().st_size == 0:
-            print(f"❌ Download failed: file is empty (0 bytes): {output_path}")
-            return False
-        # TEMP: Skipping MP4 validation for debugging
-        # if not self._is_valid_mp4(output_path):
-        #     print(f"❌ File is not a valid MP4: {output_path}")
-        #     return False
-        add_id3_tags(output_path, f"{artist} - {title} (Karaoke Version)", channel_name)
-        mark_songlist_song_downloaded(self.songlist_tracking, artist, title, channel_name, output_path)
-        print(f"✅ Downloaded and tracked: {artist} - {title}")
-        print(f"🎉 All post-processing complete for: {output_path}")
-        return True
 def reset_songlist_all():
     """Delete all files tracked in songlist_tracking.json, clear songlist_tracking.json, and remove songlist songs from karaoke_tracking.json."""
     import json
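The plan-cache key introduced in this diff adds the fuzzy settings and a short hash of the channel URLs to `plan_kwargs`. The sketch below illustrates why that matters: the hash is independent of URL order, and changing the fuzzy threshold yields a different key. The helper names `channels_fingerprint` and `build_plan_kwargs`, the example URLs, and the default threshold of 90 are assumptions made for illustration; they are not part of the commit. MD5 serves here only as a fingerprint, not as a security measure.

```python
import hashlib


def channels_fingerprint(channel_urls):
    """Return an 8-character, order-independent fingerprint of the channel URLs."""
    joined = "|".join(sorted(channel_urls))
    return hashlib.md5(joined.encode()).hexdigest()[:8]


def build_plan_kwargs(channel_urls, limit=None, fuzzy_match=False, fuzzy_threshold=90):
    """Hypothetical helper: gather every parameter that affects plan generation."""
    return {
        "limit": limit or "all",
        "channels": len(channel_urls),
        "fuzzy": fuzzy_match,
        "threshold": fuzzy_threshold,
        "channels_hash": channels_fingerprint(channel_urls),
    }


# Placeholder URLs, used only to exercise the helpers.
urls = [
    "https://www.youtube.com/@ExampleChannelA",
    "https://www.youtube.com/@ExampleChannelB",
]
order_independent = channels_fingerprint(urls) == channels_fingerprint(list(reversed(urls)))
strict = build_plan_kwargs(urls, fuzzy_match=True, fuzzy_threshold=95)
loose = build_plan_kwargs(urls, fuzzy_match=True, fuzzy_threshold=80)
print(order_independent, strict != loose)
```

Under these assumptions, a plan cached with one threshold is not reused for a run with a different threshold, while reordering the same channels still resolves to the same cache entry.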


@@ -0,0 +1,87 @@
+"""
+Fuzzy matching utilities for songlist-to-video matching.
+Handles similarity calculations and match validation.
+"""
+
+def get_similarity_function():
+    """
+    Get the best available similarity function.
+    Returns rapidfuzz if available, otherwise falls back to difflib.
+    """
+    try:
+        from rapidfuzz import fuzz
+        def similarity(a, b):
+            return fuzz.ratio(a, b)
+        return similarity
+    except ImportError:
+        import difflib
+        def similarity(a, b):
+            return int(difflib.SequenceMatcher(None, a, b).ratio() * 100)
+        return similarity
+
+def normalize_title(title):
+    """Normalize a title for comparison."""
+    normalized = title.replace("(Karaoke Version)", "").replace("(Karaoke)", "").strip()
+    return " ".join(normalized.split()).lower()
+
+def extract_artist_title(video_title):
+    """Extract artist and title from video title."""
+    if " - " in video_title:
+        parts = video_title.split(" - ", 1)
+        return parts[0].strip(), parts[1].strip()
+    return "", video_title
+
+def create_song_key(artist, title):
+    """Create a normalized key for song comparison."""
+    return f"{artist.lower()}_{normalize_title(title)}"
+
+def create_video_key(video_title):
+    """Create a normalized key for video comparison."""
+    artist, title = extract_artist_title(video_title)
+    return f"{artist.lower()}_{normalize_title(title)}"
+
+def is_fuzzy_match(songlist_artist, songlist_title, video_title, threshold=90):
+    """
+    Check if a songlist entry matches a video title using fuzzy matching.
+
+    Args:
+        songlist_artist: Artist from songlist
+        songlist_title: Title from songlist
+        video_title: YouTube video title
+        threshold: Minimum similarity score (0-100)
+
+    Returns:
+        tuple: (is_match, score) where is_match is boolean and score is the similarity score
+    """
+    similarity = get_similarity_function()
+    song_key = create_song_key(songlist_artist, songlist_title)
+    video_key = create_video_key(video_title)
+    score = similarity(song_key, video_key)
+    is_match = score >= threshold
+    return is_match, score
+
+def is_exact_match(songlist_artist, songlist_title, video_title):
+    """
+    Check if a songlist entry exactly matches a video title.
+
+    Args:
+        songlist_artist: Artist from songlist
+        songlist_title: Title from songlist
+        video_title: YouTube video title
+
+    Returns:
+        bool: True if exact match, False otherwise
+    """
+    v_artist, v_title = extract_artist_title(video_title)
+    # Check artist and title separately
+    artist_match = normalize_title(v_artist) == normalize_title(songlist_artist)
+    title_match = normalize_title(v_title) == normalize_title(songlist_title)
+    # Also check if video title matches "artist - title" format
+    full_title_match = normalize_title(video_title) == normalize_title(f"{songlist_artist} - {songlist_title}")
+    return (artist_match and title_match) or full_title_match
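As a rough illustration of how the key-based matching above behaves, the following standalone sketch re-creates the normalization and key helpers and scores a few video titles. It uses only the `difflib` fallback path, so scores may differ slightly from what `rapidfuzz` would report; the sample titles and the threshold of 90 are assumptions chosen for the example.

```python
import difflib


def normalize_title(title):
    """Strip karaoke suffixes, collapse whitespace, and lowercase."""
    normalized = title.replace("(Karaoke Version)", "").replace("(Karaoke)", "").strip()
    return " ".join(normalized.split()).lower()


def extract_artist_title(video_title):
    """Split 'Artist - Title' on the first ' - ' separator."""
    if " - " in video_title:
        artist, title = video_title.split(" - ", 1)
        return artist.strip(), title.strip()
    return "", video_title


def create_song_key(artist, title):
    """Build the normalized comparison key."""
    return f"{artist.lower()}_{normalize_title(title)}"


def similarity(a, b):
    """difflib-based score on a 0-100 scale (the fallback path only)."""
    return int(difflib.SequenceMatcher(None, a, b).ratio() * 100)


song_key = create_song_key("ABBA", "Dancing Queen")
exact_key = create_song_key(*extract_artist_title("ABBA - Dancing Queen (Karaoke Version)"))
near_key = create_song_key(*extract_artist_title("Abba - Dancing Queens (Karaoke)"))
other_key = create_song_key(*extract_artist_title("Queen - Bohemian Rhapsody"))

exact_score = similarity(song_key, exact_key)
near_score = similarity(song_key, near_key)
other_score = similarity(song_key, other_key)
print(exact_score, near_score, other_score)
```

With a threshold of 90, the first two titles would count as matches and the third would not; lowering the threshold trades fewer missed songs for more false positives.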


@@ -0,0 +1,86 @@
+"""
+Server management utilities.
+Handles server songs loading and server duplicates tracking.
+"""
+import json
+from pathlib import Path
+from datetime import datetime
+
+def load_server_songs(songs_path="data/songs.json"):
+    """Load the list of songs already available on the server."""
+    songs_file = Path(songs_path)
+    if not songs_file.exists():
+        print(f"⚠️ Server songs file not found: {songs_path}")
+        return set()
+    try:
+        with open(songs_file, 'r', encoding='utf-8') as f:
+            data = json.load(f)
+        server_songs = set()
+        for song in data:
+            if "artist" in song and "title" in song:
+                artist = song["artist"].strip()
+                title = song["title"].strip()
+                key = f"{artist.lower()}_{normalize_title(title)}"
+                server_songs.add(key)
+        print(f"📋 Loaded {len(server_songs)} songs from server (songs.json)")
+        return server_songs
+    except (json.JSONDecodeError, FileNotFoundError) as e:
+        print(f"⚠️ Could not load server songs: {e}")
+        return set()
+
+def is_song_on_server(server_songs, artist, title):
+    """Check if a song is already available on the server."""
+    key = f"{artist.lower()}_{normalize_title(title)}"
+    return key in server_songs
+
+def load_server_duplicates_tracking(tracking_path="data/server_duplicates_tracking.json"):
+    """Load the tracking of songs found to be duplicates on the server."""
+    tracking_file = Path(tracking_path)
+    if not tracking_file.exists():
+        return {}
+    try:
+        with open(tracking_file, 'r', encoding='utf-8') as f:
+            return json.load(f)
+    except (json.JSONDecodeError, FileNotFoundError) as e:
+        print(f"⚠️ Could not load server duplicates tracking: {e}")
+        return {}
+
+def save_server_duplicates_tracking(tracking, tracking_path="data/server_duplicates_tracking.json"):
+    """Save the tracking of songs found to be duplicates on the server."""
+    try:
+        with open(tracking_path, 'w', encoding='utf-8') as f:
+            json.dump(tracking, f, indent=2, ensure_ascii=False)
+    except Exception as e:
+        print(f"⚠️ Could not save server duplicates tracking: {e}")
+
+def is_song_marked_as_server_duplicate(tracking, artist, title):
+    """Check if a song has been marked as a server duplicate."""
+    key = f"{artist.lower()}_{normalize_title(title)}"
+    return key in tracking
+
+def mark_song_as_server_duplicate(tracking, artist, title, video_title, channel_name):
+    """Mark a song as a server duplicate for future skipping."""
+    key = f"{artist.lower()}_{normalize_title(title)}"
+    tracking[key] = {
+        "artist": artist,
+        "title": title,
+        "video_title": video_title,
+        "channel": channel_name,
+        "marked_at": datetime.now().isoformat(),
+        "reason": "already_on_server"
+    }
+    save_server_duplicates_tracking(tracking)
+
+def check_and_mark_server_duplicate(server_songs, server_duplicates_tracking, artist, title, video_title, channel_name):
+    """Check if a song is on server and mark it as duplicate if so. Returns True if it's a duplicate."""
+    if is_song_on_server(server_songs, artist, title):
+        if not is_song_marked_as_server_duplicate(server_duplicates_tracking, artist, title):
+            mark_song_as_server_duplicate(server_duplicates_tracking, artist, title, video_title, channel_name)
+        return True
+    return False
+
+def normalize_title(title):
+    """Normalize a title for consistent key generation."""
+    normalized = title.replace("(Karaoke Version)", "").replace("(Karaoke)", "").strip()
+    return " ".join(normalized.split()).lower()
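The check-then-mark flow above can be tried without touching the JSON files. The sketch below is a simplified in-memory variant: `song_key` and `check_and_mark` are hypothetical stand-ins for the module's functions, persistence is deliberately left out, and the song and channel names are made up for the example.

```python
from datetime import datetime


def normalize_title(title):
    """Strip karaoke suffixes, collapse whitespace, and lowercase."""
    normalized = title.replace("(Karaoke Version)", "").replace("(Karaoke)", "").strip()
    return " ".join(normalized.split()).lower()


def song_key(artist, title):
    """Hypothetical helper mirroring the key format used by the module."""
    return f"{artist.lower()}_{normalize_title(title)}"


def check_and_mark(server_songs, tracking, artist, title, video_title, channel_name):
    """In-memory variant: return True and record the song if it is on the server."""
    key = song_key(artist, title)
    if key not in server_songs:
        return False
    # setdefault keeps the first record, so repeated checks do not overwrite it.
    tracking.setdefault(key, {
        "artist": artist,
        "title": title,
        "video_title": video_title,
        "channel": channel_name,
        "marked_at": datetime.now().isoformat(),
        "reason": "already_on_server",
    })
    return True


server_songs = {song_key("Queen", "Bohemian Rhapsody")}
tracking = {}

on_server = check_and_mark(
    server_songs, tracking, "Queen", "Bohemian Rhapsody (Karaoke Version)",
    "Queen - Bohemian Rhapsody (Karaoke Version)", "ExampleChannel",
)
not_on_server = check_and_mark(
    server_songs, tracking, "Adele", "Hello", "Adele - Hello", "ExampleChannel",
)
print(on_server, not_on_server, sorted(tracking))
```

Because the tracking dictionary is keyed by the normalized song key, a later run can skip a known duplicate with a single dictionary lookup instead of rechecking the server list.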


@@ -1,6 +1,15 @@
+"""
+Songlist management utilities.
+Handles songlist loading, tracking, and songlist-specific operations.
+"""
 import json
 from pathlib import Path
 from datetime import datetime
+from karaoke_downloader.server_manager import (
+    load_server_songs, is_song_on_server, load_server_duplicates_tracking,
+    check_and_mark_server_duplicate, is_song_marked_as_server_duplicate
+)
 def load_songlist(songlist_path="data/songList.json"):
     songlist_file = Path(songlist_path)
@@ -68,31 +77,4 @@ def mark_songlist_song_downloaded(tracking, artist, title, channel_name, file_pa
         "file_path": str(file_path),
         "downloaded_at": datetime.now().isoformat()
     }
     save_songlist_tracking(tracking)
-def load_server_songs(songs_path="data/songs.json"):
-    """Load the list of songs already available on the server."""
-    songs_file = Path(songs_path)
-    if not songs_file.exists():
-        print(f"⚠️ Server songs file not found: {songs_path}")
-        return set()
-    try:
-        with open(songs_file, 'r', encoding='utf-8') as f:
-            data = json.load(f)
-        server_songs = set()
-        for song in data:
-            if "artist" in song and "title" in song:
-                artist = song["artist"].strip()
-                title = song["title"].strip()
-                key = f"{artist.lower()}_{normalize_title(title)}"
-                server_songs.add(key)
-        print(f"📋 Loaded {len(server_songs)} songs from server (songs.json)")
-        return server_songs
-    except (json.JSONDecodeError, FileNotFoundError) as e:
-        print(f"⚠️ Could not load server songs: {e}")
-        return set()
-def is_song_on_server(server_songs, artist, title):
-    """Check if a song is already available on the server."""
-    key = f"{artist.lower()}_{normalize_title(title)}"
-    return key in server_songs


@ -135,6 +135,105 @@ class TrackingManager:
} }
return json.dumps(report, indent=2, ensure_ascii=False) return json.dumps(report, indent=2, ensure_ascii=False)
    def is_song_downloaded(self, artist, title, channel_name=None, video_id=None):
        """
        Check if a song has already been downloaded by this system.
        Returns True if the song exists in tracking with DOWNLOADED or CONVERTED status.
        """
        # If we have video_id and channel_name, try direct key lookup first (most efficient)
        if video_id and channel_name:
            song_key = f"{video_id}@{channel_name}"
            if song_key in self.data['songs']:
                song_data = self.data['songs'][song_key]
                if song_data.get('status') in [SongStatus.DOWNLOADED, SongStatus.CONVERTED]:
                    return True
        # Fallback to content search (for cases where we don't have video_id)
        for song_id, song_data in self.data['songs'].items():
            # Check if this song matches the artist and title
            if song_data.get('artist') == artist and song_data.get('title') == title:
                # Check if it's marked as downloaded
                if song_data.get('status') in [SongStatus.DOWNLOADED, SongStatus.CONVERTED]:
                    return True
            # Also check the video title field which might contain the song info
            video_title = song_data.get('video_title', '')
            if video_title and artist in video_title and title in video_title:
                if song_data.get('status') in [SongStatus.DOWNLOADED, SongStatus.CONVERTED]:
                    return True
        return False
    def is_file_exists(self, file_path):
        """
        Check if a file already exists on the filesystem.
        """
        return Path(file_path).exists()
    def is_song_failed(self, artist, title, channel_name=None, video_id=None):
        """
        Check if a song has previously failed to download.
        Returns True if the song exists in tracking with FAILED status.
        """
        # If we have video_id and channel_name, try direct key lookup first (most efficient)
        if video_id and channel_name:
            song_key = f"{video_id}@{channel_name}"
            if song_key in self.data['songs']:
                song_data = self.data['songs'][song_key]
                if song_data.get('status') == SongStatus.FAILED:
                    return True
        # Fallback to content search (for cases where we don't have video_id)
        for song_id, song_data in self.data['songs'].items():
            # Check if this song matches the artist and title
            if song_data.get('artist') == artist and song_data.get('title') == title:
                # Check if it's marked as failed
                if song_data.get('status') == SongStatus.FAILED:
                    return True
            # Also check the video title field which might contain the song info
            video_title = song_data.get('video_title', '')
            if video_title and artist in video_title and title in video_title:
                if song_data.get('status') == SongStatus.FAILED:
                    return True
        return False
    def mark_song_downloaded(self, artist, title, video_id, channel_name, file_path, file_size=None):
        """
        Mark a song as downloaded in the tracking system.
        """
        # Use the existing tracking structure: video_id@channel_name
        song_key = f"{video_id}@{channel_name}"
        self.data['songs'][song_key] = {
            'artist': artist,
            'title': title,
            'video_id': video_id,
            'channel_name': channel_name,
            'video_title': f"{artist} - {title}",
            'file_path': str(file_path),
            'file_size': file_size,
            'status': SongStatus.DOWNLOADED,
            'last_updated': datetime.now().isoformat()
        }
        self._save()
    def mark_song_failed(self, artist, title, video_id, channel_name, error_message=None):
        """
        Mark a song as failed in the tracking system.
        """
        # Use the existing tracking structure: video_id@channel_name
        song_key = f"{video_id}@{channel_name}"
        self.data['songs'][song_key] = {
            'artist': artist,
            'title': title,
            'video_id': video_id,
            'channel_name': channel_name,
            'video_title': f"{artist} - {title}",
            'status': SongStatus.FAILED,
            'error_message': error_message,
            'last_updated': datetime.now().isoformat()
        }
        self._save()
    def get_channel_video_list(self, channel_url, yt_dlp_path="downloader/yt-dlp.exe", force_refresh=False):
        """
        Return a list of videos (dicts with 'title' and 'id') for the channel, using cache if available unless force_refresh is True.
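`is_song_downloaded` and `is_song_failed` share one lookup shape: a direct `video_id@channel_name` key lookup, then a scan over all entries. The following is a rough sketch of how that shared shape could be factored into one helper. It is not what this commit does. `has_status` is a hypothetical name, and the string values of `SongStatus` are assumed for illustration because the real definition is not part of this diff.

```python
class SongStatus:
    # Assumed values for illustration; the project's SongStatus is defined elsewhere.
    DOWNLOADED = "DOWNLOADED"
    CONVERTED = "CONVERTED"
    FAILED = "FAILED"

DONE_STATUSES = (SongStatus.DOWNLOADED, SongStatus.CONVERTED)
FAILED_STATUSES = (SongStatus.FAILED,)

def has_status(songs, statuses, artist, title, channel_name=None, video_id=None):
    """Return True if a matching tracked song has one of the given statuses."""
    # Fast path: direct lookup by the composite tracking key.
    if video_id and channel_name:
        entry = songs.get(f"{video_id}@{channel_name}")
        if entry and entry.get("status") in statuses:
            return True
    # Fallback: scan every entry, matching on artist/title or on the video title.
    for entry in songs.values():
        if entry.get("status") not in statuses:
            continue
        if entry.get("artist") == artist and entry.get("title") == title:
            return True
        video_title = entry.get("video_title", "")
        if video_title and artist in video_title and title in video_title:
            return True
    return False

songs = {
    "abc123@ChannelA": {
        "artist": "Queen",
        "title": "Bohemian Rhapsody",
        "video_title": "Queen - Bohemian Rhapsody",
        "status": SongStatus.DOWNLOADED,
    }
}
print(has_status(songs, DONE_STATUSES, "Queen", "Bohemian Rhapsody", "ChannelA", "abc123"))
print(has_status(songs, FAILED_STATUSES, "Queen", "Bohemian Rhapsody"))
```

With a helper like this, each public method would reduce to a single call with the relevant status tuple, and the two checks could not diverge by accident.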


@ -0,0 +1,327 @@
"""
Core video download logic and file validation.
Handles the actual downloading and post-processing of videos.
"""
import subprocess
from pathlib import Path
from karaoke_downloader.id3_utils import add_id3_tags
from karaoke_downloader.songlist_manager import mark_songlist_song_downloaded
from karaoke_downloader.download_planner import save_plan_cache
# Constants
DEFAULT_FILENAME_LENGTH_LIMIT = 100
DEFAULT_ARTIST_LENGTH_LIMIT = 30
DEFAULT_TITLE_LENGTH_LIMIT = 60
DEFAULT_FORMAT_CHECK_TIMEOUT = 30
def sanitize_filename(artist, title):
    """
    Create a safe filename from artist and title.
    Removes invalid characters and limits length.
    """
    # Create a shorter, safer filename
    safe_title = title.replace("(From ", "").replace(")", "").replace(" - ", " ").replace(":", "").replace("'", "").replace('"', "")
    safe_artist = artist.replace("'", "").replace('"', "")
    # Remove all Windows-invalid characters
    invalid_chars = ['?', ':', '*', '"', '<', '>', '|', '/', '\\']
    for char in invalid_chars:
        safe_title = safe_title.replace(char, "")
        safe_artist = safe_artist.replace(char, "")
    # Also remove any other potentially problematic characters
    safe_title = safe_title.replace("...", "").replace("..", "").replace(".", "").strip()
    safe_artist = safe_artist.strip()
    filename = f"{safe_artist} - {safe_title}.mp4"
    # Limit filename length to avoid Windows path issues
    if len(filename) > DEFAULT_FILENAME_LENGTH_LIMIT:
        filename = f"{safe_artist[:DEFAULT_ARTIST_LENGTH_LIMIT]} - {safe_title[:DEFAULT_TITLE_LENGTH_LIMIT]}.mp4"
    return filename
def is_valid_mp4(file_path):
    """
    Check if a file is a valid MP4 file.
    Uses ffprobe if available, otherwise checks file extension and size.
    """
    if not file_path.exists():
        return False
    # An empty file can never be a valid MP4
    if file_path.stat().st_size == 0:
        return False
    # Try to use ffprobe for validation (subprocess is imported at module level)
    try:
        subprocess.run(
            ['ffprobe', '-v', 'quiet', '-print_format', 'json', '-show_format', str(file_path)],
            capture_output=True,
            text=True,
            check=True
        )
        return True
    except subprocess.CalledProcessError:
        # ffprobe ran and rejected the file, so it is not a readable media file
        return False
    except FileNotFoundError:
        # ffprobe is not installed: fall back to the extension check
        # (the size was already verified above)
        return file_path.suffix.lower() == '.mp4'
def download_video_and_track(yt_dlp_path, config, downloads_dir, songlist_tracking,
                             channel_name, channel_url, video_id, video_title,
                             artist, title, filename):
    """
    Download a single video and track its status.
    Returns True if successful, False otherwise.
    """
    output_path = downloads_dir / channel_name / filename
    return download_single_video(
        output_path, video_id, config, yt_dlp_path,
        artist, title, channel_name, songlist_tracking
    )
def download_single_video(output_path, video_id, config, yt_dlp_path,
                          artist, title, channel_name, songlist_tracking):
    """Download a single video and handle post-processing."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    print(f"⬇️ Downloading: {artist} - {title} -> {output_path}")
    video_url = f"https://www.youtube.com/watch?v={video_id}"
    dlp_cmd = [
        str(yt_dlp_path),
        "--no-check-certificates",
        "--ignore-errors",
        "--no-warnings",
        "-o", str(output_path),
        "-f", config["download_settings"]["format"],
        video_url
    ]
    print(f"🔧 Running command: {' '.join(dlp_cmd)}")
    print(f"📺 Resolution settings: {config.get('download_settings', {}).get('preferred_resolution', 'Unknown')}")
    print(f"🎬 Format string: {config.get('download_settings', {}).get('format', 'Unknown')}")
    # Debug: Show available formats (optional)
    if config.get('debug_show_formats', False):
        show_available_formats(yt_dlp_path, video_url)
    try:
        result = subprocess.run(dlp_cmd, capture_output=True, text=True, check=True)
        print("✅ yt-dlp completed successfully")
        print(f"📄 yt-dlp stdout: {result.stdout}")
    except subprocess.CalledProcessError as e:
        print(f"❌ yt-dlp failed with exit code {e.returncode}")
        print(f"❌ yt-dlp stderr: {e.stderr}")
        # Mark song as failed in tracking
        error_msg = f"yt-dlp failed with exit code {e.returncode}: {e.stderr}"
        _mark_song_failed_standalone(artist, title, video_id, channel_name, error_msg)
        return False
    # Verify download
    if not verify_download(output_path, artist, title, video_id, channel_name):
        return False
    # Post-processing
    add_id3_tags(output_path, f"{artist} - {title} (Karaoke Version)", channel_name)
    mark_songlist_song_downloaded(songlist_tracking, artist, title, channel_name, output_path)
    print(f"✅ Downloaded and tracked: {artist} - {title}")
    print(f"🎉 All post-processing complete for: {output_path}")
    return True
def _mark_song_failed_standalone(artist, title, video_id, channel_name, error_message):
    """Standalone helper to mark a song as failed in tracking."""
    from karaoke_downloader.tracking_manager import TrackingManager
    tracker = TrackingManager()
    tracker.mark_song_failed(artist, title, video_id, channel_name, error_message)
    print(f"🏷️ Marked song as failed: {artist} - {title}")
def show_available_formats(yt_dlp_path, video_url):
    """Show available formats for debugging."""
    print(f"🔍 Checking available formats for: {video_url}")
    format_cmd = [
        str(yt_dlp_path),
        "--list-formats",
        video_url
    ]
    try:
        format_result = subprocess.run(format_cmd, capture_output=True, text=True, timeout=DEFAULT_FORMAT_CHECK_TIMEOUT)
        print(f"📋 Available formats:\n{format_result.stdout}")
    except Exception as e:
        print(f"⚠️ Could not check formats: {e}")
def verify_download(output_path, artist, title, video_id=None, channel_name=None):
    """Verify that the download was successful."""
    if not output_path.exists():
        print(f"❌ Download failed: file does not exist: {output_path}")
        # Check if yt-dlp saved it somewhere else
        possible_files = list(output_path.parent.glob("*.mp4"))
        if possible_files:
            print(f"🔍 Found these files in the directory: {[f.name for f in possible_files]}")
            # Look for a file that matches our pattern (artist - title)
            artist_part = artist.lower()
            title_part = title.lower()
            for file in possible_files:
                file_lower = file.stem.lower()
                if artist_part in file_lower and any(word in file_lower for word in title_part.split()):
                    print(f"🎯 Found matching file: {file.name}")
                    output_path = file
                    break
            else:
                # for/else: runs only when the loop finished without a break
                print(f"❌ No matching file found for: {artist} - {title}")
                # Mark song as failed if we have the required info
                if video_id and channel_name:
                    error_msg = "Download failed: file does not exist and no matching file found"
                    _mark_song_failed_standalone(artist, title, video_id, channel_name, error_msg)
                return False
        else:
            # Mark song as failed if we have the required info
            if video_id and channel_name:
                error_msg = "Download failed: file does not exist"
                _mark_song_failed_standalone(artist, title, video_id, channel_name, error_msg)
            return False
    if output_path.stat().st_size == 0:
        print(f"❌ Download failed: file is empty (0 bytes): {output_path}")
        return False
    # Optional MP4 validation
    # if not is_valid_mp4(output_path):
    #     print(f"❌ File is not a valid MP4: {output_path}")
    #     return False
    return True
def execute_download_plan(download_plan, unmatched, cache_file, config, yt_dlp_path,
                          downloads_dir, songlist_tracking, limit=None):
    """
    Execute a download plan with progress tracking and cache management.
    Args:
        download_plan: List of download items to process
        unmatched: List of unmatched songs
        cache_file: Path to cache file for progress tracking
        config: Configuration dictionary
        yt_dlp_path: Path to yt-dlp executable
        downloads_dir: Directory for downloads
        songlist_tracking: Songlist tracking data
        limit: Optional limit on number of downloads
    Returns:
        tuple: (downloaded_count, success)
    """
    downloaded_count = 0
    total_to_download = limit if limit is not None else len(download_plan)
    for idx, item in enumerate(download_plan[:]):  # Iterate over a copy so the plan can shrink safely
        if limit is not None and downloaded_count >= limit:
            break
        artist = item['artist']
        title = item['title']
        channel_name = item['channel_name']
        channel_url = item['channel_url']
        video_id = item['video_id']
        video_title = item['video_title']
        print(f"\n⬇️ Downloading {idx+1} of {total_to_download}:")
        print(f" 📋 Songlist: {artist} - {title}")
        print(f" 🎬 Video: {video_title} ({channel_name})")
        if 'match_score' in item:
            print(f" 🎯 Match Score: {item['match_score']:.1f}%")
        # Create filename
        filename = sanitize_filename(artist, title)
        output_path = downloads_dir / channel_name / filename
        # Download the file
        success = download_single_video(
            output_path, video_id, config, yt_dlp_path,
            artist, title, channel_name, songlist_tracking
        )
        if success:
            downloaded_count += 1
            # Remove the completed item by value, not by index: idx counts
            # positions in the copy, which stop matching download_plan after
            # the first removal (pop(idx) would drop the wrong item or raise
            # IndexError).
            download_plan.remove(item)
            save_plan_cache(cache_file, download_plan, unmatched)
            print(f"🗑️ Removed completed item from download plan. {len(download_plan)} items remaining.")
            # Delete cache if all items are complete
            if len(download_plan) == 0:
                cleanup_cache(cache_file)
    print(f"🎉 Downloaded {downloaded_count} songlist songs.")
    print(f"📊 Summary: Found {downloaded_count} songs, {len(unmatched)} songs not found.")
    # Final cleanup
    cleanup_cache(cache_file)
    return downloaded_count, True
def cleanup_cache(cache_file):
    """Clean up the cache file."""
    if cache_file.exists():
        try:
            cache_file.unlink()
            print(f"🗑️ Deleted download plan cache: {cache_file.name}")
        except Exception as e:
            print(f"⚠️ Could not delete download plan cache: {e}")
def should_skip_song_standalone(artist, title, channel_name, video_id, video_title, downloads_dir, tracker=None, server_songs=None, server_duplicates_tracking=None):
    """
    Standalone function to check if a song should be skipped.
    Performs four checks in order:
    1. Already downloaded (tracking) - if tracker provided
    2. File exists on filesystem
    3. Already on server - if server data provided
    4. Previously failed download (bad file) - if tracker provided
    Returns:
        tuple: (should_skip, reason, total_filtered)
    """
    total_filtered = 0
    # Check 1: Already downloaded by this system (if tracker provided)
    if tracker and tracker.is_song_downloaded(artist, title, channel_name, video_id):
        return True, "already downloaded", total_filtered
    # Check 2: File already exists on filesystem
    # Generate the expected filename based on the download mode context
    safe_title = title
    invalid_chars = ['?', ':', '*', '"', '<', '>', '|', '/', '\\']
    for char in invalid_chars:
        safe_title = safe_title.replace(char, "")
    safe_title = safe_title.replace("...", "").replace("..", "").replace(".", "").strip()
    # Try different filename patterns that might exist
    possible_filenames = [
        f"{artist} - {safe_title}.mp4",  # Songlist mode
        f"{channel_name} - {safe_title}.mp4",  # Latest-per-channel mode
        f"{artist} - {safe_title} (Karaoke Version).mp4"  # Channel videos mode
    ]
    for filename in possible_filenames:
        if len(filename) > DEFAULT_FILENAME_LENGTH_LIMIT:
            # Apply length limits if needed
            safe_artist = artist.replace("'", "").replace('"', "").strip()
            filename = f"{safe_artist[:DEFAULT_ARTIST_LENGTH_LIMIT]} - {safe_title[:DEFAULT_TITLE_LENGTH_LIMIT]}.mp4"
        output_path = downloads_dir / channel_name / filename
        if output_path.exists() and output_path.stat().st_size > 0:
            return True, "file exists", total_filtered
    # Check 3: Already on server (if server data provided)
    if server_songs is not None and server_duplicates_tracking is not None:
        from karaoke_downloader.server_manager import check_and_mark_server_duplicate
        if check_and_mark_server_duplicate(server_songs, server_duplicates_tracking, artist, title, video_title, channel_name):
            total_filtered += 1
            return True, "on server", total_filtered
    # Check 4: Previously failed download (bad file) - if tracker provided
    if tracker and tracker.is_song_failed(artist, title, channel_name, video_id):
        return True, "previously failed", total_filtered
    return False, None, total_filtered
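`execute_download_plan` above shrinks the plan while looping over a copy of it. In that pattern, an index taken from the copy stops lining up with the live list after the first removal, so removing by value is the safer choice. A small self-contained sketch of this follows. `drain_plan` and the `process` callback are hypothetical names used only for this illustration.

```python
def drain_plan(plan, process):
    """Process each item once; drop the items that succeed from the live plan."""
    completed = 0
    for item in list(plan):      # iterate over a snapshot
        if process(item):
            plan.remove(item)    # remove by value, so positions cannot drift
            completed += 1
    return completed

plan = [{"video_id": "a"}, {"video_id": "b"}, {"video_id": "c"}, {"video_id": "d"}]
# Pretend every download succeeds except "b".
done = drain_plan(plan, lambda item: item["video_id"] != "b")
print(done, plan)
```

Only the failed item stays in the plan, which is the state a resumable cache should record. With index-based `pop` on the same four-item input and all downloads succeeding, the third removal would target index 2 of a list that has shrunk to two items.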