Signed-off-by: mbrucedogs <mbrucedogs@gmail.com>
This commit is contained in: parent 981f92ce95, commit 613b64601a

PRD.md (55 lines changed)
@@ -273,6 +273,7 @@ The codebase has been comprehensively refactored to improve maintainability and
- [x] **Parallel downloads for improved speed** ✅ **COMPLETED**
- [x] **Enhanced fuzzy matching with improved video title parsing** ✅ **COMPLETED**
- [x] **Consolidated extract_artist_title function** ✅ **COMPLETED**
- [x] **Duplicate file prevention and filename consistency** ✅ **COMPLETED**
- [ ] Unit tests for all modules
- [ ] Integration tests for end-to-end workflows
- [ ] Plugin system for custom file operations
@@ -304,3 +305,57 @@ The codebase has been comprehensively refactored to improve maintainability and
- **Consistent behavior**: All parts of the system use the same parsing logic, eliminating edge cases where different modules would parse the same title differently
- **Improved performance**: The `--limit` parameter now works as expected, providing faster processing for targeted downloads
- **Cleaner codebase**: Eliminated duplicate code and import conflicts, making the system more maintainable

## 🔧 Recent Bug Fixes & Improvements (v3.4.2)

### **Duplicate File Prevention & Filename Consistency**
- **Enhanced file existence checking**: `check_file_exists_with_patterns()` now detects files with `(2)`, `(3)`, etc. suffixes that yt-dlp creates
- **Automatic duplicate prevention**: The download pipeline skips downloads when files already exist (including duplicates)
- **Updated yt-dlp configuration**: Set `"nooverwrites": false` to stop yt-dlp from creating duplicate files with suffixes
- **Cleanup utility**: `data/cleanup_duplicate_files.py` provides interactive cleanup of existing duplicate files
- **Filename vs. ID3 tag consistency**: Removed the "(Karaoke Version)" suffix from ID3 tags so they match filenames exactly
- **Unified parsing**: Both filename generation and ID3 tagging use the same artist/title extraction logic
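
The suffix detection described above can be sketched as follows. This is a minimal standalone version; the real `check_file_exists_with_patterns()` additionally applies artist/title length limits and verifies non-zero file size:

```python
from pathlib import Path

def file_exists_with_suffixes(channel_dir: Path, filename: str) -> bool:
    """Return True if filename or a '(2)', '(3)', ... duplicate of it exists."""
    if (channel_dir / filename).exists():
        return True
    base_name = filename.replace(".mp4", "")
    # yt-dlp-style duplicates look like "Artist - Title (2).mp4"
    for suffix in range(2, 10):  # check up to (9), as the real helper does
        if (channel_dir / f"{base_name} ({suffix}).mp4").exists():
            return True
    return False
```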

### **Benefits of Duplicate Prevention**
- **No more duplicate files**: Eliminates `(2)`, `(3)` suffix files that waste disk space
- **Consistent metadata**: Filename and ID3 tag use an identical artist/title format
- **Efficient disk usage**: Prevents unnecessary downloads of existing files
- **Clear file identification**: Consistent naming across all file operations

## 🛠️ Maintenance

### **Regular Cleanup**
- Run the cleanup utility periodically to remove any duplicate files
- Monitor downloads for new duplicate creation (this should be rare with the fixes in place)

### **Configuration**
- Keep `"nooverwrites": false` in `data/config.json`
- This prevents yt-dlp from creating duplicate files
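
The relevant fragment of `data/config.json` looks like this (only the yt-dlp options block is shown, with its neighboring keys for context; the rest of the file is omitted):

```json
{
  "writethumbnail": false,
  "embed_metadata": false,
  "continuedl": true,
  "nooverwrites": false,
  "ignoreerrors": true,
  "no_warnings": false
}
```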

### **Monitoring**
- Check logs for "⏭️ Skipping download - file already exists" messages
- These indicate that duplicate prevention is working correctly
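
For example, you can count skip events with `grep`, assuming the tool's output is captured to a log file (the `download.log` path here is an illustrative assumption):

```shell
# Write a sample log line, then count duplicate-prevention skips.
# In practice, point grep at wherever you redirect the downloader's output.
printf '%s\n' "⏭️ Skipping download - file already exists: Artist - Song.mp4" > download.log
grep -c "Skipping download - file already exists" download.log
```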

## 📚 Documentation Standards

### **Documentation Location**
- **All changes, refactoring, and improvements should be documented in the PRD.md and README.md files**
- **Do NOT create separate .md files for documenting changes, refactoring, or improvements**
- **Use the existing sections in PRD.md and README.md to track all project evolution**

### **Where to Document Changes**
- **PRD.md**: Technical details, architecture changes, bug fixes, and implementation specifics
- **README.md**: User-facing features, usage instructions, and high-level improvements
- **CHANGELOG.md**: Version-specific release notes and change summaries

### **Documentation Requirements**
- **All new features must be documented in both PRD.md and README.md**
- **All refactoring efforts must be documented in the appropriate sections**
- **All bug fixes must be documented with technical details**
- **Version numbers and dates should be clearly marked**
- **Benefits and improvements should be explicitly stated**

### **Maintenance Responsibility**
- **Keep PRD.md and README.md synchronized with code changes**
- **Update documentation immediately when implementing new features**
- **Remove outdated information and consolidate related changes**
- **Ensure all CLI options and features are documented in both files**

README.md (54 lines changed)
@@ -22,6 +22,8 @@ A Python-based Windows CLI tool to download karaoke videos from YouTube channels
- 🏷️ **Server Duplicates Tracking**: Automatically checks against the local songs.json file and marks duplicates for future skipping, preventing re-downloads of songs already on the server
- ⚡ **Parallel Downloads**: Enable concurrent downloads with `--parallel --workers N` for significantly faster batch downloads (3-5x speedup)
- 📊 **Unmatched Songs Reports**: Generate detailed reports of songs that couldn't be found in any channel with `--generate-unmatched-report`
- 🛡️ **Duplicate File Prevention**: Automatically detects and prevents duplicate files with `(2)`, `(3)` suffixes, with a cleanup utility for existing duplicates
- 🏷️ **Consistent Metadata**: Filename and ID3 tag use an identical artist/title format for clear file identification

## 🏗️ Architecture
The codebase has been comprehensively refactored into a modular architecture with centralized utilities for improved maintainability, error handling, and code reuse:

@@ -99,6 +101,33 @@ The codebase has been comprehensively refactored into a modular architecture wit
- **Fixed import conflicts**: Resolved inconsistencies between different parsing implementations
- **Single source of truth**: All title parsing logic is now centralized in `fuzzy_matcher.py`
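
A rough sketch of what such a centralized parser might look like; the separator and suffix rules below are assumptions for illustration, not the actual implementation in `fuzzy_matcher.py`:

```python
import re
from typing import Tuple

# Suffixes commonly appended to karaoke video titles (assumed list).
_KARAOKE_SUFFIX = re.compile(r"\s*\((?:karaoke(?: version)?|lyrics)\)\s*$", re.IGNORECASE)

def extract_artist_title(video_title: str) -> Tuple[str, str]:
    """Split 'Artist - Song (Karaoke Version)' into ('Artist', 'Song')."""
    cleaned = _KARAOKE_SUFFIX.sub("", video_title).strip()
    artist, sep, title = cleaned.partition(" - ")
    if not sep:
        # No separator found: treat the whole string as the title
        return "", cleaned
    return artist.strip(), title.strip()
```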

## 🛡️ Duplicate File Prevention & Filename Consistency (v3.4.2)

### **Duplicate File Prevention**
- **Enhanced file existence checking**: Now detects files with `(2)`, `(3)`, etc. suffixes that yt-dlp creates
- **Automatic duplicate prevention**: Skips downloads when files already exist (including duplicates)
- **Updated yt-dlp configuration**: Set `"nooverwrites": false` to prevent yt-dlp from creating duplicate files
- **Cleanup utility**: `data/cleanup_duplicate_files.py` helps identify and remove existing duplicate files

### **Filename vs. ID3 Tag Consistency**
- **Consistent metadata**: Filename and ID3 tag now use an identical artist/title format
- **Removed extra suffixes**: No more "(Karaoke Version)" in ID3 tags that doesn't match the filename
- **Unified parsing**: Both filename generation and ID3 tagging use the same artist/title extraction
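
The idea can be sketched in a few lines: one formatting rule feeds both the filename and the ID3 title, so they cannot drift apart. The helper name here is hypothetical; the actual pipeline formats the ID3 title as `f"{artist} - {title}"`:

```python
from typing import Tuple

def build_names(artist: str, title: str) -> Tuple[str, str]:
    # Hypothetical helper: a single formatting rule produces both
    # the .mp4 filename and the ID3 tag title.
    display = f"{artist} - {title}"
    return f"{display}.mp4", display
```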

### **Benefits**
- ✅ **No more duplicate files** with `(2)`, `(3)` suffixes
- ✅ **Consistent metadata** between filename and ID3 tags
- ✅ **Efficient disk usage** by preventing unnecessary downloads
- ✅ **Clear file identification** with consistent naming

### **Clean Up Existing Duplicates**
```bash
# Run the cleanup utility to find and remove existing duplicates
python data/cleanup_duplicate_files.py

# Choose option 1 for a dry run (recommended first)
# Choose option 2 to actually delete duplicates
```

## 📋 Requirements
- **Windows 10/11**
- **Python 3.7+**

@@ -370,6 +399,31 @@ python download_karaoke.py --generate-unmatched-report --fuzzy-match --fuzzy-thr

> **🔄 Maintenance Note**: The `commands.txt` file should be kept up to date with any CLI changes. When adding new command-line options or modifying existing ones, update this file to reflect all available commands and their usage.

## 📚 Documentation Standards

### **Documentation Location**
- **All changes, refactoring, and improvements should be documented in the PRD.md and README.md files**
- **Do NOT create separate .md files for documenting changes, refactoring, or improvements**
- **Use the existing sections in PRD.md and README.md to track all project evolution**

### **Where to Document Changes**
- **PRD.md**: Technical details, architecture changes, bug fixes, and implementation specifics
- **README.md**: User-facing features, usage instructions, and high-level improvements
- **CHANGELOG.md**: Version-specific release notes and change summaries

### **Documentation Requirements**
- **All new features must be documented in both PRD.md and README.md**
- **All refactoring efforts must be documented in the appropriate sections**
- **All bug fixes must be documented with technical details**
- **Version numbers and dates should be clearly marked**
- **Benefits and improvements should be explicitly stated**

### **Maintenance Responsibility**
- **Keep PRD.md and README.md synchronized with code changes**
- **Update documentation immediately when implementing new features**
- **Remove outdated information and consolidate related changes**
- **Ensure all CLI options and features are documented in both files**

## 🔧 Refactoring Improvements (v3.3)
The codebase has been comprehensively refactored to improve maintainability and reduce code duplication. Recent improvements have enhanced reliability, performance, and code organization:

data/cleanup_duplicate_files.py (164 lines, new file)
@@ -0,0 +1,164 @@
```python
#!/usr/bin/env python3
"""
Utility script to identify and clean up duplicate files with (2), (3) suffixes.
This helps clean up files that were created before the duplicate prevention was implemented.
"""

import re
from pathlib import Path
from typing import Dict, List, Tuple


def find_duplicate_files(downloads_dir: str = "downloads") -> Dict[str, List[Tuple[Path, int]]]:
    """
    Find duplicate files with (2), (3), etc. suffixes in the downloads directory.

    Args:
        downloads_dir: Path to downloads directory

    Returns:
        Dictionary mapping base filenames to lists of (path, suffix_number) tuples
    """
    downloads_path = Path(downloads_dir)
    if not downloads_path.exists():
        print(f"❌ Downloads directory not found: {downloads_dir}")
        return {}

    duplicates = {}

    # Scan all MP4 files in the downloads directory
    for mp4_file in downloads_path.rglob("*.mp4"):
        filename = mp4_file.name

        # Check if this is a duplicate file with (2), (3), etc.
        match = re.match(r'^(.+?)\s*\((\d+)\)\.mp4$', filename)
        if match:
            base_name = match.group(1)
            suffix_num = int(match.group(2))

            if base_name not in duplicates:
                duplicates[base_name] = []

            duplicates[base_name].append((mp4_file, suffix_num))

    # Sort duplicates by suffix number
    for base_name in duplicates:
        duplicates[base_name].sort(key=lambda x: x[1])

    return duplicates


def analyze_duplicates(duplicates: Dict[str, List[Tuple[Path, int]]]) -> None:
    """
    Analyze and display information about found duplicates.

    Args:
        duplicates: Dictionary of duplicate files
    """
    if not duplicates:
        print("✅ No duplicate files found!")
        return

    print(f"🔍 Found {len(duplicates)} sets of duplicate files:")
    print()

    total_duplicates = 0
    for base_name, files in duplicates.items():
        print(f"📁 {base_name}")
        for file_path, suffix in files:
            file_size = file_path.stat().st_size / (1024 * 1024)  # MB
            print(f"   ({suffix}) {file_path.name} - {file_size:.1f} MB")
        print()
        total_duplicates += len(files) - 1  # -1 because we keep the original

    print(f"📊 Summary: {len(duplicates)} base files with {total_duplicates} duplicate files")


def cleanup_duplicates(duplicates: Dict[str, List[Tuple[Path, int]]], dry_run: bool = True) -> None:
    """
    Clean up duplicate files, keeping only the first occurrence.

    Args:
        duplicates: Dictionary of duplicate files
        dry_run: If True, only show what would be deleted without actually deleting
    """
    if not duplicates:
        print("✅ No duplicates to clean up!")
        return

    mode = "DRY RUN" if dry_run else "ACTUAL CLEANUP"
    print(f"🧹 Starting {mode}...")
    print()

    total_deleted = 0
    total_size_freed = 0

    for base_name, files in duplicates.items():
        print(f"📁 Processing: {base_name}")

        # Keep the first file (lowest suffix number), delete the rest
        files_to_delete = files[1:]  # Skip the first file

        for file_path, suffix in files_to_delete:
            file_size = file_path.stat().st_size / (1024 * 1024)  # MB

            if dry_run:
                print(f"   🗑️ Would delete: {file_path.name} ({file_size:.1f} MB)")
            else:
                try:
                    file_path.unlink()
                    print(f"   ✅ Deleted: {file_path.name} ({file_size:.1f} MB)")
                    total_deleted += 1
                    total_size_freed += file_size
                except Exception as e:
                    print(f"   ❌ Failed to delete {file_path.name}: {e}")

        print()

    if dry_run:
        print(f"📊 DRY RUN SUMMARY: Would delete {len([f for files in duplicates.values() for f in files[1:]])} files")
    else:
        print(f"📊 CLEANUP SUMMARY: Deleted {total_deleted} files, freed {total_size_freed:.1f} MB")


def main():
    """Main function to run the duplicate file cleanup."""
    print("🎵 Karaoke Video Downloader - Duplicate File Cleanup")
    print("=" * 50)
    print()

    # Find duplicates
    duplicates = find_duplicate_files()

    if not duplicates:
        print("✅ No duplicate files found!")
        return

    # Analyze duplicates
    analyze_duplicates(duplicates)
    print()

    # Ask user what to do
    while True:
        print("Options:")
        print("1. Dry run (show what would be deleted)")
        print("2. Actually delete duplicate files")
        print("3. Exit without doing anything")

        choice = input("\nEnter your choice (1-3): ").strip()

        if choice == "1":
            cleanup_duplicates(duplicates, dry_run=True)
            break
        elif choice == "2":
            confirm = input("⚠️ Are you sure you want to delete duplicate files? (yes/no): ").strip().lower()
            if confirm in ["yes", "y"]:
                cleanup_duplicates(duplicates, dry_run=False)
            else:
                print("❌ Cleanup cancelled.")
            break
        elif choice == "3":
            print("❌ Exiting without cleanup.")
            break
        else:
            print("❌ Invalid choice. Please enter 1, 2, or 3.")


if __name__ == "__main__":
    main()
```

data/config.json
@@ -19,7 +19,7 @@
     "writethumbnail": false,
     "embed_metadata": false,
     "continuedl": true,
-    "nooverwrites": true,
+    "nooverwrites": false,
     "ignoreerrors": true,
     "no_warnings": false
   },

@@ -20,6 +20,12 @@ from karaoke_downloader.youtube_utils import (
     execute_yt_dlp_command,
     show_available_formats,
 )
+from karaoke_downloader.file_utils import (
+    cleanup_temp_files,
+    get_unique_filename,
+    is_valid_mp4_file,
+    sanitize_filename,
+)


 class DownloadPipeline:

@@ -63,9 +69,15 @@
         True if successful, False otherwise
         """
         try:
-            # Step 1: Prepare file path
-            filename = sanitize_filename(artist, title)
-            output_path = self.downloads_dir / channel_name / filename
+            # Step 1: Prepare file path and check for existing files
+            output_path, file_exists = get_unique_filename(self.downloads_dir, channel_name, artist, title)
+
+            if file_exists:
+                print(f"⏭️ Skipping download - file already exists: {output_path.name}")
+                # Still add tags and track the existing file
+                if self._add_tags(output_path, artist, title, channel_name):
+                    self._track_download(output_path, artist, title, video_id, channel_name)
+                return True

             # Step 2: Download video
             if not self._download_video(video_id, output_path, artist, title, channel_name):

@@ -214,8 +226,10 @@
     ) -> bool:
         """Step 3: Add ID3 tags to the downloaded file."""
         try:
+            # Use the same artist/title as the filename for consistency
+            # Don't add "(Karaoke Version)" to the ID3 tag title
             add_id3_tags(
-                output_path, f"{artist} - {title} (Karaoke Version)", channel_name
+                output_path, f"{artist} - {title}", channel_name
             )
             print(f"🏷️ Added ID3 tags: {artist} - {title}")
             return True
@@ -112,6 +112,7 @@ def check_file_exists_with_patterns(
) -> Tuple[bool, Optional[Path]]:
    """
    Check if a file exists using multiple possible filename patterns.
    Also checks for files with (2), (3), etc. suffixes that yt-dlp might create.

    Args:
        downloads_dir: Base downloads directory

@@ -132,13 +133,51 @@
    safe_title = sanitize_title_for_filenames(title)
    filename = f"{safe_artist[:DEFAULT_ARTIST_LENGTH_LIMIT]} - {safe_title[:DEFAULT_TITLE_LENGTH_LIMIT]}.mp4"

    # Check for exact filename match
    file_path = channel_dir / filename
    if file_path.exists() and file_path.stat().st_size > 0:
        return True, file_path

    # Check for files with (2), (3), etc. suffixes
    base_name = filename.replace(".mp4", "")
    for suffix in range(2, 10):  # Check up to (9)
        suffixed_filename = f"{base_name} ({suffix}).mp4"
        suffixed_path = channel_dir / suffixed_filename
        if suffixed_path.exists() and suffixed_path.stat().st_size > 0:
            return True, suffixed_path

    return False, None


def get_unique_filename(
    downloads_dir: Path, channel_name: str, artist: str, title: str
) -> Tuple[Path, bool]:
    """
    Get a unique filename for download, checking for existing files including duplicates.

    Args:
        downloads_dir: Base downloads directory
        channel_name: Channel name
        artist: Song artist
        title: Song title

    Returns:
        Tuple of (file_path, is_existing) where is_existing indicates whether a file already exists
    """
    filename = sanitize_filename(artist, title)
    channel_dir = downloads_dir / channel_name
    file_path = channel_dir / filename

    # Check if the file already exists
    exists, existing_path = check_file_exists_with_patterns(downloads_dir, channel_name, artist, title)

    if exists and existing_path:
        print(f"📁 File already exists: {existing_path.name}")
        return existing_path, True

    return file_path, False


def ensure_directory_exists(directory: Path) -> None:
    """
    Ensure a directory exists, creating it if necessary.