Signed-off-by: mbrucedogs <mbrucedogs@gmail.com>

mbrucedogs 2025-07-26 17:13:49 -05:00
parent 148fd2a141
commit 3969d75f0f
8 changed files with 694 additions and 451 deletions

99
PRD.md

@ -42,6 +42,105 @@
These principles are fundamental to the project's long-term success and must be applied consistently throughout development.
### 2.2 Documentation Requirements
**CRITICAL REQUIREMENT:** All code changes, feature additions, or modifications MUST be accompanied by corresponding updates to the project documentation:
- **PRD.md Updates:** Any changes to project requirements, architecture, or functionality must be reflected in this document
- **README.md Updates:** User-facing features, installation instructions, or usage changes must be documented
- **Code Comments:** Significant logic changes should include inline documentation
- **API Documentation:** New endpoints, functions, or interfaces must be documented
**Documentation Update Checklist:**
- [ ] Update PRD.md with any architectural or requirement changes
- [ ] Update README.md with new features, installation steps, or usage instructions
- [ ] Add inline comments for complex logic or business rules
- [ ] Update any configuration examples or file structure documentation
- [ ] Review and update implementation status sections
This documentation requirement is mandatory and ensures the project remains maintainable and accessible to future developers and users.
### 2.3 Code Quality & Development Standards
**MANDATORY STANDARDS:** The following standards must be followed to ensure code quality, maintainability, and AI-friendly development:
#### **Naming Conventions**
- **Files:** Use descriptive, lowercase names with underscores (`song_matcher.py`, `priority_manager.py`)
- **Classes:** PascalCase (`SongMatcher`, `PreferencesManager`)
- **Functions/Methods:** snake_case (`process_songs`, `get_priority_order`)
- **Constants:** UPPER_SNAKE_CASE (`MAX_FILE_SIZE`, `DEFAULT_CHANNEL_PRIORITY`)
- **Variables:** snake_case with descriptive names (`song_collection`, `duplicate_count`)
#### **Code Structure Standards**
- **Function Length:** Maximum 50 lines per function (aim for 20-30 lines)
- **Class Length:** Maximum 300 lines per class (aim for 100-200 lines)
- **File Length:** Maximum 500 lines per file (aim for 200-400 lines)
- **Indentation:** 4 spaces (no tabs)
- **Line Length:** Maximum 120 characters
- **Import Organization:** Group imports: standard library, third-party, local (alphabetical within groups)
#### **Error Handling & Logging**
- **Exception Handling:** Always catch specific exception types; never use a bare `except:`
- **Logging:** Use Python's `logging` module with appropriate levels (DEBUG, INFO, WARNING, ERROR)
- **User Feedback:** Provide clear, actionable error messages
- **Graceful Degradation:** Handle missing files/configs gracefully with sensible defaults
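A minimal sketch of how these four rules might combine when reading a JSON config file. The function name `load_config` and the `DEFAULT_CONFIG` contents are assumptions made for illustration; they are not taken from the project's code.

```python
import copy
import json
import logging
from pathlib import Path
from typing import Any, Dict

logger = logging.getLogger(__name__)

# Hypothetical defaults, used only for this illustration.
DEFAULT_CONFIG: Dict[str, Any] = {"channel_priorities": []}


def load_config(path: str) -> Dict[str, Any]:
    """Load a JSON config file, falling back to defaults on failure."""
    config_path = Path(path)
    try:
        with config_path.open("r", encoding="utf-8") as f:
            loaded = json.load(f)
    except FileNotFoundError:
        # Missing file: degrade gracefully and tell the user what happened.
        logger.warning("Config %s not found; using defaults", config_path)
        return copy.deepcopy(DEFAULT_CONFIG)
    except json.JSONDecodeError as exc:
        # Malformed file: report where parsing failed, then use defaults.
        logger.error("Config %s is not valid JSON: %s", config_path, exc)
        return copy.deepcopy(DEFAULT_CONFIG)
    if not isinstance(loaded, dict):
        logger.error("Config %s must contain a JSON object", config_path)
        return copy.deepcopy(DEFAULT_CONFIG)
    # Values from the file override the defaults.
    return {**copy.deepcopy(DEFAULT_CONFIG), **loaded}
```

Each `except` clause names one failure mode, so an unexpected error (for example a permissions problem) still surfaces instead of being silently swallowed.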
#### **Type Hints & Documentation**
- **Type Hints:** Use Python type hints for all function parameters and return values
- **Docstrings:** Include docstrings for all public functions, classes, and modules
- **Docstring Format:** Use Google-style docstrings with parameter descriptions
- **Complex Logic:** Add inline comments explaining business logic and algorithms
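As an illustration of the type-hint and Google-style docstring rules, here is a small hypothetical helper. `build_group_key` is an assumed name, and the `|` separator is chosen only because the matching code in this commit splits group keys on that character; the project's real key-building logic may differ.

```python
def build_group_key(artist: str, title: str, case_sensitive: bool = False) -> str:
    """Build a grouping key for an artist/title pair.

    Args:
        artist: Song artist name.
        title: Song title.
        case_sensitive: If False, both parts are lowercased before joining.

    Returns:
        The two parts, stripped of surrounding whitespace, joined by ``|``.
    """
    parts = [artist.strip(), title.strip()]
    if not case_sensitive:
        # Case-insensitive grouping treats "Queen" and "QUEEN" as one artist.
        parts = [part.lower() for part in parts]
    return "|".join(parts)
```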
#### **Testing Standards**
- **Unit Tests:** Write unit tests for all business logic functions
- **Test Coverage:** Aim for 80%+ code coverage
- **Test Organization:** Mirror the source code structure in test files
- **Test Data:** Use fixtures and factories for test data; never hardcode test values
- **Integration Tests:** Test complete workflows and API endpoints
#### **Configuration Management**
- **Environment Variables:** Use environment variables for sensitive data (API keys, paths)
- **Config Validation:** Validate configuration on startup with clear error messages
- **Default Values:** Provide sensible defaults for all configuration options
- **Config Documentation:** Document all configuration options with examples
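Startup validation could look roughly like the sketch below. The keys `channel_priorities` and `matching.fuzzy_threshold` come from the configuration examples in the README; the function name `validate_config` and the exact messages are assumptions.

```python
from typing import Any, Dict, List


def validate_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems: List[str] = []

    priorities = config.get("channel_priorities", [])
    if not isinstance(priorities, list) or not all(
        isinstance(name, str) for name in priorities
    ):
        problems.append("channel_priorities must be a list of folder names (strings)")

    matching = config.get("matching", {})
    if not isinstance(matching, dict):
        problems.append("matching must be an object")
    else:
        threshold = matching.get("fuzzy_threshold", 0.8)
        # bool is a subclass of int, so reject it explicitly.
        is_number = isinstance(threshold, (int, float)) and not isinstance(threshold, bool)
        if not is_number or not 0.0 <= threshold <= 1.0:
            problems.append("matching.fuzzy_threshold must be a number between 0.0 and 1.0")

    return problems
```

Returning every problem at once, rather than raising on the first, lets the caller print one complete, actionable error report before exiting.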
#### **Performance & Scalability**
- **Memory Efficiency:** Process large datasets in chunks, avoid loading everything into memory
- **Progress Indicators:** Show progress for long-running operations
- **Caching:** Implement appropriate caching for expensive operations
- **Async Operations:** Use async/await for I/O operations where beneficial
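One way to follow the chunking guideline is a small generator that yields fixed-size batches from any iterable. `iter_chunks` is a hypothetical helper, not a function from this codebase.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_chunks(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        # islice pulls at most `size` items without materializing the rest.
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

Because only one chunk is held in memory at a time, the same loop is also a natural place to update a progress indicator after each batch.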
#### **Security Best Practices**
- **Input Validation:** Validate and sanitize all user inputs
- **File Operations:** Use `pathlib` for safe file path handling
- **JSON Safety:** Use `json.loads()` with proper error handling
- **No Hardcoded Secrets:** Never commit API keys, passwords, or sensitive data
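The file-path and JSON rules can be sketched as two small helpers. Both names (`resolve_inside`, `parse_json_object`) are assumptions for illustration. The path check uses `Path.relative_to` rather than `is_relative_to` so that it also runs on Python 3.7, the minimum version the README mentions.

```python
import json
from pathlib import Path
from typing import Any, Dict


def resolve_inside(base_dir: str, relative_path: str) -> Path:
    """Resolve `relative_path` under `base_dir`, rejecting paths that escape it."""
    base = Path(base_dir).resolve()
    candidate = (base / relative_path).resolve()
    try:
        # relative_to raises ValueError when candidate is not under base.
        candidate.relative_to(base)
    except ValueError:
        raise ValueError(f"{relative_path!r} escapes {base}") from None
    return candidate


def parse_json_object(text: str) -> Dict[str, Any]:
    """Parse `text` as a JSON object, raising ValueError with a clear message."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Invalid JSON: {exc}") from exc
    if not isinstance(value, dict):
        raise ValueError("Expected a JSON object at the top level")
    return value
```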
#### **Version Control Standards**
- **Commit Messages:** Use conventional commit format (`feat:`, `fix:`, `docs:`, `refactor:`)
- **Branch Naming:** Use descriptive branch names (`feature/priority-management`, `fix/duplicate-detection`)
- **Pull Requests:** Require code review for all changes
- **Git Hooks:** Use pre-commit hooks for linting and formatting
#### **Dependency Management**
- **Requirements:** Keep `requirements.txt` updated with exact versions
- **Virtual Environments:** Always use virtual environments for development
- **Dependency Updates:** Regularly update dependencies and test compatibility
- **Minimal Dependencies:** Only include necessary dependencies, avoid bloat
#### **Code Review Checklist**
- [ ] Code follows naming conventions
- [ ] Functions are appropriately sized and focused
- [ ] Error handling is comprehensive
- [ ] Type hints and docstrings are present
- [ ] Tests are included for new functionality
- [ ] Configuration is properly validated
- [ ] No hardcoded values or secrets
- [ ] Performance considerations addressed
- [ ] Documentation is updated
These standards ensure the codebase remains clean, maintainable, and accessible to both human developers and AI assistants.
---
## 3. Data Handling & Matching Logic

518
README.md

@ -1,39 +1,103 @@
# Karaoke Song Library Cleanup Tool
A powerful command-line tool for analyzing, deduplicating, and cleaning up large karaoke song collections. The tool identifies duplicate songs across different formats (MP3, MP4) and generates a "skip list" for future imports, helping you maintain a clean and organized karaoke library.
A comprehensive tool for analyzing, deduplicating, and cleaning up large karaoke song collections. The tool identifies duplicate songs across different formats and generates a "skip list" for future imports.
## 🎯 Features
## Features
- **Smart Duplicate Detection**: Identifies duplicate songs by artist and title
- **MP3 Pairing Logic**: Automatically pairs CDG and MP3 files with the same base filename as single karaoke song units (CDG files are treated as MP3)
- **Multi-Format Support**: Handles MP3 and MP4 files with intelligent priority system
- **Channel Priority System**: Configurable priority for MP4 channels based on folder names in file paths
- **Non-Destructive**: Only generates skip lists - never deletes or moves files
- **Detailed Reporting**: Comprehensive statistics and analysis reports
- **Flexible Configuration**: Customizable matching rules and output options
- **Performance Optimized**: Handles large libraries (37,000+ songs) efficiently
- **Future-Ready**: Designed for easy expansion to web UI
### Core Functionality
- **Song Deduplication**: Identifies duplicate songs based on artist + title matching
- **Multi-Format Support**: Handles MP3, CDG, and MP4 files
- **CDG/MP3 Pairing**: Treats CDG and MP3 files with the same base filename as single karaoke units
- **Channel Priority**: For MP4 files, prioritizes based on folder names in the path
- **Fuzzy Matching**: Configurable fuzzy matching for artist/title comparison
## 📁 Project Structure
### File Type Priority System
1. **MP4 files** (with channel priority sorting)
2. **CDG/MP3 pairs** (treated as single units)
3. **Standalone MP3** files
4. **Standalone CDG** files
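The four-level ordering above can be expressed as a ranking function over the set of file extensions that share one base filename. This is only a sketch under that assumption: `format_rank` is a hypothetical name, the tool itself groups files with `find_mp3_pairs`, and the channel-priority sorting among MP4 files is deliberately left out.

```python
from typing import Iterable


def format_rank(extensions: Iterable[str]) -> int:
    """Rank a group of same-named files; lower numbers win."""
    exts = {ext.lower().lstrip(".") for ext in extensions}
    if "mp4" in exts:
        return 0  # MP4 files come first
    if {"cdg", "mp3"} <= exts:
        return 1  # a CDG/MP3 pair counts as one unit
    if "mp3" in exts:
        return 2  # standalone MP3
    if "cdg" in exts:
        return 3  # standalone CDG
    return 4  # unknown formats sort last


# Example: order candidate groups from most to least preferred.
candidates = [{".cdg"}, {".mp3"}, {".cdg", ".mp3"}, {".mp4"}]
ordered = sorted(candidates, key=format_rank)
```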
### Web UI Features
- **Interactive Table View**: Sortable, filterable grid of duplicate songs
- **Bulk Selection**: Select multiple items for batch operations
- **Search & Filter**: Real-time search across artists, titles, and paths
- **Responsive Design**: Mobile-friendly interface
- **Easy Startup**: Automated dependency checking and browser launch
### 🆕 Drag-and-Drop Priority Management
- **Visual Priority Reordering**: Drag and drop files within each duplicate group to change their priority
- **Persistent Preferences**: Save your priority preferences for future CLI runs
- **Priority Indicators**: Visual numbered indicators show the current priority order
- **Reset Functionality**: Easily reset to default priorities if needed
## Installation
1. Clone the repository
2. Install dependencies:
```bash
pip install -r requirements.txt
```
## Usage
### CLI Tool
Run the main CLI tool:
```bash
python cli/main.py
```
Options:
- `--verbose`: Enable verbose output
- `--save-reports`: Generate detailed analysis reports
- `--dry-run`: Show what would be done without making changes
### Web UI
Start the web interface:
```bash
python start_web_ui.py
```
The web UI will automatically:
1. Check for required dependencies
2. Start the Flask server
3. Open your default browser to the interface
### Priority Preferences
The web UI now supports drag-and-drop priority management:
1. **Reorder Files**: Click the "Details" button for any duplicate group, then drag files to reorder them
2. **Save Preferences**: Click "Save Priority Preferences" to store your choices
3. **Apply to CLI**: Future CLI runs will automatically use your saved preferences
4. **Reset**: Use "Reset Priorities" to restore default behavior
Your preferences are saved in `data/preferences/priority_preferences.json` and will be automatically loaded by the CLI tool.
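Judging from how `cli/preferences.py` in this commit reads the file, the preferences appear to be a JSON object that maps an `"<artist> - <title>"` key to a list of file paths in priority order. The sketch below builds and round-trips such a structure; the MP3 path is a made-up example, and the exact keys the web UI writes are not shown in this excerpt.

```python
import json
from typing import Dict, List

# Key format mirrors f"{artist} - {title}" from PreferencesManager.get_priority_order.
preferences: Dict[str, List[str]] = {
    "Queen - Bohemian Rhapsody": [
        "z://MP4\\Sing King Karaoke\\Queen - Bohemian Rhapsody (Karaoke Version).mp4",
        "z://MP3\\Queen - Bohemian Rhapsody.mp3",  # hypothetical path
    ]
}

# Serialize the way the save endpoint does (indent=2, ensure_ascii=False).
serialized = json.dumps(preferences, indent=2, ensure_ascii=False)
restored = json.loads(serialized)
```

The first path in each list is the version to keep; any file in the duplicate group that is not listed keeps its original relative order after the listed ones.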
## Configuration
Edit `config/config.json` to customize:
- Channel priorities for MP4 files
- Matching settings (fuzzy matching, thresholds)
- Output options
## File Structure
```
KaraokeMerge/
├── data/
│ ├── allSongs.json # Input: Your song library data
│ ├── skipSongs.json # Output: Generated skip list
│ ├── preferences/ # User priority preferences
│ │ └── priority_preferences.json
│ └── reports/ # Detailed analysis reports
│ ├── analysis_data.json
│ ├── actionable_insights_report.txt
│ ├── channel_optimization_report.txt
│ ├── duplicate_pattern_report.txt
│ ├── enhanced_summary_report.txt
│ ├── skip_list_summary.txt
│ └── skip_songs_detailed.json
├── config/
│ └── config.json # Configuration settings
├── cli/
│ ├── main.py # Main CLI application
│ ├── matching.py # Song matching logic
│ ├── preferences.py # Priority preferences manager
│ ├── report.py # Report generation
│ └── utils.py # Utility functions
├── web/ # Web UI for manual review
@ -43,415 +107,35 @@ KaraokeMerge/
├── start_web_ui.py # Web UI startup script
├── test_tool.py # Validation and testing script
├── requirements.txt # Python dependencies
├── .gitignore # Git ignore rules
├── PRD.md # Product Requirements Document
└── README.md # This file
└── README.md # Project documentation
```
## 🚀 Quick Start
## Data Requirements
### Prerequisites
- Python 3.7 or higher
- Your karaoke song data in JSON format (see Data Format section)
### Required Data File
**Important**: You need to provide your own `data/allSongs.json` file. This file is excluded from version control due to its large size and personal nature.
**Sample `allSongs.json` format:**
Place your song library data in `data/allSongs.json` with the following format:
```json
[
{
"artist": "ACDC",
"title": "Shot In The Dark",
"path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4",
"guid": "8946008c-7acc-d187-60e6-5286e55ad502",
"disabled": false,
"favorite": false
},
{
"artist": "Queen",
"title": "Bohemian Rhapsody",
"path": "z://MP4\\Sing King Karaoke\\Queen - Bohemian Rhapsody (Karaoke Version).mp4",
"guid": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"disabled": false,
"favorite": true
"artist": "Artist Name",
"title": "Song Title",
"path": "path/to/file.mp3"
}
]
```
**Required fields:**
- `artist`: Song artist name
- `title`: Song title
- `path`: Full file path to the song file
- `guid`: Unique identifier for the song
## Performance
**Optional fields:**
- `disabled`: Boolean indicating if song is disabled (default: false)
- `favorite`: Boolean indicating if song is favorited (default: false)
Successfully tested with:
- 37,015 songs
- 12,424 duplicates (33.6% duplicate rate)
- 10,998 unique files after deduplication
### Installation
## Contributing
1. Clone or download this repository
2. Navigate to the project directory
3. Ensure your `data/allSongs.json` file is in place
### Basic Usage
```bash
# Run with default settings
python cli/main.py
# Enable verbose output
python cli/main.py --verbose
# Dry run (analyze without generating skip list)
python cli/main.py --dry-run
# Save detailed reports
python cli/main.py --save-reports
# Test the tool functionality
python test_tool.py
# Start the web UI for manual review
python start_web_ui.py
### Command Line Options
| Option | Description | Default |
|--------|-------------|---------|
| `--config` | Path to configuration file | `../config/config.json` |
| `--input` | Path to input songs file | `../data/allSongs.json` |
| `--output-dir` | Directory for output files | `../data` |
| `--verbose, -v` | Enable verbose output | `False` |
| `--dry-run` | Analyze without generating skip list | `False` |
| `--save-reports` | Save detailed reports to files | `False` |
| `--show-config` | Show current configuration and exit | `False` |
## 📊 Data Format
### Input Format (`allSongs.json`)
Your song data should be a JSON array with objects containing at least these fields:
```json
[
{
"artist": "ACDC",
"title": "Shot In The Dark",
"path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4",
"guid": "8946008c-7acc-d187-60e6-5286e55ad502",
"disabled": false,
"favorite": false
}
]
```
### Output Format (`skipSongs.json`)
The tool generates a skip list with this structure:
```json
[
{
"path": "z://MP4\\ACDC - Shot In The Dark (Instrumental).mp4",
"reason": "duplicate",
"artist": "ACDC",
"title": "Shot In The Dark",
"kept_version": "z://MP4\\Sing King Karaoke\\ACDC - Shot In The Dark (Karaoke Version).mp4"
}
]
```
**Skip List Features:**
- **Metadata**: Each skip entry includes artist, title, and the path of the kept version
- **Reason Tracking**: Documents why each file was marked for skipping
- **Complete Information**: Provides full context for manual review if needed
## ⚙️ Configuration
Edit `config/config.json` to customize the tool's behavior:
### Channel Priorities (MP4 files)
```json
{
"channel_priorities": [
"Sing King Karaoke",
"KaraFun Karaoke",
"Stingray Karaoke"
]
}
```
**Note**: Channel priorities are now folder names found in the song's `path` property. The tool searches for these exact folder names within the file path to determine priority.
### Matching Settings
```json
{
"matching": {
"fuzzy_matching": false,
"fuzzy_threshold": 0.8,
"case_sensitive": false
}
}
```
### Output Settings
```json
{
"output": {
"verbose": false,
"include_reasons": true,
"max_duplicates_per_song": 10
}
}
```
## 🌐 Web UI for Manual Review
The project includes a web interface for interactive review of duplicate songs:
### Starting the Web UI
```bash
python start_web_ui.py
```
This script will:
- Check for required dependencies (Flask)
- Install missing dependencies automatically
- Validate required data files exist
- Start the web server
- Open your browser automatically
### Web UI Features
- **Interactive Table**: Sortable, filterable grid of duplicate songs
- **Bulk Selection**: Select multiple items for batch operations
- **Real-time Search**: Filter by artist, title, or file path
- **Responsive Design**: Works on desktop and mobile devices
- **Detailed Information**: View full metadata for each duplicate
### Web UI Requirements
- Flask web framework (automatically installed if missing)
- Generated skip list data (`data/skipSongs.json`)
- Configuration file (`config/config.json`)
## 📈 Understanding the Output
### Summary Report
- **Total songs processed**: Total number of songs analyzed
- **Unique songs found**: Number of unique artist-title combinations
- **Duplicates identified**: Number of duplicate songs found
- **File type breakdown**: Distribution across MP3, CDG, MP4 formats
- **Channel breakdown**: MP4 channel distribution (if applicable)
### Skip List
The generated `skipSongs.json` contains paths to files that should be skipped during future imports. Each entry includes:
- `path`: File path to skip
- `reason`: Why the file was marked for skipping (usually "duplicate")
## 🔧 Advanced Features
### Multi-Artist Handling
The tool automatically handles songs with multiple artists using various delimiters:
- `feat.`, `ft.`, `featuring`
- `&`, `and`
- `,`, `;`, `/`
### File Type Priority System
The tool uses a sophisticated priority system to select the best version of each song:
1. **MP4 files are always preferred** when available
- Searches for configured folder names within the file path
- Sorts by configured priority order (first in list = highest priority)
- Keeps the highest priority MP4 version
2. **CDG/MP3 pairs** are treated as single units
- Automatically pairs CDG and MP3 files with the same base filename
- Example: `song.cdg` + `song.mp3` = one complete karaoke song
- Only considered if no MP4 files exist for the same artist/title
3. **Standalone files** are lowest priority
- Standalone MP3 files (without matching CDG)
- Standalone CDG files (without matching MP3)
4. **Manual review candidates**
- Songs without matching folder names in channel priorities
- Ambiguous cases requiring human decision
### CDG/MP3 Pairing Logic
The tool automatically identifies and pairs CDG/MP3 files:
- **Base filename matching**: Files with identical names but different extensions
- **Single unit treatment**: Paired files are considered one complete karaoke song
- **Accurate duplicate detection**: Prevents treating paired files as separate duplicates
- **Proper priority handling**: Ensures complete songs compete fairly with MP4 versions
### Enhanced Analysis & Reporting
Use `--save-reports` to generate comprehensive analysis files:
**📊 Enhanced Reports:**
- `enhanced_summary_report.txt`: Comprehensive analysis with detailed statistics
- `channel_optimization_report.txt`: Channel priority optimization suggestions
- `duplicate_pattern_report.txt`: Duplicate pattern analysis by artist, title, and channel
- `actionable_insights_report.txt`: Recommendations and actionable insights
- `analysis_data.json`: Raw analysis data for further processing
**📋 Legacy Reports:**
- `summary_report.txt`: Basic overall statistics
- `duplicate_details.txt`: Detailed duplicate analysis (verbose mode only)
- `skip_list_summary.txt`: Skip list breakdown
- `skip_songs_detailed.json`: Full skip data with metadata
**🔍 Analysis Features:**
- **Pattern Analysis**: Identifies most duplicated artists, titles, and channels
- **Channel Optimization**: Suggests optimal channel priority order based on effectiveness
- **Storage Insights**: Quantifies space savings potential and duplicate distribution
- **Actionable Recommendations**: Provides specific suggestions for library optimization
## 🛠️ Development
### Project Structure for Expansion
The codebase is designed for easy expansion:
- **Modular Design**: Separate modules for matching, reporting, and utilities
- **Configuration-Driven**: Easy to modify behavior without code changes
- **Web UI Implementation**: Full web interface for manual review and bulk operations
- **Testing Framework**: Built-in test tool for validation and debugging
- **Dependency Management**: Automated setup and dependency checking
### Testing and Validation
Use the built-in test tool to validate your setup:
```bash
python test_tool.py
```
This will:
- Test all module imports
- Validate configuration loading
- Test with a sample of your song data
- Verify report generation
- Provide feedback on any issues
### Adding New Features
1. **New File Formats**: Add extensions to `config.json`
2. **New Matching Rules**: Extend `SongMatcher` class in `matching.py`
3. **New Reports**: Add methods to `ReportGenerator` class
4. **Web UI Enhancements**: Extend `web/app.py` and `web/templates/index.html`
5. **Testing**: Add test cases to `test_tool.py`
## 🎯 Current Status
### ✅ **Completed Features**
- **Core CLI Tool**: Fully functional with comprehensive duplicate detection
- **CDG/MP3 Pairing**: Intelligent pairing logic for accurate karaoke song handling
- **Channel Priority System**: Configurable MP4 channel priorities based on folder names
- **Skip List Generation**: Complete skip list with metadata and reasoning
- **Performance Optimization**: Handles large libraries (37,000+ songs) efficiently
- **Enhanced Analysis & Reporting**: Comprehensive statistical analysis with actionable insights
- **Pattern Analysis**: Skip list pattern analysis and channel optimization suggestions
- **Web UI**: Interactive web interface for manual review and bulk operations
- **Testing & Validation**: Test tool for functionality validation and debugging
- **Dependency Management**: Automated dependency checking and installation
- **Project Documentation**: Comprehensive .gitignore and updated documentation
### 🚀 **Ready for Use**
The tool is production-ready and has successfully processed a large karaoke library:
- Generated skip list for 10,998 unique duplicate files (after removing 1,426 duplicate entries)
- Identified 33.6% duplicate rate with significant space savings potential
- Provided complete metadata for informed decision-making
- **Bug Fix**: Resolved duplicate entries in skip list generation
## 🔮 Future Roadmap
### Phase 2: Enhanced Analysis & Reporting ✅
- ✅ Generate detailed analysis reports (`--save-reports` functionality)
- ✅ Analyze MP4 files without channel priorities to suggest new folder names
- ✅ Create comprehensive duplicate analysis reports
- ✅ Add statistical insights and trends
- ✅ Pattern analysis and channel optimization suggestions
### Phase 3: Web Interface ✅
- ✅ Interactive table/grid for duplicate review
- ✅ Bulk actions and manual overrides
- ✅ Real-time filtering and search
- ✅ Responsive design for mobile/desktop
- ✅ Easy startup with dependency checking
- [ ] Embedded media player for preview
- [ ] Real-time configuration editing
- [ ] Advanced export capabilities
### Phase 4: Advanced Features
- Audio fingerprinting for better duplicate detection
- Integration with karaoke software APIs
- Batch processing and automation
- Advanced fuzzy matching algorithms
## 🤝 Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request
## 📝 License
This project is open source. Feel free to use, modify, and distribute according to your needs.
## 🆘 Troubleshooting
### Common Issues
**"File not found" errors**
- Ensure `data/allSongs.json` exists and is readable
- Check file paths in your song data
**"Invalid JSON" errors**
- Validate your JSON syntax using an online validator
- Check for missing commas or brackets
**Memory issues with large libraries**
- The tool is optimized for large datasets
- Consider running with `--dry-run` first to test
### Getting Help
1. **Test your setup**: Run `python test_tool.py` to validate everything is working
2. **Check configuration**: Use `python cli/main.py --show-config` to verify settings
3. **Verbose output**: Run with `--verbose` for detailed information
4. **Dry run**: Use `--dry-run` to test without generating files
5. **Web UI**: Start `python start_web_ui.py` for interactive review
## 📊 Performance & Results
The tool is optimized for large karaoke libraries and has been tested with real-world data:
### **Performance Optimizations:**
- **Memory Efficient**: Processes songs in batches
- **Fast Matching**: Optimized algorithms for duplicate detection
- **Progress Indicators**: Real-time feedback for large operations
- **Scalable**: Handles libraries with 100,000+ songs
### **Real-World Results:**
- **Successfully processed**: 37,015 songs
- **Duplicate detection**: 12,424 duplicates identified (33.6% duplicate rate)
- **File type distribution**: 45.8% MP3, 71.8% MP4 (some songs have multiple formats)
- **Channel analysis**: 14,698 MP4s with defined priorities, 11,881 without
- **Processing time**: Optimized for large datasets with progress tracking
### **Space Savings Potential:**
- **Significant storage optimization** through intelligent duplicate removal
- **Quality preservation** by keeping highest priority versions
- **Complete metadata** for informed decision-making
---
**Happy karaoke organizing! 🎤🎵**
This project follows strict architectural principles:
- **Separation of Concerns**: Modular design with focused responsibilities
- **Constants and Enums**: Centralized configuration
- **Readability**: Self-documenting code with clear naming
- **Extensibility**: Designed for future growth
- **Refactorability**: Minimal coupling between components


@ -123,7 +123,7 @@ def main():
songs = load_songs(args.input)
# Initialize components
matcher = SongMatcher(config)
matcher = SongMatcher(config, data_dir)
reporter = ReportGenerator(config)
print("\nStarting song analysis...")


@ -21,17 +21,32 @@ from utils import (
find_mp3_pairs
)
try:
from preferences import PreferencesManager
PREFERENCES_AVAILABLE = True
except ImportError:
PREFERENCES_AVAILABLE = False
class SongMatcher:
"""Handles song matching and deduplication logic."""
def __init__(self, config: Dict[str, Any]):
def __init__(self, config: Dict[str, Any], data_dir: str = "../data"):
self.config = config
self.channel_priorities = config.get('channel_priorities', [])
self.case_sensitive = config.get('matching', {}).get('case_sensitive', False)
self.fuzzy_matching = config.get('matching', {}).get('fuzzy_matching', False)
self.fuzzy_threshold = config.get('matching', {}).get('fuzzy_threshold', 0.8)
# Initialize preferences manager
if PREFERENCES_AVAILABLE:
self.preferences_manager = PreferencesManager(data_dir)
if self.preferences_manager.has_preferences():
print(f"Using {self.preferences_manager.get_preference_count()} user priority preferences")
else:
self.preferences_manager = None
print("Warning: Preferences module not available")
# Warn if fuzzy matching is enabled but not available
if self.fuzzy_matching and not FUZZY_AVAILABLE:
print("Warning: Fuzzy matching is enabled but fuzzywuzzy is not installed.")
@ -174,11 +189,21 @@ class SongMatcher:
except ValueError:
return len(self.channel_priorities) # Lowest priority if channel not in config
def select_best_song(self, songs: List[Dict[str, Any]]) -> Tuple[Dict[str, Any], List[Dict[str, Any]]]:
def select_best_song(self, songs: List[Dict[str, Any]], artist: str = None, title: str = None) -> Tuple[Dict[str, Any], List[Dict[str, Any]]]:
"""Select the best song from a group of duplicates and return the rest as skips."""
if len(songs) == 1:
return songs[0], []
# Check for user priority preferences first
if self.preferences_manager and artist and title:
priority_order = self.preferences_manager.get_priority_order(artist, title)
if priority_order:
# Apply user preferences to reorder songs
songs = self.preferences_manager.apply_priority_order(songs, artist, title)
# Return the first song (highest priority) and the rest as skips
return songs[0], songs[1:]
# Fall back to default logic if no preferences
# Group songs into MP3 pairs and standalone files
grouped = find_mp3_pairs(songs)
@ -228,6 +253,9 @@ class SongMatcher:
}
for group_key, group_songs in groups.items():
# Parse the group key to get artist and title
artist, title = group_key.split('|', 1)
# Count file types
for song in group_songs:
ext = get_file_extension(song['path'])
@ -239,7 +267,7 @@ class SongMatcher:
stats['channel_breakdown'][channel] += 1
# Select best song and mark others for skipping
best_song, group_skips = self.select_best_song(group_songs)
best_song, group_skips = self.select_best_song(group_songs, artist, title)
best_songs.append(best_song)
if group_skips:

98
cli/preferences.py Normal file

@ -0,0 +1,98 @@
#!/usr/bin/env python3
"""
Preferences Manager for Karaoke Song Library Cleanup Tool
Handles loading and applying user priority preferences for file selection.
"""

import json
import os
from typing import Dict, List, Optional, Tuple
from pathlib import Path


class PreferencesManager:
    """Manages user priority preferences for file selection."""

    def __init__(self, data_dir: str = "../data"):
        self.data_dir = Path(data_dir)
        self.preferences_dir = self.data_dir / "preferences"
        self.preferences_file = self.preferences_dir / "priority_preferences.json"
        self._preferences: Dict[str, List[str]] = {}
        self._load_preferences()

    def _load_preferences(self) -> None:
        """Load priority preferences from file."""
        try:
            if self.preferences_file.exists():
                with open(self.preferences_file, 'r', encoding='utf-8') as f:
                    self._preferences = json.load(f)
                print(f"Loaded {len(self._preferences)} priority preferences")
            else:
                self._preferences = {}
                print("No priority preferences found")
        except Exception as e:
            print(f"Warning: Could not load priority preferences: {e}")
            self._preferences = {}

    def get_priority_order(self, artist: str, title: str) -> Optional[List[str]]:
        """Get priority order for a specific song."""
        song_key = f"{artist} - {title}"
        return self._preferences.get(song_key)

    def has_preferences(self) -> bool:
        """Check if any preferences exist."""
        return len(self._preferences) > 0

    def get_preference_count(self) -> int:
        """Get the number of songs with preferences."""
        return len(self._preferences)

    def apply_priority_order(self, files: List[Dict], artist: str, title: str) -> List[Dict]:
        """
        Apply user priority preferences to reorder files.

        Args:
            files: List of file dictionaries with 'path' key
            artist: Song artist
            title: Song title

        Returns:
            Reordered list of files based on user preferences
        """
        priority_order = self.get_priority_order(artist, title)
        if not priority_order:
            return files

        # Create a mapping of path to file
        file_map = {file['path']: file for file in files}

        # Reorder files based on priority
        reordered_files = []
        used_paths = set()

        # Add files in priority order
        for path in priority_order:
            if path in file_map:
                reordered_files.append(file_map[path])
                used_paths.add(path)

        # Add any remaining files that weren't in the priority list
        for file in files:
            if file['path'] not in used_paths:
                reordered_files.append(file)

        return reordered_files

    def get_preferences_summary(self) -> Dict:
        """Get a summary of current preferences."""
        return {
            'total_preferences': len(self._preferences),
            'songs_with_preferences': list(self._preferences.keys()),
            'preferences_file': str(self.preferences_file) if self.preferences_file.exists() else None
        }


def create_preferences_manager(data_dir: str = "../data") -> PreferencesManager:
    """Factory function to create a preferences manager."""
    return PreferencesManager(data_dir)


@ -33,7 +33,7 @@ def test_basic_functionality():
print(f"Testing with sample of {len(sample_songs)} songs...")
# Initialize components
matcher = SongMatcher(config)
matcher = SongMatcher(config, data_dir)
reporter = ReportGenerator(config)
# Process sample


@ -437,5 +437,98 @@ def download_mp3_songs():
download_name='mp3SongList.json'
)
@app.route('/api/save-priority-preferences', methods=['POST'])
def save_priority_preferences():
    """API endpoint to save user priority preferences."""
    try:
        data = request.get_json()
        priority_changes = data.get('priority_changes', {})

        if not priority_changes:
            return jsonify({'error': 'No priority changes provided'}), 400

        # Create preferences directory if it doesn't exist
        preferences_dir = os.path.join(DATA_DIR, 'preferences')
        os.makedirs(preferences_dir, exist_ok=True)

        # Load existing preferences
        preferences_file = os.path.join(preferences_dir, 'priority_preferences.json')
        existing_preferences = {}
        if os.path.exists(preferences_file):
            with open(preferences_file, 'r', encoding='utf-8') as f:
                existing_preferences = json.load(f)

        # Update with new preferences
        existing_preferences.update(priority_changes)

        # Save updated preferences
        with open(preferences_file, 'w', encoding='utf-8') as f:
            json.dump(existing_preferences, f, indent=2, ensure_ascii=False)

        # Create backup
        backup_path = os.path.join(preferences_dir, f'priority_preferences_backup_{datetime.now().strftime("%Y%m%d_%H%M%S")}.json')
        with open(backup_path, 'w', encoding='utf-8') as f:
            json.dump(existing_preferences, f, indent=2, ensure_ascii=False)

        return jsonify({
            'success': True,
            'message': f'Saved {len(priority_changes)} priority preferences. Backup created at: {backup_path}',
            'total_preferences': len(existing_preferences)
        })
    except Exception as e:
        return jsonify({'error': f'Error saving priority preferences: {str(e)}'}), 500


@app.route('/api/reset-priority-preferences', methods=['POST'])
def reset_priority_preferences():
    """API endpoint to reset all priority preferences."""
    try:
        preferences_dir = os.path.join(DATA_DIR, 'preferences')
        preferences_file = os.path.join(preferences_dir, 'priority_preferences.json')

        if os.path.exists(preferences_file):
            # Create backup before deletion
            backup_path = os.path.join(preferences_dir, f'priority_preferences_reset_backup_{datetime.now().strftime("%Y%m%d_%H%M%S")}.json')
            import shutil
            shutil.copy2(preferences_file, backup_path)

            # Delete the preferences file
            os.remove(preferences_file)

            return jsonify({
                'success': True,
                'message': f'Priority preferences reset successfully. Backup created at: {backup_path}'
            })
        else:
            return jsonify({
                'success': True,
                'message': 'No priority preferences found to reset'
            })
    except Exception as e:
        return jsonify({'error': f'Error resetting priority preferences: {str(e)}'}), 500
@app.route('/api/load-priority-preferences')
def load_priority_preferences():
"""API endpoint to load current priority preferences."""
try:
preferences_file = os.path.join(DATA_DIR, 'preferences', 'priority_preferences.json')
if os.path.exists(preferences_file):
with open(preferences_file, 'r', encoding='utf-8') as f:
preferences = json.load(f)
return jsonify({
'success': True,
'preferences': preferences
})
else:
return jsonify({
'success': True,
'preferences': {}
})
except Exception as e:
return jsonify({'error': f'Error loading priority preferences: {str(e)}'}), 500
if __name__ == '__main__':
app.run(debug=True, host='0.0.0.0', port=5000)
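
The load, merge, and save sequence in the save endpoint can be exercised without Flask. The sketch below is one possible variation under stated assumptions, not the commit's behaviour: it copies the previous file to a backup before overwriting it (the endpoint above writes its backup after saving), and it writes through a temporary file so that an interrupted write is less likely to leave a truncated JSON file. The function name `save_preferences` is hypothetical.

```python
import json
import os
import shutil
import tempfile
from datetime import datetime


def save_preferences(preferences_dir: str, priority_changes: dict) -> dict:
    """Merge priority_changes into priority_preferences.json and return the merged dict."""
    os.makedirs(preferences_dir, exist_ok=True)
    preferences_file = os.path.join(preferences_dir, "priority_preferences.json")

    existing = {}
    if os.path.exists(preferences_file):
        with open(preferences_file, "r", encoding="utf-8") as f:
            existing = json.load(f)
        # Assumption: back up the previous state before it is overwritten.
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        backup_path = os.path.join(
            preferences_dir, f"priority_preferences_backup_{stamp}.json"
        )
        shutil.copy2(preferences_file, backup_path)

    # Same merge rule as the endpoint: new entries replace old ones per song key.
    existing.update(priority_changes)

    # Write to a temporary file in the same directory, then swap it into place.
    fd, tmp_path = tempfile.mkstemp(dir=preferences_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(existing, f, indent=2, ensure_ascii=False)
    os.replace(tmp_path, preferences_file)
    return existing
```

Because `dict.update` replaces whole values, saving a new order for a song discards that song's previous order instead of merging the two lists.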


@ -6,6 +6,7 @@
<title>Karaoke Duplicate Review - Web UI</title>
<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/css/bootstrap.min.css" rel="stylesheet">
<link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0/css/all.min.css" rel="stylesheet">
<script src="https://cdn.jsdelivr.net/npm/sortablejs@1.15.0/Sortable.min.js"></script>
<style>
.duplicate-card {
border-left: 4px solid #dc3545;
@ -62,6 +63,68 @@
font-size: 0.85rem;
word-break: break-all;
}
/* Drag and Drop Styles */
.sortable-list {
min-height: 50px;
}
.sortable-item {
cursor: grab;
transition: all 0.2s ease;
}
.sortable-item:hover {
transform: translateY(-2px);
box-shadow: 0 4px 8px rgba(0,0,0,0.1);
}
.sortable-item:active {
cursor: grabbing;
}
.sortable-ghost {
opacity: 0.5;
background-color: #e9ecef;
}
.sortable-chosen {
background-color: #fff3cd;
border: 2px dashed #ffc107;
}
.priority-indicator {
position: absolute;
top: 10px;
right: 10px;
background: #007bff;
color: white;
border-radius: 50%;
width: 24px;
height: 24px;
display: flex;
align-items: center;
justify-content: center;
font-size: 12px;
font-weight: bold;
}
.priority-1 { background: #28a745; }
.priority-2 { background: #17a2b8; }
.priority-3 { background: #ffc107; color: #212529; }
.priority-4 { background: #fd7e14; }
.priority-5 { background: #dc3545; }
.drag-handle {
cursor: grab;
color: #6c757d;
margin-right: 8px;
}
.drag-handle:hover {
color: #495057;
}
.priority-info {
background-color: #e7f3ff;
border: 1px solid #b3d9ff;
border-radius: 4px;
padding: 8px;
margin-bottom: 10px;
font-size: 0.9rem;
}
</style>
</head>
<body>
@ -222,6 +285,18 @@
</button>
</div>
</div>
<div class="row mt-2">
<div class="col-md-6">
<button class="btn btn-primary w-100" onclick="savePriorityPreferences()" id="save-priority-btn" disabled>
<i class="fas fa-sort"></i> Save Priority Preferences
</button>
</div>
<div class="col-md-6">
<button class="btn btn-info w-100" onclick="resetPriorityPreferences()">
<i class="fas fa-undo"></i> Reset Priorities
</button>
</div>
</div>
</div>
</div>
</div>
@ -320,11 +395,14 @@
let viewMode = 'all';
let pendingChanges = [];
let allArtists = [];
let priorityChanges = {};
let sortableInstances = [];
// Load data on page load
document.addEventListener('DOMContentLoaded', function() {
loadStats();
loadArtists();
loadPriorityPreferences();
loadDuplicates();
});
@ -558,6 +636,21 @@
}
}
async function loadPriorityPreferences() {
try {
const response = await fetch('/api/load-priority-preferences');
const data = await response.json();
if (data.success) {
priorityChanges = data.preferences;
updatePrioritySaveButton();
}
} catch (error) {
console.error('Error loading priority preferences:', error);
}
}
function changeViewMode() {
viewMode = document.getElementById('view-mode').value;
loadDuplicates(1);
@ -650,6 +743,11 @@
} else {
displayAllSongsView(duplicates);
}
// Initialize sortable for all duplicate groups
setTimeout(() => {
initializeSortable();
}, 100);
}
function displayArtistsView(duplicates) {
@ -697,6 +795,33 @@
// Create a safe ID by replacing special characters
const safeId = `${duplicate.artist} - ${duplicate.title}`.replace(/[^a-zA-Z0-9\s\-]/g, '_');
// Get current priority order for this song
const songKey = `${duplicate.artist} - ${duplicate.title}`;
const currentPriorities = priorityChanges[songKey] || [];
// Create all versions array (kept + skipped)
const allVersions = [
{
path: duplicate.kept_version,
file_type: getFileType(duplicate.kept_version),
channel: extractChannel(duplicate.kept_version),
is_kept: true
},
...duplicate.skipped_versions.map(v => ({...v, is_kept: false}))
];
// Apply current priority order if it exists
if (currentPriorities.length > 0) {
allVersions.sort((a, b) => {
const aIndex = currentPriorities.indexOf(a.path);
const bIndex = currentPriorities.indexOf(b.path);
if (aIndex === -1 && bIndex === -1) return 0;
if (aIndex === -1) return 1;
if (bIndex === -1) return -1;
return aIndex - bIndex;
});
}
return `
<div class="card duplicate-card">
<div class="card-header">
@ -713,40 +838,36 @@
</div>
</div>
<div class="card-body" id="details-${safeId}" style="display: none;">
<!-- Kept Version -->
<div class="row mb-3">
<div class="col">
<h6 class="text-success"><i class="fas fa-check-circle"></i> KEPT VERSION:</h6>
<div class="card kept-version">
<div class="card-body">
<div class="path-text">${duplicate.kept_version}</div>
<span class="badge bg-success file-type-badge">${getFileType(duplicate.kept_version)}</span>
<span class="badge bg-info channel-badge">${extractChannel(duplicate.kept_version)}</span>
</div>
</div>
</div>
<div class="priority-info">
<i class="fas fa-info-circle"></i>
<strong>Drag and drop to reorder file priorities.</strong>
The top file will be kept, others will be skipped.
Click "Save Priority Preferences" to apply these changes for future CLI runs.
</div>
<!-- Skipped Versions -->
<h6 class="text-danger"><i class="fas fa-times-circle"></i> SKIPPED VERSIONS (${duplicate.skipped_versions.length}):</h6>
${duplicate.skipped_versions.map(version => `
<div class="card skipped-version mb-2" data-path="${version.path}">
<div class="card-body">
<div class="d-flex justify-content-between align-items-start">
<div class="flex-grow-1">
<div class="path-text">${version.path}</div>
<span class="badge bg-danger file-type-badge">${version.file_type}</span>
<span class="badge bg-warning channel-badge">${version.channel}</span>
<!-- Sortable Versions List -->
<h6><i class="fas fa-sort"></i> FILE PRIORITIES (Drag to reorder):</h6>
<div class="sortable-list" id="sortable-${safeId}">
${allVersions.map((version, index) => `
<div class="card mb-2 sortable-item ${version.is_kept ? 'kept-version' : 'skipped-version'}"
data-path="${version.path}" data-index="${index}">
<div class="priority-indicator priority-${Math.min(index + 1, 5)}">${index + 1}</div>
<div class="card-body">
<div class="d-flex align-items-start">
<div class="drag-handle">
<i class="fas fa-grip-vertical"></i>
</div>
<div class="flex-grow-1">
<div class="path-text">${version.path}</div>
<span class="badge ${version.is_kept ? 'bg-success' : 'bg-danger'} file-type-badge">${version.file_type}</span>
<span class="badge ${version.is_kept ? 'bg-info' : 'bg-warning'} channel-badge">${version.channel}</span>
${version.is_kept ? '<span class="badge bg-success ms-1">KEPT</span>' : '<span class="badge bg-danger ms-1">SKIPPED</span>'}
</div>
</div>
<button class="btn btn-sm btn-outline-success ms-2"
onclick="toggleKeepFile('${safeId}', '${version.path}', '${duplicate.artist}', '${duplicate.title}', '${duplicate.kept_version}')"
title="Keep this file instead">
<i class="fas fa-check"></i> Keep
</button>
</div>
</div>
</div>
`).join('')}
`).join('')}
</div>
</div>
</div>
`;
@ -793,6 +914,126 @@
button.disabled = false;
}
}
// Priority Management Functions
function initializeSortable() {
// Destroy existing instances
sortableInstances.forEach(instance => instance.destroy());
sortableInstances = [];
// Initialize new sortable instances
document.querySelectorAll('.sortable-list').forEach(list => {
const sortable = Sortable.create(list, {
handle: '.drag-handle',
animation: 150,
ghostClass: 'sortable-ghost',
chosenClass: 'sortable-chosen',
onEnd: function(evt) {
const songKey = getSongKeyFromSortableList(evt.to);
updatePriorityOrder(songKey, evt.to);
updatePriorityIndicators(evt.to);
}
});
sortableInstances.push(sortable);
});
}
function getSongKeyFromSortableList(listElement) {
const detailsElement = listElement.closest('.card-body');
const cardElement = detailsElement.closest('.duplicate-card');
const titleElement = cardElement.querySelector('h6 strong');
return titleElement.textContent;
}
function updatePriorityOrder(songKey, listElement) {
const items = Array.from(listElement.querySelectorAll('.sortable-item'));
const newOrder = items.map(item => item.getAttribute('data-path'));
priorityChanges[songKey] = newOrder;
updatePrioritySaveButton();
}
function updatePriorityIndicators(listElement) {
const items = Array.from(listElement.querySelectorAll('.sortable-item'));
items.forEach((item, index) => {
const indicator = item.querySelector('.priority-indicator');
if (indicator) {
indicator.className = `priority-indicator priority-${Math.min(index + 1, 5)}`;
indicator.textContent = index + 1;
}
});
}
function updatePrioritySaveButton() {
const saveBtn = document.getElementById('save-priority-btn');
const hasChanges = Object.keys(priorityChanges).length > 0;
if (hasChanges) {
saveBtn.disabled = false;
saveBtn.textContent = `Save Priority Preferences (${Object.keys(priorityChanges).length} songs)`;
} else {
saveBtn.disabled = true;
saveBtn.textContent = 'Save Priority Preferences';
}
}
async function savePriorityPreferences() {
if (Object.keys(priorityChanges).length === 0) {
alert('No priority changes to save');
return;
}
try {
const response = await fetch('/api/save-priority-preferences', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({
priority_changes: priorityChanges
})
});
const result = await response.json();
if (result.success) {
alert(`✅ Priority preferences saved successfully!\n\n${result.message}`);
priorityChanges = {};
updatePrioritySaveButton();
} else {
alert(`❌ Error: ${result.error}`);
}
} catch (error) {
console.error('Error saving priority preferences:', error);
alert('❌ Error saving priority preferences');
}
}
async function resetPriorityPreferences() {
if (confirm('Are you sure you want to reset all priority preferences? This will restore the default priority order.')) {
try {
const response = await fetch('/api/reset-priority-preferences', {
method: 'POST'
});
const result = await response.json();
if (result.success) {
alert('✅ Priority preferences reset successfully!');
priorityChanges = {};
updatePrioritySaveButton();
loadDuplicates(); // Refresh the display
} else {
alert(`❌ Error: ${result.error}`);
}
} catch (error) {
console.error('Error resetting priority preferences:', error);
alert('❌ Error resetting priority preferences');
}
}
}
</script>
</body>
</html>