# Karaoke Song Library Cleanup Tool
A powerful command-line tool for analyzing, deduplicating, and cleaning up large karaoke song collections. The tool identifies duplicate songs across different formats (MP3, MP4) and generates a "skip list" for future imports, helping you maintain a clean and organized karaoke library.
## 🎯 Features
- **Smart Duplicate Detection**: Identifies duplicate songs by artist and title
- **CDG/MP3 Pairing Logic**: Automatically pairs CDG and MP3 files with the same base filename as single karaoke song units (CDG files are treated as MP3)
- **Multi-Format Support**: Handles MP3 and MP4 files with an intelligent priority system
- **Channel Priority System**: Configurable priority for MP4 channels based on folder names in file paths
- **Non-Destructive**: Only generates skip lists - never deletes or moves files
- **Detailed Reporting**: Comprehensive statistics and analysis reports
- **Flexible Configuration**: Customizable matching rules and output options
- **Performance Optimized**: Handles large libraries (37,000+ songs) efficiently
- **Future-Ready**: Designed for easy expansion to a web UI
## 📁 Project Structure
```
KaraokeMerge/
├── data/
│   ├── allSongs.json       # Input: Your song library data
│   └── skipSongs.json      # Output: Generated skip list
├── config/
│   └── config.json         # Configuration settings
├── cli/
│   ├── main.py             # Main CLI application
│   ├── matching.py         # Song matching logic
│   ├── report.py           # Report generation
│   └── utils.py            # Utility functions
├── PRD.md                  # Product Requirements Document
└── README.md               # This file
```
## 🚀 Quick Start
### Prerequisites
- Python 3.7 or higher
- Your karaoke song data in JSON format (see Data Format section)
### Installation
1. Clone or download this repository
2. Navigate to the project directory
3. Ensure your `data/allSongs.json` file is in place
### Basic Usage
```bash
# Run with default settings
python cli/main.py
# Enable verbose output
python cli/main.py --verbose
# Dry run (analyze without generating skip list)
python cli/main.py --dry-run
# Save detailed reports
python cli/main.py --save-reports
```
### Command Line Options
| Option | Description | Default |
|--------|-------------|---------|
| `--config` | Path to configuration file | `../config/config.json` |
| `--input` | Path to input songs file | `../data/allSongs.json` |
| `--output-dir` | Directory for output files | `../data` |
| `--verbose, -v` | Enable verbose output | `False` |
| `--dry-run` | Analyze without generating skip list | `False` |
| `--save-reports` | Save detailed reports to files | `False` |
| `--show-config` | Show current configuration and exit | `False` |
## 📊 Data Format
### Input Format (`allSongs.json`)
Your song data should be a JSON array with objects containing at least these fields:
```json
[
  {
    "artist": "ACDC",
    "title": "Shot In The Dark",
    "path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4",
    "guid": "8946008c-7acc-d187-60e6-5286e55ad502",
    "disabled": false,
    "favorite": false
  }
]
```
### Output Format (`skipSongs.json`)
The tool generates a skip list with this structure:
```json
[
  {
    "path": "z://MP4\\ACDC - Shot In The Dark (Instrumental).mp4",
    "reason": "duplicate",
    "artist": "ACDC",
    "title": "Shot In The Dark",
    "kept_version": "z://MP4\\Sing King Karaoke\\ACDC - Shot In The Dark (Karaoke Version).mp4"
  }
]
```
**Skip List Features:**
- **Metadata**: Each skip entry includes artist, title, and the path of the kept version
- **Reason Tracking**: Documents why each file was marked for skipping
- **Complete Information**: Provides full context for manual review if needed
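A skip list is only useful if a later import consults it. The sketch below shows one way an import script might do that, assuming entries shaped like the example above. The function names `skip_paths_from_entries`, `load_skip_paths`, and `filter_import` are hypothetical and are not part of this tool.

```python
import json


def skip_paths_from_entries(entries):
    """Collect the 'path' value of every skip-list entry into a set."""
    return {entry["path"] for entry in entries}


def load_skip_paths(skip_file="data/skipSongs.json"):
    """Read a skip list file (a JSON array of entries) and return its paths."""
    with open(skip_file, encoding="utf-8") as handle:
        return skip_paths_from_entries(json.load(handle))


def filter_import(candidate_paths, skip_paths):
    """Keep only the candidate paths that are not on the skip list."""
    return [path for path in candidate_paths if path not in skip_paths]
```

Paths are compared as exact strings here, so the import source would need to use the same path format as `allSongs.json`.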
## ⚙️ Configuration
Edit `config/config.json` to customize the tool's behavior:
### Channel Priorities (MP4 files)
```json
{
  "channel_priorities": [
    "Sing King Karaoke",
    "KaraFun Karaoke",
    "Stingray Karaoke"
  ]
}
```
**Note**: Channel priorities are folder names matched against the song's `path` property. The tool searches for these exact folder names within the file path to determine priority, and names earlier in the list rank higher.
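As a rough illustration of the folder-name lookup described in the note, the following sketch splits a path on both `/` and `\` and checks each folder against the configured list. The function name `channel_rank` is hypothetical, and the real logic in `matching.py` may differ (for example in how it treats case).

```python
import re

CHANNEL_PRIORITIES = ["Sing King Karaoke", "KaraFun Karaoke", "Stingray Karaoke"]


def channel_rank(path, priorities=CHANNEL_PRIORITIES):
    """Return the priority index of the first configured channel that appears
    as a folder in `path`, or None when no configured folder is present."""
    # Split on both separator styles; the last component is the file name.
    folders = re.split(r"[\\/]+", path)[:-1]
    for rank, channel in enumerate(priorities):
        if channel in folders:
            return rank
    return None
```

A lower index means a higher priority, and `None` marks a file with no configured channel folder in its path.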
### Matching Settings
```json
{
  "matching": {
    "fuzzy_matching": false,
    "fuzzy_threshold": 0.8,
    "case_sensitive": false
  }
}
```
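To make the three matching settings concrete, here is a minimal sketch of how they could interact. It uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure. This is an assumption: the README does not say which algorithm the tool uses, and `normalize` and `is_same_song` are hypothetical names.

```python
from difflib import SequenceMatcher


def normalize(text, case_sensitive=False):
    """Collapse repeated whitespace and, unless case_sensitive, lowercase."""
    text = " ".join(text.split())
    return text if case_sensitive else text.lower()


def is_same_song(a, b, fuzzy_matching=False, fuzzy_threshold=0.8,
                 case_sensitive=False):
    """Compare two song dicts by their artist and title."""
    key_a = f"{normalize(a['artist'], case_sensitive)} - {normalize(a['title'], case_sensitive)}"
    key_b = f"{normalize(b['artist'], case_sensitive)} - {normalize(b['title'], case_sensitive)}"
    if key_a == key_b:
        return True
    if not fuzzy_matching:
        return False
    # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones.
    return SequenceMatcher(None, key_a, key_b).ratio() >= fuzzy_threshold
```

With the defaults shown in the config, only exact matches (ignoring case) count as duplicates. Enabling `fuzzy_matching` also accepts near matches such as small typos, at the cost of possible false positives.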
### Output Settings
```json
{
  "output": {
    "verbose": false,
    "include_reasons": true,
    "max_duplicates_per_song": 10
  }
}
```
## 📈 Understanding the Output
### Summary Report
- **Total songs processed**: Total number of songs analyzed
- **Unique songs found**: Number of unique artist-title combinations
- **Duplicates identified**: Number of duplicate songs found
- **File type breakdown**: Distribution across MP3, CDG, MP4 formats
- **Channel breakdown**: MP4 channel distribution (if applicable)
### Skip List
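The generated `skipSongs.json` contains paths to files that should be skipped during future imports. Each entry includes:
- `path`: File path to skip
- `reason`: Why the file was marked for skipping (usually "duplicate")
- `artist` and `title`: The song the file was matched to
- `kept_version`: Path of the version that was kept instead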
## 🔧 Advanced Features
### Multi-Artist Handling
The tool automatically handles songs with multiple artists using various delimiters:
- `feat.`, `ft.`, `featuring`
- `&`, `and`
- `,`, `;`, `/`
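One way to split an artist string on the delimiters listed above is a single regular expression. This is a sketch only: the name `split_artists` is hypothetical, and the tool's actual parsing may apply extra rules.

```python
import re

# Word delimiters must be surrounded by whitespace so that names such as
# "Anderson" are not split on the embedded "and".
_DELIMITERS = re.compile(
    r"\s+(?:feat\.?|ft\.?|featuring|and)\s+|\s*[&,;/]\s*",
    flags=re.IGNORECASE,
)


def split_artists(artist):
    """Split an artist string into a list of individual artist names."""
    return [part.strip() for part in _DELIMITERS.split(artist) if part.strip()]
```

A plain split like this also breaks up names that contain a delimiter, such as `AC/DC`, so a real implementation would likely need a list of exceptions.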
### File Type Priority System
The tool uses a sophisticated priority system to select the best version of each song:
1. **MP4 files are always preferred** when available
   - Searches for configured folder names within the file path
   - Sorts by configured priority order (first in list = highest priority)
   - Keeps the highest priority MP4 version
2. **CDG/MP3 pairs** are treated as single units
   - Automatically pairs CDG and MP3 files with the same base filename
   - Example: `song.cdg` + `song.mp3` = one complete karaoke song
   - Only considered if no MP4 files exist for the same artist/title
3. **Standalone files** are lowest priority
   - Standalone MP3 files (without matching CDG)
   - Standalone CDG files (without matching MP3)
4. **Manual review candidates**
   - Songs without matching folder names in channel priorities
   - Ambiguous cases requiring human decision
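The ordering above can be expressed as a sort key. The sketch below assumes each version of a song is a dict with a hypothetical `kind` field (`"mp4"`, `"cdg_mp3_pair"`, or anything else for standalone files) and a `path`. Neither the field nor the function names come from the tool itself.

```python
import re


def version_sort_key(version, channel_priorities):
    """Build a tuple where smaller values mean a more preferred version."""
    # Tier 0: MP4, tier 1: CDG/MP3 pair, tier 2: standalone files.
    tier = {"mp4": 0, "cdg_mp3_pair": 1}.get(version["kind"], 2)
    folders = re.split(r"[\\/]+", version["path"])[:-1]
    ranks = [i for i, name in enumerate(channel_priorities) if name in folders]
    # Files with no configured channel folder sort after all configured ones.
    channel_rank = ranks[0] if ranks else len(channel_priorities)
    return (tier, channel_rank)


def choose_version(versions, channel_priorities):
    """Return (kept, skipped) for all versions of one artist/title."""
    ordered = sorted(versions, key=lambda v: version_sort_key(v, channel_priorities))
    return ordered[0], ordered[1:]
```

In this sketch, MP4 files without a configured channel folder still rank above CDG/MP3 pairs. Flagging them for manual review, as the last item in the list describes, would be a separate step.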
### CDG/MP3 Pairing Logic
The tool automatically identifies and pairs CDG/MP3 files:
- **Base filename matching**: Files with identical names but different extensions
- **Single unit treatment**: Paired files are considered one complete karaoke song
- **Accurate duplicate detection**: Prevents treating paired files as separate duplicates
- **Proper priority handling**: Ensures a paired song is ranked as one complete unit when compared with MP4 versions
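Base-filename pairing can be sketched by grouping paths on everything before the extension. The function name `pair_cdg_mp3` is hypothetical. Comparing base names without regard to case is also an assumption, not documented behavior.

```python
import re
from collections import defaultdict

_CDG_OR_MP3 = re.compile(r"^(.*)\.(cdg|mp3)$", flags=re.IGNORECASE)


def pair_cdg_mp3(paths):
    """Split paths into (cdg, mp3) pairs and unpaired CDG/MP3 files.

    Paths with other extensions (for example .mp4) are ignored.
    """
    by_base = defaultdict(dict)
    for path in paths:
        match = _CDG_OR_MP3.match(path)
        if match:
            base, extension = match.group(1).lower(), match.group(2).lower()
            by_base[base][extension] = path
    pairs, standalone = [], []
    for files in by_base.values():
        if len(files) == 2:
            pairs.append((files["cdg"], files["mp3"]))
        else:
            standalone.extend(files.values())
    return pairs, standalone
```

Because the base includes the folder part of the path, two files only pair up when they sit in the same directory.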
### Enhanced Analysis & Reporting
Use `--save-reports` to generate comprehensive analysis files:
**📊 Enhanced Reports:**
- `enhanced_summary_report.txt`: Comprehensive analysis with detailed statistics
- `channel_optimization_report.txt`: Channel priority optimization suggestions
- `duplicate_pattern_report.txt`: Duplicate pattern analysis by artist, title, and channel
- `actionable_insights_report.txt`: Recommendations and actionable insights
- `analysis_data.json`: Raw analysis data for further processing
**📋 Legacy Reports:**
- `summary_report.txt`: Basic overall statistics
- `duplicate_details.txt`: Detailed duplicate analysis (verbose mode only)
- `skip_list_summary.txt`: Skip list breakdown
- `skip_songs_detailed.json`: Full skip data with metadata
**🔍 Analysis Features:**
- **Pattern Analysis**: Identifies most duplicated artists, titles, and channels
- **Channel Optimization**: Suggests optimal channel priority order based on effectiveness
- **Storage Insights**: Quantifies space savings potential and duplicate distribution
- **Actionable Recommendations**: Provides specific suggestions for library optimization
## 🛠️ Development
### Project Structure for Expansion
The codebase is designed for easy expansion:
- **Modular Design**: Separate modules for matching, reporting, and utilities
- **Configuration-Driven**: Easy to modify behavior without code changes
- **Web UI Ready**: Structure supports future web interface development
### Adding New Features
1. **New File Formats**: Add extensions to `config.json`
2. **New Matching Rules**: Extend `SongMatcher` class in `matching.py`
3. **New Reports**: Add methods to `ReportGenerator` class
4. **Web UI**: Build on existing CLI structure
## 🎯 Current Status
### ✅ **Completed Features**
- **Core CLI Tool**: Fully functional with comprehensive duplicate detection
- **CDG/MP3 Pairing**: Intelligent pairing logic for accurate karaoke song handling
- **Channel Priority System**: Configurable MP4 channel priorities based on folder names
- **Skip List Generation**: Complete skip list with metadata and reasoning
- **Performance Optimization**: Handles large libraries (37,000+ songs) efficiently
- **Enhanced Analysis & Reporting**: Comprehensive statistical analysis with actionable insights
- **Pattern Analysis**: Skip list pattern analysis and channel optimization suggestions
### 🚀 **Ready for Use**
The tool is production-ready and has successfully processed a large karaoke library:
- Generated a skip list of 10,998 unique duplicate files (after removing 1,426 repeated entries)
- Identified a 33.6% duplicate rate with significant space savings potential
- Provided complete metadata for informed decision-making
- Fixed a bug that produced repeated entries in the generated skip list
## 🔮 Future Roadmap
### Phase 2: Enhanced Analysis & Reporting ✅
- ✅ Generate detailed analysis reports (`--save-reports` functionality)
- ✅ Analyze MP4 files without channel priorities to suggest new folder names
- ✅ Create comprehensive duplicate analysis reports
- ✅ Add statistical insights and trends
- ✅ Pattern analysis and channel optimization suggestions
### Phase 3: Web Interface
- Interactive table/grid for duplicate review
- Embedded media player for preview
- Bulk actions and manual overrides
- Real-time configuration editing
- Manual review interface for ambiguous cases
### Phase 4: Advanced Features
- Audio fingerprinting for better duplicate detection
- Integration with karaoke software APIs
- Batch processing and automation
- Advanced fuzzy matching algorithms
## 🤝 Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request
## 📝 License
This project is open source. Feel free to use, modify, and distribute according to your needs.
## 🆘 Troubleshooting
### Common Issues
**"File not found" errors**
- Ensure `data/allSongs.json` exists and is readable
- Check file paths in your song data
**"Invalid JSON" errors**
- Validate your JSON syntax using an online validator
- Check for missing commas or brackets
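As an alternative to an online validator, Python's standard `json` module reports the line and column of the first syntax error. The sketch below also checks for a few fields. Treating `artist`, `title`, and `path` as the required set is an assumption, and `check_song_json` is a hypothetical helper, not part of the CLI.

```python
import json

# Assumed minimum set of fields; adjust to match what the tool actually needs.
REQUIRED_FIELDS = ("artist", "title", "path")


def check_song_json(text):
    """Return a list of human-readable problems found in the JSON text."""
    try:
        songs = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"line {err.lineno}, column {err.colno}: {err.msg}"]
    if not isinstance(songs, list):
        return ["top-level value must be a JSON array"]
    problems = []
    for index, song in enumerate(songs):
        if not isinstance(song, dict):
            problems.append(f"entry {index}: expected an object")
            continue
        missing = [field for field in REQUIRED_FIELDS if field not in song]
        if missing:
            problems.append(f"entry {index}: missing {', '.join(missing)}")
    return problems
```

To check a real library, read `data/allSongs.json` as UTF-8 text and pass its contents to the function. An empty list means no problems were found.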
**Memory issues with large libraries**
- The tool processes songs in batches and has been tested on a 37,015-song library
- Run with `--dry-run` first to test the analysis without generating a skip list
### Getting Help
1. Check the configuration with `python cli/main.py --show-config`
2. Run with `--verbose` for detailed output
3. Use `--dry-run` to test without generating files
## 📊 Performance & Results
The tool is optimized for large karaoke libraries and has been tested with real-world data:
### **Performance Optimizations:**
- **Memory Efficient**: Processes songs in batches
- **Fast Matching**: Optimized algorithms for duplicate detection
- **Progress Indicators**: Real-time feedback for large operations
- **Scalable**: Designed to handle libraries with 100,000+ songs (tested on a 37,015-song library)
### **Real-World Results:**
- **Successfully processed**: 37,015 songs
- **Duplicate detection**: 12,424 duplicates identified (33.6% duplicate rate)
- **File type distribution**: 45.8% MP3, 71.8% MP4 (some songs have multiple formats)
- **Channel analysis**: 14,698 MP4s with defined priorities, 11,881 without
- **Processing time**: Optimized for large datasets with progress tracking
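The figures reported in this README are consistent with each other, which the short calculation below checks. It assumes the MP4 percentage is the share of all processed songs and that the 1,426 repeated skip entries were removed from the 12,424 identified duplicates. Both readings are inferred from the numbers, not stated outright.

```python
total_songs = 37_015
duplicates = 12_424
repeated_skip_entries = 1_426
mp4_with_priority = 14_698
mp4_without_priority = 11_881

# Duplicate rate as a percentage of all processed songs.
duplicate_rate = round(100 * duplicates / total_songs, 1)

# Unique files left on the skip list once repeated entries are removed.
unique_skip_files = duplicates - repeated_skip_entries

# All MP4 files, with or without a configured channel folder.
mp4_total = mp4_with_priority + mp4_without_priority
mp4_share = round(100 * mp4_total / total_songs, 1)

print(duplicate_rate, unique_skip_files, mp4_total, mp4_share)
```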
### **Space Savings Potential:**
- **Significant storage optimization** through intelligent duplicate removal
- **Quality preservation** by keeping highest priority versions
- **Complete metadata** for informed decision-making
---
**Happy karaoke organizing! 🎤🎵**